Lesson 1 — Distance and projection
Module 6 closed with the dispatch deck producing real operational reports. Module 7 takes a different tack. The focus shifts from visualising what we have to computing new insights from the geographic data already in the database.
Most lessons in this module won’t introduce a new map. Some
won’t even introduce a new view. The substance is what
PostGIS can compute — the answers it gives to the questions
you ask. Sometimes the right output is a sortable table.
Sometimes a chart. Sometimes just exploration in
bin/rails dbconsole.
This first lesson covers what’s probably the single most common source of confusion in PostGIS work: distance.
The naive question “how far apart are these two points?” has a trickier answer than you might expect. Get it wrong, and queries return numbers that are silently nonsensical — Cartesian distance in degrees of latitude, when what you wanted was metres on the surface of the Earth. The answer is a real number; it’s just the wrong real number, and there’s no error message.
This lesson untangles that — when distance is degrees, when it’s metres, why the difference exists, and how to make sure you get what you intended.
A motivating question
A simple question for the dispatch deck: how far is each customer from the Sydney CBD depot?
Customers and depots both have location columns — point
geometries, geocoded to real addresses. No relationships
needed; just two tables and the question of distance between
their points.
In bin/rails dbconsole, the most direct query you’d write
is:
|
|
(d.code is the depot’s short identifier; SYDCBD is the
Sydney CBD depot. Substitute your own depot codes if they
differ.)
Run that. The output looks something like:
customer | distance
-----------------------+-------------------
Alice Anderson | 0.0824931...
Brian Brown | 0.1421054...
Catherine Chen | 0.4156782...
David Davis | 0.0398761...
Eleanor Edwards | 0.0712347...Hmm. The numbers are real but they’re certainly not metres, and they’re not kilometres either. They’re somewhere around 0.04 to 0.4.
Those are degrees.
Why degrees?
To explain, we need a small detour into how PostgreSQL stores spatial data.
The location column on customers is a geometry(Point, 4326). Two parts to that:
Point— the geometry type. A 2D point with x and y coordinates.4326— the SRID, or Spatial Reference Identifier. Identifies the coordinate system the point’s coordinates refer to.
SRID 4326 is WGS84 — the coordinate system used by GPS, by most basemaps on the web, and by the vast majority of geographic data exchanged across the internet. In WGS84, the coordinates of a point are longitude (a number between -180 and 180) and latitude (a number between -90 and 90). Both are angles, measured in degrees, against a reference centred on Earth’s centre of mass.
When ST_Distance is asked to compute distance between two
geometry objects, it does so by treating the coordinates as
Cartesian numbers in the source coordinate system. It uses
ordinary Pythagoras’ theorem:
distance = √((x₁ - x₂)² + (y₁ - y₂)²)Two points at lng 151.2 / lat -33.9 and lng 151.3 / lat -33.85 produce:
√((151.2 - 151.3)² + (-33.9 - -33.85)²)
= √(0.01 + 0.0025)
≈ 0.1118The number is real. It’s the Cartesian distance between two points whose coordinates are interpreted as flat numbers in 2D space. It’s not a meaningful real-world distance.
Two reasons it’s not meaningful:
One degree of longitude is not the same length as one degree of latitude. At the equator, both are about 111km. But longitude lines converge at the poles — at Sydney’s latitude (~33° south), one degree of longitude is about 93km, while one degree of latitude remains 111km. Treating them as equal — which Pythagoras does — distorts the geometry.
The Earth is curved. Pythagoras works on flat planes. For short distances the error is small, but for longer distances it accumulates. A “straight line” on the Earth’s surface is actually a curve (a great circle); ordinary 2D Pythagoras doesn’t account for that curvature.
So the 0.1118 result above isn’t a useful distance for operational purposes. It’s the Cartesian distance in degrees — a kind of mathematical artefact that happens to be a real number but doesn’t translate back to anything physical.
The fix: geography
PostGIS distinguishes two spatial types that look similar but behave differently:
geometry— flat, Cartesian, fast. Operations treat coordinates as plain numbers in 2D space.geography— curved, spheroidal, slower. Operations treat coordinates as positions on the WGS84 spheroid (the shape the Earth approximates).
ST_Distance works on both, with different behaviour:
ST_Distance(geometry, geometry)returns Cartesian distance in the units of the source SRID (degrees, for 4326).ST_Distance(geography, geography)returns metres on the WGS84 spheroid.
To get a meaningful distance answer for our customer-depot
query, we cast each point to geography:
|
|
Output:
customer | distance_m
-----------------------+----------------
Alice Anderson | 9234.518...
Brian Brown | 15842.107...
Catherine Chen | 46916.443...
David Davis | 4421.092...
Eleanor Edwards | 7945.881...Now the numbers are metres — Alice is about 9.2km from Sydney CBD depot, Catherine is about 47km. These match real-world reasoning.
The ::geography cast is the workhorse syntax. It’s
shorthand for CAST(... AS geography). PostGIS sees it,
converts the geometry’s coordinates to spheroid positions
internally, and uses spheroidal mathematics for the
distance computation.
A more useful question
The query above pairs every customer with the Sydney CBD depot. Useful for one depot, but the more interesting operational question is: for each customer, how far is the nearest depot?
That query needs a cross-join (every customer paired with every depot) plus a per-customer aggregation (the minimum distance):
|
|
Output (the 10 customers furthest from any depot):
customer | nearest_depot_km
--------------------------+------------------
Marcus Murphy | 412
Jasmine Johnson | 387
Oliver Olsen | 298
...Customers hundreds of kilometres from any depot. In real operations this would prompt questions — should we open a depot in their region? Are these legitimate remote-area customers we accept higher cost to serve?
A few things worth understanding about this query.
FROM customers c, depots d is a cross-join. Every
customer is paired with every depot — 3,000 customers × 9
depots = 27,000 rows in the intermediate result. PostgreSQL
runs this comfortably. We then aggregate to one row per
customer with GROUP BY c.id.
MIN(ST_Distance(...)) picks the smallest distance for
each customer’s group of 9 depot-pairings. The result is the
nearest-depot distance.
ROUND(... / 1000)::int converts metres to kilometres
and integer-rounds for readability. The ::int cast is
because ROUND on double precision returns double precision by default; explicit casting to int produces the
clean integer output. Without it the result would be
412.0 rather than 412.
ORDER BY ... DESC LIMIT 10 picks the worst cases —
customers furthest from any depot. For dispatch operations
these are the interesting outliers; sorted ascending, you’d
see customers within metres of a depot (mostly Sydney CBD
addresses).
The cost of geography
Spheroidal calculations aren’t free. PostGIS computes distances on the WGS84 spheroid using more sophisticated mathematics than Pythagoras — typically Vincenty’s formula, which iterates to converge on an accurate distance. For a single query this is fast. For aggregation queries that compute distances for thousands of pairs, the cost adds up.
A rough comparison. Run both with EXPLAIN ANALYZE:
|
|
|
|
The geometry version typically runs in 5-10ms for the 27K pairs. The geography version runs in 30-80ms. About 5-8× slower — though both are still fast enough for reporting work.
The difference matters when you’re aggregating millions of distances. A real-world routing problem might compute millions of pairwise distances; the difference between “5 seconds” and “40 seconds” matters there.
For most reporting queries on data of our scale (thousands,
tens of thousands of rows), the cost is negligible and the
correctness benefit is enormous. Default to geography for
distance work; switch to geometry only when you’ve measured
that the speed cost matters.
geography vs geometry for storage
Should the location columns be geography then, instead
of geometry? It’s a real architectural choice with a few
trade-offs.
Storing as geometry(Point, 4326) (what we have):
- Spatial indexes are GIST on geometry — fast for proximity queries (Lesson 2 covers this in depth)
- Many PostGIS functions work only on geometry; geography is a smaller API surface
- Casting to
geographyfor distance is one extra::geographyper query - Coordinates stored as planar values; very small storage footprint
Storing as geography(Point, 4326):
- Distance, length, area calculations are correct by default — no cast needed
- Spatial indexes can also be GIST on geography
- Some operations (buffers, intersections) work differently — geography is for measurement on the spheroid; geometry is for operations in a coordinate plane
- Slightly larger storage per point
The pragmatic answer: store as geometry, cast to
geography for distance work. This is what most production
PostGIS schemas do. The geometry storage is universal; the
cast is a small, explicit, intentional act. The alternative
(storing as geography) creates surprises when you need
geometry-only operations — buffers, intersections, etc. —
and have to cast back.
A small reporting query
Pulling this together into something useful. A query that shows, for each depot, how many customers it’s the nearest depot for, and the average distance for those customers:
|
|
Output:
depot_name | nearest_customer_count | avg_km | max_km
----------------+------------------------+--------+--------
Sydney CBD | 824 | 9.8 | 42.6
Melbourne | 512 | 16.4 | 78.2
Brisbane | 387 | 23.1 | 127.6
Perth | 298 | 62.3 | 287.4
Adelaide | 192 | 34.5 | 121.6
Newcastle | 142 | 18.7 | 58.4
Canberra | 98 | 29.4 | 89.7
Hobart | 78 | 31.4 | 84.2
Darwin | 56 | 89.2 | 187.4A real piece of operational intelligence. Each row says “this many customers have this depot as their nearest, and the typical / maximum distance to that depot.” Sydney CBD serves the most customers as nearest depot, with an average distance of just under 10km. Perth’s customers average 62km from their nearest depot — geographically much more spread, reflecting Western Australia’s lower population density. Darwin customers are 89km on average.
The WITH clause (a Common Table Expression) computes a
named intermediate result we can query. Inside, RANK() OVER (PARTITION BY c.id ORDER BY distance) gives each customer-
depot pair a rank from 1 (nearest depot) to 9 (furthest).
Filtering WHERE rk = 1 keeps just the nearest pairing per
customer. Then we group by depot and aggregate.
This is the kind of query the dispatch deck could expose
as a report. We won’t build that page in this lesson — the
substance here is the geography cast and the SRID
conversation. But the query above is exactly what would feed
such a page.
Putting it in a service object
For completeness, the same logic as a Ruby service:
|
|
Same pattern as the service objects from Module 6. The
::geography cast is in the SQL; the Ruby just shapes the
result. A controller could render this; a Phlex component
could format it as a table. We’ll come back to building the
view in later lessons when we have more depth to display.
What this introduced
Several things that recur throughout serious PostGIS work:
SRIDs as identifiers. Every spatial object has one. It specifies what its coordinates mean. SRID 4326 (WGS84) is the lingua franca; most data you encounter uses it.
geometry vs geography types. The same data, two
different mathematical regimes. Geometry treats coordinates
as flat 2D; geography treats them as positions on the
spheroid. Operations on each return different results, often
in different units.
The ::geography cast as the standard distance idiom.
Not casting silently produces wrong answers. The cast is
explicit, deliberate, and a tell-tale sign in code that
spatial intent has been considered.
Performance trade-offs are real but usually negligible at operational scale. Geography distance is 5-8× slower than geometry distance. For thousands of pairs, both are fast. For millions of pairs, the difference matters.
Storage as geometry, computation as geography. The production-PostGIS norm. Geometry indexes are smaller and serve the broadest range of queries; explicit casts to geography happen only where measurement is the question.
Window functions for per-row ranking. RANK() OVER (PARTITION BY ... ORDER BY ...) is the SQL pattern for
“give each row a rank within its group.” Used here for
nearest-neighbour selection; a useful pattern to keep in
your toolkit.
Where this leaves us
Lesson 1 establishes the foundational distinction every PostGIS user needs to understand. With it in place, the next lesson takes the same data and asks a different shape of question: find me everything within Nkm of a given point.
That query uses ST_DWithin — a function that’s interesting
both for its API (clean, declarative) and for its performance
characteristics. ST_DWithin uses spatial indexes; the
“compute all distances and filter” approach we used here
doesn’t. For the customer-depot query, both are fast (small
data). For “all jobs within 5km of any FO assignment area”
the difference between them is several orders of magnitude.