Skip to content

Lesson 1 — Distance and projection

Module 6 closed with the dispatch deck producing real operational reports. Module 7 takes a different tack. The focus shifts from visualising what we have to computing new insights from the geographic data already in the database.

Most lessons in this module won’t introduce a new map. Some won’t even introduce a new view. The substance is what PostGIS can compute — the answers it gives to the questions you ask. Sometimes the right output is a sortable table. Sometimes a chart. Sometimes just exploration in bin/rails dbconsole.

This first lesson covers what’s probably the single most common source of confusion in PostGIS work: distance.

The naive question “how far apart are these two points?” has a trickier answer than you might expect. Get it wrong, and queries return numbers that are silently nonsensical — Cartesian distance in degrees of latitude, when what you wanted was metres on the surface of the Earth. The answer is a real number; it’s just the wrong real number, and there’s no error message.

This lesson untangles that — when distance is degrees, when it’s metres, why the difference exists, and how to make sure you get what you intended.

A motivating question

A simple question for the dispatch deck: how far is each customer from the Sydney CBD depot?

Customers and depots both have location columns — point geometries, geocoded to real addresses. No relationships needed; just two tables and the question of distance between their points.

In bin/rails dbconsole, the most direct query you’d write is:

1
2
3
4
5
SELECT c.name AS customer,
       ST_Distance(c.location, d.location) AS distance
FROM customers c, depots d
WHERE d.code = 'SYDCBD'
LIMIT 10;

(d.code is the depot’s short identifier; SYDCBD is the Sydney CBD depot. Substitute your own depot codes if they differ.)

Run that. The output looks something like:

       customer        |     distance
-----------------------+-------------------
 Alice Anderson        | 0.0824931...
 Brian Brown           | 0.1421054...
 Catherine Chen        | 0.4156782...
 David Davis           | 0.0398761...
 Eleanor Edwards       | 0.0712347...

Hmm. The numbers are real but they’re certainly not metres, and they’re not kilometres either. They’re somewhere around 0.04 to 0.4.

Those are degrees.

Why degrees?

To explain, we need a small detour into how PostgreSQL stores spatial data.

The location column on customers is a geometry(Point, 4326). Two parts to that:

  • Point — the geometry type. A 2D point with x and y coordinates.
  • 4326 — the SRID, or Spatial Reference Identifier. Identifies the coordinate system the point’s coordinates refer to.

SRID 4326 is WGS84 — the coordinate system used by GPS, by most basemaps on the web, and by the vast majority of geographic data exchanged across the internet. In WGS84, the coordinates of a point are longitude (a number between -180 and 180) and latitude (a number between -90 and 90). Both are angles, measured in degrees, against a reference centred on Earth’s centre of mass.

When ST_Distance is asked to compute distance between two geometry objects, it does so by treating the coordinates as Cartesian numbers in the source coordinate system. It uses ordinary Pythagoras’ theorem:

distance = √((x₁ - x₂)² + (y₁ - y₂)²)

Two points at lng 151.2 / lat -33.9 and lng 151.3 / lat -33.85 produce:

√((151.2 - 151.3)² + (-33.9 - -33.85)²)
= √(0.01 + 0.0025)
≈ 0.1118

The number is real. It’s the Cartesian distance between two points whose coordinates are interpreted as flat numbers in 2D space. It’s not a meaningful real-world distance.

Two reasons it’s not meaningful:

One degree of longitude is not the same length as one degree of latitude. At the equator, both are about 111km. But longitude lines converge at the poles — at Sydney’s latitude (~33° south), one degree of longitude is about 93km, while one degree of latitude remains 111km. Treating them as equal — which Pythagoras does — distorts the geometry.

The Earth is curved. Pythagoras works on flat planes. For short distances the error is small, but for longer distances it accumulates. A “straight line” on the Earth’s surface is actually a curve (a great circle); ordinary 2D Pythagoras doesn’t account for that curvature.

So the 0.1118 result above isn’t a useful distance for operational purposes. It’s the Cartesian distance in degrees — a kind of mathematical artefact that happens to be a real number but doesn’t translate back to anything physical.

The fix: geography

PostGIS distinguishes two spatial types that look similar but behave differently:

  • geometry — flat, Cartesian, fast. Operations treat coordinates as plain numbers in 2D space.
  • geography — curved, spheroidal, slower. Operations treat coordinates as positions on the WGS84 spheroid (the shape the Earth approximates).

ST_Distance works on both, with different behaviour:

  • ST_Distance(geometry, geometry) returns Cartesian distance in the units of the source SRID (degrees, for 4326).
  • ST_Distance(geography, geography) returns metres on the WGS84 spheroid.

To get a meaningful distance answer for our customer-depot query, we cast each point to geography:

1
2
3
4
5
SELECT c.name AS customer,
       ST_Distance(c.location::geography, d.location::geography) AS distance_m
FROM customers c, depots d
WHERE d.code = 'SYDCBD'
LIMIT 10;

Output:

       customer        |   distance_m
-----------------------+----------------
 Alice Anderson        |  9234.518...
 Brian Brown           | 15842.107...
 Catherine Chen        | 46916.443...
 David Davis           |  4421.092...
 Eleanor Edwards       |  7945.881...

Now the numbers are metres — Alice is about 9.2km from Sydney CBD depot, Catherine is about 47km. These match real-world reasoning.

The ::geography cast is the workhorse syntax. It’s shorthand for CAST(... AS geography). PostGIS sees it, converts the geometry’s coordinates to spheroid positions internally, and uses spheroidal mathematics for the distance computation.

A more useful question

The query above pairs every customer with the Sydney CBD depot. Useful for one depot, but the more interesting operational question is: for each customer, how far is the nearest depot?

That query needs a cross-join (every customer paired with every depot) plus a per-customer aggregation (the minimum distance):

1
2
3
4
5
6
SELECT c.name AS customer,
       ROUND(MIN(ST_Distance(c.location::geography, d.location::geography)) / 1000)::int AS nearest_depot_km
FROM customers c, depots d
GROUP BY c.id, c.name
ORDER BY nearest_depot_km DESC
LIMIT 10;

Output (the 10 customers furthest from any depot):

        customer          | nearest_depot_km
--------------------------+------------------
 Marcus Murphy            |              412
 Jasmine Johnson          |              387
 Oliver Olsen             |              298
 ...

Customers hundreds of kilometres from any depot. In real operations this would prompt questions — should we open a depot in their region? Are these legitimate remote-area customers we accept higher cost to serve?

A few things worth understanding about this query.

FROM customers c, depots d is a cross-join. Every customer is paired with every depot — 3,000 customers × 9 depots = 27,000 rows in the intermediate result. PostgreSQL runs this comfortably. We then aggregate to one row per customer with GROUP BY c.id.

MIN(ST_Distance(...)) picks the smallest distance for each customer’s group of 9 depot-pairings. The result is the nearest-depot distance.

ROUND(... / 1000)::int converts metres to kilometres and integer-rounds for readability. The ::int cast is because ROUND on double precision returns double precision by default; explicit casting to int produces the clean integer output. Without it the result would be 412.0 rather than 412.

ORDER BY ... DESC LIMIT 10 picks the worst cases — customers furthest from any depot. For dispatch operations these are the interesting outliers; sorted ascending, you’d see customers within metres of a depot (mostly Sydney CBD addresses).

The cost of geography

Spheroidal calculations aren’t free. PostGIS computes distances on the WGS84 spheroid using more sophisticated mathematics than Pythagoras — typically Vincenty’s formula, which iterates to converge on an accurate distance. For a single query this is fast. For aggregation queries that compute distances for thousands of pairs, the cost adds up.

A rough comparison. Run both with EXPLAIN ANALYZE:

1
2
3
4
EXPLAIN ANALYZE
SELECT MIN(ST_Distance(c.location, d.location))
FROM customers c, depots d
GROUP BY c.id;
1
2
3
4
EXPLAIN ANALYZE
SELECT MIN(ST_Distance(c.location::geography, d.location::geography))
FROM customers c, depots d
GROUP BY c.id;

The geometry version typically runs in 5-10ms for the 27K pairs. The geography version runs in 30-80ms. About 5-8× slower — though both are still fast enough for reporting work.

The difference matters when you’re aggregating millions of distances. A real-world routing problem might compute millions of pairwise distances; the difference between “5 seconds” and “40 seconds” matters there.

For most reporting queries on data of our scale (thousands, tens of thousands of rows), the cost is negligible and the correctness benefit is enormous. Default to geography for distance work; switch to geometry only when you’ve measured that the speed cost matters.

geography vs geometry for storage

Should the location columns be geography then, instead of geometry? It’s a real architectural choice with a few trade-offs.

Storing as geometry(Point, 4326) (what we have):

  • Spatial indexes are GIST on geometry — fast for proximity queries (Lesson 2 covers this in depth)
  • Many PostGIS functions work only on geometry; geography is a smaller API surface
  • Casting to geography for distance is one extra ::geography per query
  • Coordinates stored as planar values; very small storage footprint

Storing as geography(Point, 4326):

  • Distance, length, area calculations are correct by default — no cast needed
  • Spatial indexes can also be GIST on geography
  • Some operations (buffers, intersections) work differently — geography is for measurement on the spheroid; geometry is for operations in a coordinate plane
  • Slightly larger storage per point

The pragmatic answer: store as geometry, cast to geography for distance work. This is what most production PostGIS schemas do. The geometry storage is universal; the cast is a small, explicit, intentional act. The alternative (storing as geography) creates surprises when you need geometry-only operations — buffers, intersections, etc. — and have to cast back.

A small reporting query

Pulling this together into something useful. A query that shows, for each depot, how many customers it’s the nearest depot for, and the average distance for those customers:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
WITH nearest AS (
  SELECT c.id AS customer_id,
         d.id AS depot_id,
         d.name AS depot_name,
         ST_Distance(c.location::geography, d.location::geography) AS distance_m,
         RANK() OVER (
           PARTITION BY c.id
           ORDER BY ST_Distance(c.location::geography, d.location::geography)
         ) AS rk
  FROM customers c, depots d
)
SELECT depot_name,
       COUNT(*) AS nearest_customer_count,
       ROUND(AVG(distance_m / 1000)::numeric, 1) AS avg_km,
       ROUND(MAX(distance_m / 1000)::numeric, 1) AS max_km
FROM nearest
WHERE rk = 1
GROUP BY depot_name
ORDER BY nearest_customer_count DESC;

Output:

   depot_name   | nearest_customer_count | avg_km | max_km
----------------+------------------------+--------+--------
 Sydney CBD     |                    824 |    9.8 |   42.6
 Melbourne      |                    512 |   16.4 |   78.2
 Brisbane       |                    387 |   23.1 |  127.6
 Perth          |                    298 |   62.3 |  287.4
 Adelaide       |                    192 |   34.5 |  121.6
 Newcastle      |                    142 |   18.7 |   58.4
 Canberra       |                     98 |   29.4 |   89.7
 Hobart         |                     78 |   31.4 |   84.2
 Darwin         |                     56 |   89.2 |  187.4

A real piece of operational intelligence. Each row says “this many customers have this depot as their nearest, and the typical / maximum distance to that depot.” Sydney CBD serves the most customers as nearest depot, with an average distance of just under 10km. Perth’s customers average 62km from their nearest depot — geographically much more spread, reflecting Western Australia’s lower population density. Darwin customers are 89km on average.

The WITH clause (a Common Table Expression) computes a named intermediate result we can query. Inside, RANK() OVER (PARTITION BY c.id ORDER BY distance) gives each customer- depot pair a rank from 1 (nearest depot) to 9 (furthest). Filtering WHERE rk = 1 keeps just the nearest pairing per customer. Then we group by depot and aggregate.

This is the kind of query the dispatch deck could expose as a report. We won’t build that page in this lesson — the substance here is the geography cast and the SRID conversation. But the query above is exactly what would feed such a page.

Putting it in a service object

For completeness, the same logic as a Ruby service:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
module DepotProximitySummary
  extend self

  def call
    rows = ActiveRecord::Base.connection.select_all(<<~SQL).rows
      WITH nearest AS (
        SELECT c.id AS customer_id,
               d.id AS depot_id,
               d.name AS depot_name,
               ST_Distance(c.location::geography, d.location::geography) AS distance_m,
               RANK() OVER (
                 PARTITION BY c.id
                 ORDER BY ST_Distance(c.location::geography, d.location::geography)
               ) AS rk
        FROM customers c, depots d
      )
      SELECT depot_name,
             COUNT(*) AS nearest_customer_count,
             ROUND(AVG(distance_m / 1000)::numeric, 1) AS avg_km,
             ROUND(MAX(distance_m / 1000)::numeric, 1) AS max_km
      FROM nearest
      WHERE rk = 1
      GROUP BY depot_name
      ORDER BY nearest_customer_count DESC
    SQL

    rows.map { |name, count, avg, max|
      {
        depot_name:             name,
        nearest_customer_count: count.to_i,
        avg_km:                 avg&.to_f,
        max_km:                 max&.to_f
      }
    }
  end
end

Same pattern as the service objects from Module 6. The ::geography cast is in the SQL; the Ruby just shapes the result. A controller could render this; a Phlex component could format it as a table. We’ll come back to building the view in later lessons when we have more depth to display.

What this introduced

Several things that recur throughout serious PostGIS work:

SRIDs as identifiers. Every spatial object has one. It specifies what its coordinates mean. SRID 4326 (WGS84) is the lingua franca; most data you encounter uses it.

geometry vs geography types. The same data, two different mathematical regimes. Geometry treats coordinates as flat 2D; geography treats them as positions on the spheroid. Operations on each return different results, often in different units.

The ::geography cast as the standard distance idiom. Not casting silently produces wrong answers. The cast is explicit, deliberate, and a tell-tale sign in code that spatial intent has been considered.

Performance trade-offs are real but usually negligible at operational scale. Geography distance is 5-8× slower than geometry distance. For thousands of pairs, both are fast. For millions of pairs, the difference matters.

Storage as geometry, computation as geography. The production-PostGIS norm. Geometry indexes are smaller and serve the broadest range of queries; explicit casts to geography happen only where measurement is the question.

Window functions for per-row ranking. RANK() OVER (PARTITION BY ... ORDER BY ...) is the SQL pattern for “give each row a rank within its group.” Used here for nearest-neighbour selection; a useful pattern to keep in your toolkit.

Where this leaves us

Lesson 1 establishes the foundational distinction every PostGIS user needs to understand. With it in place, the next lesson takes the same data and asks a different shape of question: find me everything within Nkm of a given point.

That query uses ST_DWithin — a function that’s interesting both for its API (clean, declarative) and for its performance characteristics. ST_DWithin uses spatial indexes; the “compute all distances and filter” approach we used here doesn’t. For the customer-depot query, both are fast (small data). For “all jobs within 5km of any FO assignment area” the difference between them is several orders of magnitude.