Skip to content

Lesson 2 — Indexed proximity

Lesson 1 established how to compute distance correctly. This lesson is about a different question — proximity. Not “how far apart are these?” but “is this within a given distance?” The two questions sound similar but the queries take different shapes, and the performance characteristics are dramatically different at scale.

The function for proximity is ST_DWithin. It returns a boolean — yes or no, are these two geometries within the given distance of each other. The crucial thing is that it can work with PostGIS’s GIST spatial index to eliminate most candidate pairs without computing actual distances. That’s the function’s superpower.

But there’s a subtle production gotcha. The pattern from Lesson 1 — ::geography casts to get distances in metres — silently disables the index when applied to ST_DWithin. Code that looks correct can run hundreds of times slower than it should.

This lesson covers three things:

  • The basic syntax of ST_DWithin and how to think about it
  • The cast-blocks-index issue — why it happens and how to recognise it in EXPLAIN ANALYZE
  • The hybrid pattern — filter on geometry (indexed), order on geography (precise)

By the end you’ll have the tool that handles the bulk of real-world proximity questions and avoid the trap that catches a lot of teams the first time they write proximity queries at scale.

A — customers near a depot

The simplest possible question. How many customers are within about 5km of the Sydney CBD depot?

In bin/rails dbconsole:

1
2
3
4
SELECT COUNT(*)
FROM customers c, depots d
WHERE d.code = 'SYDCBD'
  AND ST_DWithin(c.location, d.location, 0.05);

The third argument to ST_DWithin is a distance. When the geometries aren’t cast to anything, the distance is in the units of the source SRID — for SRID 4326, that’s degrees.

0.05 degrees is very roughly 5km in Australia. (More precisely: 0.05 degrees of latitude is ~5.5km; 0.05 degrees of longitude at Sydney’s latitude is ~4.6km. We’ll come back to this.)

Output:

 count
-------
   324

Around 324 customers within roughly 5km. To see who they are, ordered by precise distance:

1
2
3
4
5
6
7
SELECT c.name, c.suburb,
       ROUND(ST_Distance(c.location::geography, d.location::geography))::int AS metres
FROM customers c, depots d
WHERE d.code = 'SYDCBD'
  AND ST_DWithin(c.location, d.location, 0.05)
ORDER BY metres
LIMIT 10;

Output:

        name         |   suburb   | metres
---------------------+------------+--------
 Marcus Murphy       | Pyrmont    |    412
 Jasmine Johnson     | Ultimo     |    687
 Oliver Olsen        | Haymarket  |    824
 Patricia Park       | Surry Hills|   1142
 ...

Note we used two different functions for two different purposes:

  • ST_DWithin(c.location, d.location, 0.05) — proximity filter, no cast, uses the spatial index
  • ST_Distance(c.location::geography, d.location::geography) — precise distance computation in metres, used for ordering and display

This combination — geometry filter, geography distance — is the hybrid pattern. Filter wide and indexed; measure precise and accurate. The next sections explain why this matters.

Why ST_DWithin uses the index

A small detour to understand what’s happening under the hood, because the speed advantage isn’t obvious from the syntax.

PostGIS’s spatial index — GIST, in our case index_customers_on_location — stores each geometry’s bounding box: the smallest rectangle that contains the geometry. For a point, the bounding box is just the point itself. For a polygon, it’s the polygon’s outer extent.

A spatial index can rapidly answer “give me all geometries whose bounding boxes intersect this region.” It can’t directly answer “give me geometries within X distance” — but it can answer the related question “give me geometries whose bounding boxes are within a region expanded by X.”

That’s the trick ST_DWithin uses internally. To find geometries within X of a query point:

  1. Expand the query point’s bounding box by X — turning a point into a rectangle of side 2X centred on the point
  2. Use the spatial index to find candidate geometries whose bounding boxes intersect this expanded rectangle
  3. For those candidates only, compute the actual distance and check it against X

Step 2 is the index-driven elimination. It’s fast — even with millions of points, the index can produce candidates in milliseconds. Step 3 is the precise check, but it only runs on the small set of candidates that survived step 2.

You can see this in EXPLAIN ANALYZE output. Run:

1
2
3
4
5
6
EXPLAIN ANALYZE
SELECT b.id, b.name
FROM customers b
WHERE ST_DWithin(b.location, 
                 (SELECT location FROM customers WHERE id = 1), 
                 0.05);

Output (relevant parts):

Index Scan using index_customers_on_location on customers b
  Index Cond: (location && st_expand(..., 0.05))
  Filter: st_dwithin(location, ..., 0.05)
Execution Time: 1.951 ms

The Index Cond: (location && st_expand(..., 0.05)) is the bounding-box trick made literal. PostgreSQL expanded the query point by 0.05, used && (the bounding-box-intersects operator) against the GIST index to find candidates, then applied the precise st_dwithin filter to those candidates.

Compare to the equivalent query using ST_Distance instead:

1
2
3
4
5
EXPLAIN ANALYZE
SELECT b.id, b.name
FROM customers b
WHERE ST_Distance(b.location, 
                  (SELECT location FROM customers WHERE id = 1)) < 0.05;
Seq Scan on customers b
  Filter: (st_distance(location, ...) < '0.05'::double precision)
  Rows Removed by Filter: 2493
Execution Time: 0.976 ms

Sequential scan. PostgreSQL reads every row, computes the distance for each, and filters. No index involvement.

For our 2,500 customers the difference is invisible (both finish in milliseconds). For a million-row table it’s catastrophic — sequential scan grows linearly with row count; index scan grows logarithmically. The gap widens with scale.

The difference is the function shape:

  • ST_DWithin(geom, geom, dist) — a bounding-box predicate. PostgreSQL recognises that “within distance” can be answered by expanding boxes and checking the index.
  • ST_Distance(geom, geom) < dist — a function comparison. PostgreSQL has to evaluate the function on every row before it can compare to dist.

Both produce the same answer. The first lets the index do its job; the second doesn’t.

Why the geography cast blocks the index

This is the production gotcha. Lesson 1 taught the ::geography cast as the standard pattern for distance queries — the cast gives correct metres on the spheroid. Naturally, you might write ST_DWithin the same way:

1
2
3
4
5
SELECT b.id, b.name
FROM customers b
WHERE ST_DWithin(b.location::geography, 
                 (SELECT location::geography FROM customers WHERE id = 1), 
                 5000);

5000 metres — explicit, correct, readable. And the query returns the right answer. But:

1
EXPLAIN ANALYZE [the query above]
Seq Scan on customers b
  Filter: st_dwithin((location)::geography, ..., '5000'::double precision, true)
  Rows Removed by Filter: 2492
Execution Time: 3.776 ms

Sequential scan again. No index. Every row gets the geography cast computed; every row goes through ST_DWithin. The function call is correct; the index just isn’t engaged.

Why? Our spatial index is on location (the geometry type). When you write b.location::geography, the index can’t be used because the predicate is now in geography space. There are technical reasons in how PostGIS represents the bounding boxes used by the index — they’re geometry-level — but the upshot is straightforward: ::geography casts disable the GIST index for ST_DWithin predicates.

You could build a separate index specifically on location::geography (a functional index). It’s possible but adds storage, slows writes, and most production schemas don’t bother. The hybrid pattern is simpler.

The dramatic example

The performance difference is most visible on a self-join. “Find pairs of customers within 200m of each other” runs through 2,500 × 2,500 / 2 ≈ 3 million candidate pairs.

The geography version:

1
2
3
4
5
EXPLAIN ANALYZE
SELECT a.id, b.id
FROM customers a, customers b
WHERE a.id < b.id
  AND ST_DWithin(a.location::geography, b.location::geography, 200);
Nested Loop ... 
  Index Scan using customers_pkey on customers b
    Index Cond: (id > a.id)
    Filter: st_dwithin((a.location)::geography, (location)::geography, '200', true)
Execution Time: 1880.385 ms

Notice — the only index used is customers_pkey (the primary key), to handle a.id < b.id. The spatial index isn’t engaged at all. The query takes about 1.9 seconds.

The geometry version with degree-based distance:

1
2
3
4
5
EXPLAIN ANALYZE
SELECT a.id, b.id
FROM customers a, customers b
WHERE a.id < b.id
  AND ST_DWithin(a.location, b.location, 0.002);
Nested Loop ...
  Index Scan using index_customers_on_location on customers b
    Index Cond: (location && st_expand(a.location, '0.002'))
    Filter: ((a.id < id) AND st_dwithin(a.location, location, '0.002'))
Execution Time: 29.505 ms

Same answer (291 pairs in both cases). The spatial index is engaged. Query takes 29ms — about 64× faster.

That’s the pattern playing out in practice. The geography version looks correct (it is) and produces the right answer, but it falls back to row-by-row evaluation. The geometry version uses the index and is dramatically faster.

The 64× ratio at this scale is dramatic. At million-row scale it’s catastrophic — what’s 1.9 seconds here might be 30 minutes there. This gotcha hides until your data grows, then bites hard.

The hybrid pattern in production

The takeaway is the production-pattern recipe:

1
2
3
4
5
-- For proximity filtering: geometry, no cast, degree distance
ST_DWithin(a.location, b.location, X_in_degrees)

-- For precise distance: geography cast, metres
ST_Distance(a.location::geography, b.location::geography)

Filter with geometry; measure with geography. The filter is fast and indexed; the measurement is precise and correct.

Picking the degree distance. Roughly: 0.001 degrees ≈ 100m, 0.01 degrees ≈ 1km, 0.1 degrees ≈ 10km — these approximations are accurate within ~10% across most populated latitudes. For Australian operations:

Approximate metres Degrees
100m 0.001
500m 0.005
1km 0.01
5km 0.05
10km 0.1
50km 0.5

Use a slightly generous degree distance because of the asymmetry — at Australian latitudes one degree of longitude is shorter than one degree of latitude, so a “circle” in degree space is slightly squished east-west. Picking 0.05 for “within 5km” is fine; the eastern and western corners of the search box will be at ~4.6km, the northern and southern corners at ~5.5km. Some genuine 5km matches might fall outside the box.

If precision matters, the post-filter ordering with ST_Distance(::geography) catches the precise case. The indexed filter selects a comfortable superset; the geography distance prunes to the exact set.

A small service object

Pulling the hybrid pattern into a reusable service.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
module ClosestCustomers
  extend self

  # Returns customers nearest to a depot, with distance in metres.
  # Uses the hybrid filter-on-geometry, order-on-geography pattern.
  #
  # The radius_m parameter is a search bound. It's converted to
  # degrees with a generous multiplier to absorb the geometry
  # filter's elliptical shape at non-equatorial latitudes.
  def call(depot, count: 10, radius_m: 50_000)
    # Convert metres to a generous degree bound. 1 degree ≈ 100km
    # at Australian latitudes; we use 90 to oversize the filter.
    radius_degrees = radius_m / 90_000.0

    rows = ActiveRecord::Base.connection.select_all(<<~SQL, "ClosestCustomers", [depot.id, radius_degrees, count]).rows
      SELECT c.id,
             c.name,
             c.suburb,
             ST_Distance(c.location::geography, d.location::geography) AS distance_m
      FROM customers c, depots d
      WHERE d.id = $1
        AND ST_DWithin(c.location, d.location, $2)
      ORDER BY distance_m
      LIMIT $3
    SQL

    rows.map { |id, name, suburb, distance|
      {
        id:         id,
        name:       name,
        suburb:     suburb,
        distance_m: distance.to_f
      }
    }
  end
end

A few things worth noting.

The hybrid pattern is in the SQL. Filter clause uses geometry (ST_DWithin(c.location, d.location, $2)); ORDER BY uses geography (ST_Distance(c.location::geography, d.location::geography)). The filter engages the index; the order produces precise metres.

radius_degrees = radius_m / 90_000.0 — the conversion. 90,000 metres per degree is a deliberate underestimate (real value is ~111,000 m/degree at the equator, less at higher latitudes). Dividing by a smaller number produces a larger degree bound, which oversizes the filter. The geography ordering catches anything genuinely outside the precise metre radius.

Parameterised query. Inputs go through $1, $2, $3 rather than string interpolation. Clean discipline; insulates against any accidental SQL injection if the function gets refactored to take user-supplied input.

Use it from console:

1
2
3
4
5
6
7
8
9
sydney_cbd = Depot.find_by(code: "SYDCBD")
ClosestCustomers.call(sydney_cbd, count: 5)
# => [
#   { id: 1234, name: "Marcus Murphy",   suburb: "Pyrmont",      distance_m: 412.18 },
#   { id: 8901, name: "Jasmine Johnson", suburb: "Ultimo",       distance_m: 687.34 },
#   { id: 5678, name: "Oliver Olsen",    suburb: "Haymarket",    distance_m: 824.56 },
#   { id: 2345, name: "Patricia Park",   suburb: "Surry Hills",  distance_m: 1142.07 },
#   { id: 6789, name: "Quincy Quinn",    suburb: "Chippendale",  distance_m: 1287.43 }
# ]

Same pattern generalises to other proximity queries — “closest depot to a customer,” “nearest job to an FO’s position,” “depot closest to a hypothetical site.” The query shape stays the same; only the data sources change.

What this introduced

Several patterns worth carrying forward:

ST_DWithin for proximity filtering. The function takes two geometries and a distance, returns boolean. Used as a WHERE filter.

The bounding-box trick. Spatial indexes work on bounding boxes. ST_DWithin expands the query geometry’s box by the distance, looks up candidates by box intersection (the && operator), and refines with precise distance. This is how the index can answer proximity queries without computing every distance.

The ::geography cast disables the index. A subtle and important gotcha. The cast is right for measurement (Lesson

  1. but wrong for filtering. When you see EXPLAIN ANALYZE showing a Seq Scan on a ST_DWithin query, the geography cast is almost always why.

The hybrid pattern: filter on geometry, order on geography. The production-grade idiom. Filter with ST_DWithin(geom, geom, degrees) for index-friendly proximity; order with ST_Distance(::geography, ::geography) for accurate metres. Filter wide and indexed; measure precise.

Reading EXPLAIN ANALYZE for index engagement. “Index Scan using index_…” with a bounding-box Index Cond means the spatial index is doing work. “Seq Scan … Filter: st_dwithin” means it isn’t. A small habit to develop — glance at the plan whenever a proximity query feels slow.

Where this leaves us

Lessons 1 and 2 cover the foundational distance and proximity patterns, with the awareness of when the ::geography cast is right and when it’s wrong. Lesson 3 takes them somewhere more substantial: a real analytical question that requires combining spatial distance with another SQL pattern most Rails developers don’t reach for often — window functions.

The question: how far do field officers travel between consecutive jobs? Each FO completes jobs in some sequence ordered by time; consecutive jobs have locations; the distance between consecutive job locations is travel distance. Aggregating per FO over a time window gives “total distance travelled” — a metric for dispatch efficiency.

The query needs LAG(), a window function that lets each row see its predecessor in an ordered partition. Combined with our distance tools, it produces real travel distances. The lesson explores both the SQL pattern and the operational question.