Lesson 2 — Indexed proximity
Lesson 1 established how to compute distance correctly. This lesson is about a different question — proximity. Not “how far apart are these?” but “is this within a given distance?” The two questions sound similar but the queries take different shapes, and the performance characteristics are dramatically different at scale.
The function for proximity is ST_DWithin. It returns a
boolean — yes or no, are these two geometries within the
given distance of each other. The crucial thing is that it
can work with PostGIS’s GIST spatial index to eliminate most
candidate pairs without computing actual distances. That’s
the function’s superpower.
But there’s a subtle production gotcha. The pattern from
Lesson 1 — ::geography casts to get distances in metres —
silently disables the index when applied to ST_DWithin.
Code that looks correct can run hundreds of times slower
than it should.
This lesson covers three things:
- The basic syntax of
ST_DWithinand how to think about it - The cast-blocks-index issue — why it happens and how to
recognise it in
EXPLAIN ANALYZE - The hybrid pattern — filter on geometry (indexed), order on geography (precise)
By the end you’ll have the tool that handles the bulk of real-world proximity questions and avoid the trap that catches a lot of teams the first time they write proximity queries at scale.
A — customers near a depot
The simplest possible question. How many customers are within about 5km of the Sydney CBD depot?
In bin/rails dbconsole:
|
|
The third argument to ST_DWithin is a distance. When the
geometries aren’t cast to anything, the distance is in the
units of the source SRID — for SRID 4326, that’s degrees.
0.05 degrees is very roughly 5km in Australia. (More
precisely: 0.05 degrees of latitude is ~5.5km; 0.05 degrees
of longitude at Sydney’s latitude is ~4.6km. We’ll come back
to this.)
Output:
count
-------
324Around 324 customers within roughly 5km. To see who they are, ordered by precise distance:
|
|
Output:
name | suburb | metres
---------------------+------------+--------
Marcus Murphy | Pyrmont | 412
Jasmine Johnson | Ultimo | 687
Oliver Olsen | Haymarket | 824
Patricia Park | Surry Hills| 1142
...Note we used two different functions for two different purposes:
ST_DWithin(c.location, d.location, 0.05)— proximity filter, no cast, uses the spatial indexST_Distance(c.location::geography, d.location::geography)— precise distance computation in metres, used for ordering and display
This combination — geometry filter, geography distance — is the hybrid pattern. Filter wide and indexed; measure precise and accurate. The next sections explain why this matters.
Why ST_DWithin uses the index
A small detour to understand what’s happening under the hood, because the speed advantage isn’t obvious from the syntax.
PostGIS’s spatial index — GIST, in our case
index_customers_on_location — stores each geometry’s
bounding box: the smallest rectangle that contains the
geometry. For a point, the bounding box is just the point
itself. For a polygon, it’s the polygon’s outer extent.
A spatial index can rapidly answer “give me all geometries whose bounding boxes intersect this region.” It can’t directly answer “give me geometries within X distance” — but it can answer the related question “give me geometries whose bounding boxes are within a region expanded by X.”
That’s the trick ST_DWithin uses internally. To find
geometries within X of a query point:
- Expand the query point’s bounding box by X — turning a point into a rectangle of side 2X centred on the point
- Use the spatial index to find candidate geometries whose bounding boxes intersect this expanded rectangle
- For those candidates only, compute the actual distance and check it against X
Step 2 is the index-driven elimination. It’s fast — even with millions of points, the index can produce candidates in milliseconds. Step 3 is the precise check, but it only runs on the small set of candidates that survived step 2.
You can see this in EXPLAIN ANALYZE output. Run:
|
|
Output (relevant parts):
Index Scan using index_customers_on_location on customers b
Index Cond: (location && st_expand(..., 0.05))
Filter: st_dwithin(location, ..., 0.05)
Execution Time: 1.951 msThe Index Cond: (location && st_expand(..., 0.05)) is the
bounding-box trick made literal. PostgreSQL expanded the
query point by 0.05, used && (the bounding-box-intersects
operator) against the GIST index to find candidates, then
applied the precise st_dwithin filter to those candidates.
Compare to the equivalent query using ST_Distance instead:
|
|
Seq Scan on customers b
Filter: (st_distance(location, ...) < '0.05'::double precision)
Rows Removed by Filter: 2493
Execution Time: 0.976 msSequential scan. PostgreSQL reads every row, computes the distance for each, and filters. No index involvement.
For our 2,500 customers the difference is invisible (both finish in milliseconds). For a million-row table it’s catastrophic — sequential scan grows linearly with row count; index scan grows logarithmically. The gap widens with scale.
The difference is the function shape:
ST_DWithin(geom, geom, dist)— a bounding-box predicate. PostgreSQL recognises that “within distance” can be answered by expanding boxes and checking the index.ST_Distance(geom, geom) < dist— a function comparison. PostgreSQL has to evaluate the function on every row before it can compare todist.
Both produce the same answer. The first lets the index do its job; the second doesn’t.
Why the geography cast blocks the index
This is the production gotcha. Lesson 1 taught the
::geography cast as the standard pattern for distance
queries — the cast gives correct metres on the spheroid.
Naturally, you might write ST_DWithin the same way:
|
|
5000 metres — explicit, correct, readable. And the query returns the right answer. But:
|
|
Seq Scan on customers b
Filter: st_dwithin((location)::geography, ..., '5000'::double precision, true)
Rows Removed by Filter: 2492
Execution Time: 3.776 msSequential scan again. No index. Every row gets the
geography cast computed; every row goes through ST_DWithin.
The function call is correct; the index just isn’t engaged.
Why? Our spatial index is on location (the geometry type).
When you write b.location::geography, the index can’t be
used because the predicate is now in geography space. There
are technical reasons in how PostGIS represents the bounding
boxes used by the index — they’re geometry-level — but the
upshot is straightforward: ::geography casts disable the
GIST index for ST_DWithin predicates.
You could build a separate index specifically on
location::geography (a functional index). It’s possible
but adds storage, slows writes, and most production schemas
don’t bother. The hybrid pattern is simpler.
The dramatic example
The performance difference is most visible on a self-join. “Find pairs of customers within 200m of each other” runs through 2,500 × 2,500 / 2 ≈ 3 million candidate pairs.
The geography version:
|
|
Nested Loop ...
Index Scan using customers_pkey on customers b
Index Cond: (id > a.id)
Filter: st_dwithin((a.location)::geography, (location)::geography, '200', true)
Execution Time: 1880.385 msNotice — the only index used is customers_pkey (the
primary key), to handle a.id < b.id. The spatial index
isn’t engaged at all. The query takes about 1.9 seconds.
The geometry version with degree-based distance:
|
|
Nested Loop ...
Index Scan using index_customers_on_location on customers b
Index Cond: (location && st_expand(a.location, '0.002'))
Filter: ((a.id < id) AND st_dwithin(a.location, location, '0.002'))
Execution Time: 29.505 msSame answer (291 pairs in both cases). The spatial index is engaged. Query takes 29ms — about 64× faster.
That’s the pattern playing out in practice. The geography version looks correct (it is) and produces the right answer, but it falls back to row-by-row evaluation. The geometry version uses the index and is dramatically faster.
The 64× ratio at this scale is dramatic. At million-row scale it’s catastrophic — what’s 1.9 seconds here might be 30 minutes there. This gotcha hides until your data grows, then bites hard.
The hybrid pattern in production
The takeaway is the production-pattern recipe:
|
|
Filter with geometry; measure with geography. The filter is fast and indexed; the measurement is precise and correct.
Picking the degree distance. Roughly: 0.001 degrees ≈ 100m, 0.01 degrees ≈ 1km, 0.1 degrees ≈ 10km — these approximations are accurate within ~10% across most populated latitudes. For Australian operations:
| Approximate metres | Degrees |
|---|---|
| 100m | 0.001 |
| 500m | 0.005 |
| 1km | 0.01 |
| 5km | 0.05 |
| 10km | 0.1 |
| 50km | 0.5 |
Use a slightly generous degree distance because of the
asymmetry — at Australian latitudes one degree of longitude
is shorter than one degree of latitude, so a “circle” in
degree space is slightly squished east-west. Picking
0.05 for “within 5km” is fine; the eastern and western
corners of the search box will be at ~4.6km, the northern
and southern corners at ~5.5km. Some genuine 5km matches
might fall outside the box.
If precision matters, the post-filter ordering with
ST_Distance(::geography) catches the precise case. The
indexed filter selects a comfortable superset; the geography
distance prunes to the exact set.
A small service object
Pulling the hybrid pattern into a reusable service.
|
|
A few things worth noting.
The hybrid pattern is in the SQL. Filter clause uses
geometry (ST_DWithin(c.location, d.location, $2)); ORDER BY
uses geography (ST_Distance(c.location::geography, d.location::geography)). The filter engages the index; the
order produces precise metres.
radius_degrees = radius_m / 90_000.0 — the conversion.
90,000 metres per degree is a deliberate underestimate (real
value is ~111,000 m/degree at the equator, less at higher
latitudes). Dividing by a smaller number produces a larger
degree bound, which oversizes the filter. The geography
ordering catches anything genuinely outside the precise
metre radius.
Parameterised query. Inputs go through $1, $2, $3
rather than string interpolation. Clean discipline; insulates
against any accidental SQL injection if the function gets
refactored to take user-supplied input.
Use it from console:
|
|
Same pattern generalises to other proximity queries — “closest depot to a customer,” “nearest job to an FO’s position,” “depot closest to a hypothetical site.” The query shape stays the same; only the data sources change.
What this introduced
Several patterns worth carrying forward:
ST_DWithin for proximity filtering. The function takes
two geometries and a distance, returns boolean. Used as
a WHERE filter.
The bounding-box trick. Spatial indexes work on bounding
boxes. ST_DWithin expands the query geometry’s box by the
distance, looks up candidates by box intersection (the &&
operator), and refines with precise distance. This is how
the index can answer proximity queries without computing
every distance.
The ::geography cast disables the index. A subtle and
important gotcha. The cast is right for measurement (Lesson
- but wrong for filtering. When you see
EXPLAIN ANALYZEshowing a Seq Scan on aST_DWithinquery, the geography cast is almost always why.
The hybrid pattern: filter on geometry, order on
geography. The production-grade idiom. Filter with
ST_DWithin(geom, geom, degrees) for index-friendly
proximity; order with ST_Distance(::geography, ::geography)
for accurate metres. Filter wide and indexed; measure
precise.
Reading EXPLAIN ANALYZE for index engagement. “Index
Scan using index_…” with a bounding-box Index Cond means
the spatial index is doing work. “Seq Scan … Filter:
st_dwithin” means it isn’t. A small habit to develop —
glance at the plan whenever a proximity query feels slow.
Where this leaves us
Lessons 1 and 2 cover the foundational distance and
proximity patterns, with the awareness of when the
::geography cast is right and when it’s wrong. Lesson 3
takes them somewhere more substantial: a real analytical
question that requires combining spatial distance with
another SQL pattern most Rails developers don’t reach for
often — window functions.
The question: how far do field officers travel between consecutive jobs? Each FO completes jobs in some sequence ordered by time; consecutive jobs have locations; the distance between consecutive job locations is travel distance. Aggregating per FO over a time window gives “total distance travelled” — a metric for dispatch efficiency.
The query needs LAG(), a window function that lets each
row see its predecessor in an ordered partition. Combined
with our distance tools, it produces real travel distances.
The lesson explores both the SQL pattern and the operational
question.