Lesson 3 — Spatial joins and when indexes earn their keep
Every query so far has used relational filtering — WHERE service_area_id = ?, WHERE depot_id = ?. The database joined
through B-tree indexes; PostGIS sat there politely without
being asked to do anything spatial.
This lesson asks PostGIS its first real spatial question: for
each service area, count the jobs inside it. That requires
ST_Contains — geometric intersection between each SA’s
polygon and each job’s point. The database has to compute
geometric containment, not just match foreign keys.
But there’s a subtlety the lesson is going to spend most of its time on. Whether the GIST index helps depends on how broad your report is. A “current state” report, where the active jobs are already a small slice of the data, runs fine without spatial indexing — there’s so little to scan that brute force works. A “total history” report, asking against all 500K jobs with no other filter, needs the GIST index or it grinds.
We’ll write three reports — each progressively broader — and watch the GIST index shift from “doesn’t help” to “essential.” By the end, the reader will understand not just how spatial indexes work but when they matter.
The reports we’ll build
Each is a real dispatch question:
- Current state — pending, scheduled, in-progress jobs by SA. What’s happening right now? About 14k matching rows.
- This year vs last year — jobs by SA filtered by date range, completed or otherwise. About 200-250k matching rows.
- Total history — every job ever, by SA. All 500K rows.
Same SQL shape — LEFT JOIN ... ON ST_Contains(...) aggregated
by SA. Just different WHERE clauses, expressing different
operational questions.
The shape of the query
Before any code, the SQL all three queries share:
|
|
The variable bit is <some condition on jobs>. That’s where the
report’s scope is expressed.
A few things worth knowing about this shape.
LEFT JOIN, not JOIN. SAs with zero jobs still appear in
the result with a count of 0. That matters for downstream
visualisation — empty SAs need to render as “no activity”
rather than dropping out.
The join condition is ST_Contains(sa.boundary, j.location).
PostGIS evaluates this for each candidate (sa, j) pair. The
question is how the database finds candidate pairs — and
that’s where index choice becomes interesting.
The job-side condition lives in the JOIN, not WHERE. Putting
AND j.status IN (...) in the join condition rather than a
WHERE clause matters for LEFT JOIN semantics — it filters
which jobs match an SA, but doesn’t filter SAs themselves out
of the result. With the same predicate in a WHERE clause, SAs
with zero matching jobs would be dropped (because the LEFT JOIN
would produce a NULL row that the WHERE then filters out).
The service object
Build all three variants in a single module. app/services/sa_density.rb:
|
|
Three public methods, one shared private aggregator. Each
method passes a different “additional jobs filter” into the
LEFT JOIN ... ON ... condition.
The plain SQL string interpolation (#{jobs_clause}) is fine
here because the interpolated content is hard-coded inside the
service module — no user input, no injection risk. For
queries that take parameters from outside, you’d parameterise
them properly.
Run all three from console
|
|
Roughly what to expect with the standard chassis indexes in place:
current_state— about 0.4 seconds. Only ~14k jobs match the status filter; the database can scan them all and check containment quickly.last_12_months— about 0.4 seconds. Around 200k jobs match the date filter; the spatial index handles the volume cleanly.all_time— about 0.4 seconds. All 500K jobs, but the GIST index makes the spatial join fast.
All three under half a second. That’s the GIST index doing its work — quietly, invisibly, regardless of how broad the report gets.
Now we’ll see what happens when we take the GIST away.
EXPLAIN ANALYZE — see the plans
Before dropping anything, look at what the database is actually
doing for each query. In bin/rails dbconsole:
|
|
You’ll see something like:
Nested Loop Left Join (...)
Join Filter: st_contains(sa.boundary, j.location)
Rows Removed by Join Filter: 5,021,268
-> Index Scan using service_areas_pkey on service_areas sa (...)
-> Materialize (...)
-> Index Scan using index_jobs_on_status on jobs j (...)
Index Cond: ((status)::text = ANY ('{pending,scheduled,in_progress}'::text[]))
rows=14812
Execution Time: ~400 msNote what the planner chose:
index_jobs_on_status— a B-tree, used to filter jobs to the ~14,812 active ones firstNested LoopwithJoin Filter: st_contains— for each SA, walk all 14,812 candidates, check containment
No spatial index is in this plan. The status B-tree was so selective that the planner determined “filter first, then brute-force containment” was the cheapest path.
Now run the same EXPLAIN against last_12_months:
|
|
The plan changes. The candidate pool is now ~200k jobs instead
of 14k, and the planner may use a different strategy — possibly
a Bitmap Index Scan on index_jobs_on_location to pre-filter
by spatial bucket.
And all_time:
|
|
This one almost certainly uses the GIST. With no row-filtering options, the spatial index is the planner’s best chance to narrow the work.
Drop the GIST index
In bin/rails dbconsole:
|
|
Now run the three benchmarks again:
|
|
You’ll see something like:
current_state— still about 0.4 seconds. No change.last_12_months— around 6 seconds. About 15× slower.all_time— around 11 seconds. About 28× slower.
The pattern: the broader the report, the more the GIST index matters. A query with selective row filters doesn’t miss it. A query without selective filters is crippled without it.
Notice how cleanly the slowdown escalates. Each step in report scope roughly doubles the cost of running without the spatial index. There’s no cliff — just a steady gradient from “the index doesn’t matter” to “the index is essential.” Where on that gradient your application’s queries land determines how much you depend on the GIST.
What the plans look like without the index
EXPLAIN ANALYZE on all_time after the index drop:
|
|
Nested Loop Left Join (...)
Join Filter: st_contains(sa.boundary, j.location)
Rows Removed by Join Filter: 169,500,000
-> Index Scan using service_areas_pkey on service_areas sa (...)
-> Materialize (...)
-> Seq Scan on jobs j (...)
rows=500000
Execution Time: ~11000 msSeq Scan on jobs j is the smoking gun. With no spatial
index, no status filter, no date filter, the planner has no
choice but to read the entire jobs table — for every SA. 500K
× 340 = 170 million ST_Contains evaluations. Each one is
cheap; doing 170 million of them isn’t.
Restore the index
|
|
USING GIST is the magic phrase. Postgres builds the spatial
index — takes a few seconds at 500K rows but is a one-time
cost.
Re-run all three benchmarks:
|
|
All three back to fast. The all-time query, which was 11 seconds without the index, is back to under half a second. Same SQL, same data, same answer.
What this lesson is really teaching
The reflexive lesson would be “always have spatial indexes.” That’s true. But the deeper lesson is what the experiment just demonstrated:
Indexes earn their keep when there’s nothing else to filter on. When the application can pre-filter to a small slice of the data (active status, recent dates, a specific depot), the B-tree index on that filter does most of the work; the GIST index becomes optional. When the application asks a broad question — all-time history, every job, every customer — the GIST is the only filtering option the planner has, and it’s indispensable.
This pattern recurs in every database performance question:
- The most selective predicate available drives plan choice
- An index only matters when the planner can use it and the alternative is significantly worse
- “We have an index” is not the same as “we’re using the index”
In practice for the dispatch deck:
- Current-state reports (the bulk of dashboard content) run fast because the active-jobs slice is small. They’d run fast even without the GIST.
- Trend reports spanning months or years are where the GIST starts to matter. The B-tree on dates helps but not enough for the broadest queries.
- All-time reports absolutely need the GIST. Without it, they’re the kind of query that times out at the load balancer.
EXPLAIN ANALYZE as a habit
A lot of database performance work boils down to: don’t guess,
measure. EXPLAIN ANALYZE is the measurement.
A few patterns worth recognising in the output:
Seq Scanon a large table — usually a red flag. The planner is reading every row. Either there’s no useful index, or the planner thinks scanning is cheaper than the index (which is sometimes correct for small tables; rarely correct for large ones).Index ScanorBitmap Index Scan— the planner is using an index. The line tells you which index.Rows Removed by FilterorRows Removed by Join Filter— how many rows the planner read but discarded. High numbers here suggest the index could be more selective, or that there’s a missing predicate.actual timevs estimatedcost— if the actual time is wildly higher than the cost suggested, the planner’s stats are stale. RunANALYZE jobs(or whichever table) to refresh.
When something feels slow, EXPLAIN ANALYZE is the first step.
You don’t need to read the whole tree fluently to find the
problem — just look for Seq Scan on big tables, or
unexpectedly high actual rows.
Indexes the dispatch deck depends on
A complete list of the spatial-related indexes the chassis ships with:
|
|
Every spatial column gets a GIST. Every foreign key gets a B-tree. Every commonly-filtered column gets a B-tree if it has reasonable cardinality (status: yes, name: usually no). These should be reflexive habit, not deliberation.
Storage cost is real (each index is ~10-30% of the table size typically) but the read performance is overwhelmingly worth it for dispatch-style applications, which are heavy on reads and modest on writes.
Where this leaves us
A real spatial join, three real reports, and a clear sense of when the GIST index matters and when it doesn’t. The query that powers density analysis runs in under a second — fast enough to underpin every reporting visualisation in this module.
The next lesson takes the current_state aggregation and turns it into a choropleth — colouring each SA polygon by its job count, with a continuous-scale legend. Same query result, new presentation layer.