Ruby Field Notes

Lesson 3 — Spatial joins and when indexes earn their keep

Every query so far has used relational filtering — WHERE service_area_id = ?, WHERE depot_id = ?. The database joined through B-tree indexes; PostGIS sat there politely without being asked to do anything spatial.

This lesson asks PostGIS its first real spatial question: for each service area, count the jobs inside it. That requires ST_Contains — geometric intersection between each SA’s polygon and each job’s point. The database has to compute geometric containment, not just match foreign keys.

But there’s a subtlety the lesson is going to spend most of its time on. Whether the GIST index helps depends on how broad your report is. A “current state” report, where the active jobs are already a small slice of the data, runs fine without spatial indexing — there’s so little to scan that brute force works. A “total history” report, asking against all 500K jobs with no other filter, needs the GIST index or it grinds.

We’ll write three reports — each progressively broader — and watch the GIST index shift from “doesn’t help” to “essential.” By the end, the reader will understand not just how spatial indexes work but when they matter.

The reports we’ll build

Each is a real dispatch question:

Current state — pending, scheduled, in-progress jobs by SA. What’s happening right now? About 14k matching rows.
This year vs last year — jobs by SA filtered by date range, completed or otherwise. About 200-250k matching rows.
Total history — every job ever, by SA. All 500K rows.

Same SQL shape — LEFT JOIN ... ON ST_Contains(...) aggregated by SA. Just different WHERE clauses, expressing different operational questions.

The shape of the query

Before any code, the SQL all three queries share:

1
2
3
4
5
6


SELECT sa.id, sa.code, sa.name, COUNT(j.id) AS job_count
FROM service_areas sa
LEFT JOIN jobs j ON ST_Contains(sa.boundary, j.location)
                  AND <some condition on jobs>
GROUP BY sa.id, sa.code, sa.name
ORDER BY job_count DESC

The variable bit is <some condition on jobs>. That’s where the report’s scope is expressed.

A few things worth knowing about this shape.

LEFT JOIN, not JOIN. SAs with zero jobs still appear in the result with a count of 0. That matters for downstream visualisation — empty SAs need to render as “no activity” rather than dropping out.

The join condition is ST_Contains(sa.boundary, j.location). PostGIS evaluates this for each candidate (sa, j) pair. The question is how the database finds candidate pairs — and that’s where index choice becomes interesting.

The job-side condition lives in the JOIN, not WHERE. Putting AND j.status IN (...) in the join condition rather than a WHERE clause matters for LEFT JOIN semantics — it filters which jobs match an SA, but doesn’t filter SAs themselves out of the result. With the same predicate in a WHERE clause, SAs with zero matching jobs would be dropped (because the LEFT JOIN would produce a NULL row that the WHERE then filters out).

The service object

Build all three variants in a single module. app/services/sa_density.rb:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38


module SaDensity
  extend self

  ACTIVE_STATUSES = %w[pending scheduled in_progress].freeze

  # Current state: only active jobs (pending, scheduled, in_progress).
  # The slice the dispatcher cares about right now.
  def current_state
    aggregate("AND j.status IN ('pending','scheduled','in_progress')")
  end

  # Last twelve months. Date-bounded, status-agnostic.
  def last_12_months
    aggregate("AND j.created_at >= NOW() - INTERVAL '12 months'")
  end

  # All time. No filter on jobs at all.
  def all_time
    aggregate("")
  end

  private

  def aggregate(jobs_clause)
    rows = ActiveRecord::Base.connection.select_all(<<~SQL).rows
      SELECT sa.id, sa.code, sa.name, COUNT(j.id) AS job_count
      FROM service_areas sa
      LEFT JOIN jobs j ON ST_Contains(sa.boundary, j.location)
                        #{jobs_clause}
      GROUP BY sa.id, sa.code, sa.name
      ORDER BY job_count DESC
    SQL

    rows.map { |id, code, name, count|
      { id: id, code: code, name: name, job_count: count.to_i }
    }
  end
end

Three public methods, one shared private aggregator. Each method passes a different “additional jobs filter” into the LEFT JOIN ... ON ... condition.

The plain SQL string interpolation (#{jobs_clause}) is fine here because the interpolated content is hard-coded inside the service module — no user input, no injection risk. For queries that take parameters from outside, you’d parameterise them properly.

Run all three from console

1
2
3
4
5


require "benchmark"

Benchmark.realtime { SaDensity.current_state }    # ~? seconds
Benchmark.realtime { SaDensity.last_12_months }   # ~? seconds
Benchmark.realtime { SaDensity.all_time }         # ~? seconds

Roughly what to expect with the standard chassis indexes in place:

current_state — about 0.4 seconds. Only ~14k jobs match the status filter; the database can scan them all and check containment quickly.
last_12_months — about 0.4 seconds. Around 200k jobs match the date filter; the spatial index handles the volume cleanly.
all_time — about 0.4 seconds. All 500K jobs, but the GIST index makes the spatial join fast.

All three under half a second. That’s the GIST index doing its work — quietly, invisibly, regardless of how broad the report gets.

Now we’ll see what happens when we take the GIST away.

EXPLAIN ANALYZE — see the plans

Before dropping anything, look at what the database is actually doing for each query. In bin/rails dbconsole:

1
2
3
4
5
6


EXPLAIN ANALYZE
SELECT sa.id, sa.code, sa.name, COUNT(j.id) AS job_count
FROM service_areas sa
LEFT JOIN jobs j ON ST_Contains(sa.boundary, j.location)
                  AND j.status IN ('pending','scheduled','in_progress')
GROUP BY sa.id, sa.code, sa.name;

You’ll see something like:

Nested Loop Left Join  (...)
  Join Filter: st_contains(sa.boundary, j.location)
  Rows Removed by Join Filter: 5,021,268
  ->  Index Scan using service_areas_pkey on service_areas sa  (...)
  ->  Materialize  (...)
        ->  Index Scan using index_jobs_on_status on jobs j  (...)
              Index Cond: ((status)::text = ANY ('{pending,scheduled,in_progress}'::text[]))
              rows=14812
Execution Time: ~400 ms

Note what the planner chose:

index_jobs_on_status — a B-tree, used to filter jobs to the ~14,812 active ones first
Nested Loop with Join Filter: st_contains — for each SA, walk all 14,812 candidates, check containment

No spatial index is in this plan. The status B-tree was so selective that the planner determined “filter first, then brute-force containment” was the cheapest path.

Now run the same EXPLAIN against last_12_months:

1
2
3
4
5
6


EXPLAIN ANALYZE
SELECT sa.id, sa.code, sa.name, COUNT(j.id) AS job_count
FROM service_areas sa
LEFT JOIN jobs j ON ST_Contains(sa.boundary, j.location)
                  AND j.created_at >= NOW() - INTERVAL '12 months'
GROUP BY sa.id, sa.code, sa.name;

The plan changes. The candidate pool is now ~200k jobs instead of 14k, and the planner may use a different strategy — possibly a Bitmap Index Scan on index_jobs_on_location to pre-filter by spatial bucket.

And all_time:

1
2
3
4
5


EXPLAIN ANALYZE
SELECT sa.id, sa.code, sa.name, COUNT(j.id) AS job_count
FROM service_areas sa
LEFT JOIN jobs j ON ST_Contains(sa.boundary, j.location)
GROUP BY sa.id, sa.code, sa.name;

This one almost certainly uses the GIST. With no row-filtering options, the spatial index is the planner’s best chance to narrow the work.

Drop the GIST index

In bin/rails dbconsole:

1

DROP INDEX index_jobs_on_location;

Now run the three benchmarks again:

1
2
3


Benchmark.realtime { SaDensity.current_state }    # ?
Benchmark.realtime { SaDensity.last_12_months }   # ?
Benchmark.realtime { SaDensity.all_time }         # ?

You’ll see something like:

current_state — still about 0.4 seconds. No change.
last_12_months — around 6 seconds. About 15× slower.
all_time — around 11 seconds. About 28× slower.

The pattern: the broader the report, the more the GIST index matters. A query with selective row filters doesn’t miss it. A query without selective filters is crippled without it.

Notice how cleanly the slowdown escalates. Each step in report scope roughly doubles the cost of running without the spatial index. There’s no cliff — just a steady gradient from “the index doesn’t matter” to “the index is essential.” Where on that gradient your application’s queries land determines how much you depend on the GIST.

What the plans look like without the index

EXPLAIN ANALYZE on all_time after the index drop:

1
2
3
4
5


EXPLAIN ANALYZE
SELECT sa.id, sa.code, sa.name, COUNT(j.id) AS job_count
FROM service_areas sa
LEFT JOIN jobs j ON ST_Contains(sa.boundary, j.location)
GROUP BY sa.id, sa.code, sa.name;

Nested Loop Left Join  (...)
  Join Filter: st_contains(sa.boundary, j.location)
  Rows Removed by Join Filter: 169,500,000
  ->  Index Scan using service_areas_pkey on service_areas sa  (...)
  ->  Materialize  (...)
        ->  Seq Scan on jobs j  (...)
              rows=500000
Execution Time: ~11000 ms

Seq Scan on jobs j is the smoking gun. With no spatial index, no status filter, no date filter, the planner has no choice but to read the entire jobs table — for every SA. 500K × 340 = 170 million ST_Contains evaluations. Each one is cheap; doing 170 million of them isn’t.

Restore the index

1

CREATE INDEX index_jobs_on_location ON jobs USING GIST (location);

USING GIST is the magic phrase. Postgres builds the spatial index — takes a few seconds at 500K rows but is a one-time cost.

Re-run all three benchmarks:

1
2
3


Benchmark.realtime { SaDensity.current_state }    # ~0.4s
Benchmark.realtime { SaDensity.last_12_months }   # ~0.4s
Benchmark.realtime { SaDensity.all_time }         # ~0.4s

All three back to fast. The all-time query, which was 11 seconds without the index, is back to under half a second. Same SQL, same data, same answer.

What this lesson is really teaching

The reflexive lesson would be “always have spatial indexes.” That’s true. But the deeper lesson is what the experiment just demonstrated:

Indexes earn their keep when there’s nothing else to filter on. When the application can pre-filter to a small slice of the data (active status, recent dates, a specific depot), the B-tree index on that filter does most of the work; the GIST index becomes optional. When the application asks a broad question — all-time history, every job, every customer — the GIST is the only filtering option the planner has, and it’s indispensable.

This pattern recurs in every database performance question:

The most selective predicate available drives plan choice
An index only matters when the planner can use it and the alternative is significantly worse
“We have an index” is not the same as “we’re using the index”

In practice for the dispatch deck:

Current-state reports (the bulk of dashboard content) run fast because the active-jobs slice is small. They’d run fast even without the GIST.
Trend reports spanning months or years are where the GIST starts to matter. The B-tree on dates helps but not enough for the broadest queries.
All-time reports absolutely need the GIST. Without it, they’re the kind of query that times out at the load balancer.

EXPLAIN ANALYZE as a habit

A lot of database performance work boils down to: don’t guess, measure. EXPLAIN ANALYZE is the measurement.

A few patterns worth recognising in the output:

Seq Scan on a large table — usually a red flag. The planner is reading every row. Either there’s no useful index, or the planner thinks scanning is cheaper than the index (which is sometimes correct for small tables; rarely correct for large ones).
Index Scan or Bitmap Index Scan — the planner is using an index. The line tells you which index.
Rows Removed by Filter or Rows Removed by Join Filter — how many rows the planner read but discarded. High numbers here suggest the index could be more selective, or that there’s a missing predicate.
actual time vs estimated cost — if the actual time is wildly higher than the cost suggested, the planner’s stats are stale. Run ANALYZE jobs (or whichever table) to refresh.

When something feels slow, EXPLAIN ANALYZE is the first step. You don’t need to read the whole tree fluently to find the problem — just look for Seq Scan on big tables, or unexpectedly high actual rows.

Indexes the dispatch deck depends on

A complete list of the spatial-related indexes the chassis ships with:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13


-- B-tree indexes for relational lookups
CREATE INDEX ON service_areas (depot_id);
CREATE INDEX ON jobs (service_area_id);
CREATE INDEX ON jobs (status);
CREATE INDEX ON jobs (created_at);
CREATE INDEX ON jobs (assigned_to_id);
CREATE INDEX ON customers (suburb);

-- GIST indexes for spatial queries
CREATE INDEX ON service_areas USING GIST (boundary);
CREATE INDEX ON customers     USING GIST (location);
CREATE INDEX ON jobs          USING GIST (location);
CREATE INDEX ON depots        USING GIST (location);

Every spatial column gets a GIST. Every foreign key gets a B-tree. Every commonly-filtered column gets a B-tree if it has reasonable cardinality (status: yes, name: usually no). These should be reflexive habit, not deliberation.

Storage cost is real (each index is ~10-30% of the table size typically) but the read performance is overwhelmingly worth it for dispatch-style applications, which are heavy on reads and modest on writes.

Where this leaves us

A real spatial join, three real reports, and a clear sense of when the GIST index matters and when it doesn’t. The query that powers density analysis runs in under a second — fast enough to underpin every reporting visualisation in this module.

The next lesson takes the current_state aggregation and turns it into a choropleth — colouring each SA polygon by its job count, with a continuous-scale legend. Same query result, new presentation layer.

Clustering and spiderfy The choropleth report