Skip to content

Appendix A — SQL Reference

This appendix is reference material for SQL constructs that appear repeatedly in the tutorial’s analytical queries. The lessons use these constructs as needed and link here for the full treatment. You can read this appendix front to back, dip in for a refresher, or treat it as a glossary.

Common Table Expressions

A Common Table Expression — CTE — is a named query result defined inline using the WITH keyword. The CTE only exists for the duration of the statement; nothing persists.

The basic shape:

1
2
3
4
5
WITH cte_name AS (
  SELECT ...
)
SELECT ...
FROM cte_name;

The WITH clause defines a named query. The name then acts like a table in the main query that follows. You can read it as: “first, compute this; then, do this with it.”

Why CTEs matter

The most common use is breaking a complex query into steps that read top-down. Without CTEs, multi-step queries get expressed as nested subqueries — readable up to a point, awkward beyond it.

Compare:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
-- Without CTE: nested subquery in FROM
SELECT u.name, AVG(travel.distance_m) AS avg_distance
FROM (
  SELECT assigned_to_id,
         ST_Distance(location, LAG(location) OVER (...)) AS distance_m
  FROM jobs
  WHERE ...
) AS travel
JOIN users u ON u.id = travel.assigned_to_id
GROUP BY u.id, u.name;
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
-- With CTE: linear top-down structure
WITH travel AS (
  SELECT assigned_to_id,
         ST_Distance(location, LAG(location) OVER (...)) AS distance_m
  FROM jobs
  WHERE ...
)
SELECT u.name, AVG(t.distance_m) AS avg_distance
FROM travel t
JOIN users u ON u.id = t.assigned_to_id
GROUP BY u.id, u.name;

Same query, same plan, same performance. The CTE version reads as a narrative; the subquery version reads inside-out. For two-line queries the difference is cosmetic. For non-trivial analytical queries, CTEs are dramatically more readable.

When CTEs are necessary, not just nice

Some operations don’t compose into a single SELECT. The classic case is mixing window functions with aggregates:

  • Window functions (LAG, RANK, ROW_NUMBER) operate on un-aggregated rows
  • Aggregates (SUM, COUNT, AVG) collapse rows into groups

You can’t write SUM(LAG(value) OVER (...)) directly in a meaningful way. A CTE separates the steps:

1
2
3
4
5
WITH stage_one AS (
  SELECT col, LAG(col) OVER (...) AS previous
  FROM ...
)
SELECT SUM(previous) FROM stage_one;

The CTE produces one row per source row with the windowed value attached; the outer query aggregates. Without the CTE you’d need a subquery in the same shape, just less readable.

This pattern shows up repeatedly in analytical work: stage the per-row computation in a CTE, aggregate or summarise in the outer query.

Multiple CTEs

A statement can chain multiple CTEs, separated by commas. Later CTEs can reference earlier ones:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
WITH 
  active_jobs AS (
    SELECT * FROM jobs 
    WHERE status IN ('pending', 'scheduled', 'in_progress')
  ),
  per_sa AS (
    SELECT service_area_id, COUNT(*) AS job_count
    FROM active_jobs
    GROUP BY service_area_id
  )
SELECT sa.name, p.job_count
FROM per_sa p
JOIN service_areas sa ON sa.id = p.service_area_id
ORDER BY p.job_count DESC;

Each CTE is a logical step. The query reads top-down through the pipeline.

Recursive CTEs

A more advanced form, marked WITH RECURSIVE, computes results that reference themselves. The classic use cases are:

  • Tree traversal: walking a parent-child hierarchy (organisation charts, threaded comments, file system trees)
  • Graph traversal: finding paths in network data
  • Generated sequences: producing a series of dates, numbers, or rows on the fly

A recursive CTE has two parts joined by UNION ALL:

1
2
3
4
5
6
7
8
WITH RECURSIVE counter AS (
  SELECT 1 AS n                -- anchor: the starting case
  UNION ALL
  SELECT n + 1 FROM counter    -- recursive: builds on previous
  WHERE n < 10
)
SELECT n FROM counter;
-- Returns: 1, 2, 3, ..., 10

The anchor produces the initial row(s); the recursive part builds on them; PostgreSQL keeps applying the recursive part until no new rows are produced.

The tutorial doesn’t use recursive CTEs — the data model doesn’t have hierarchies that need them. But if you encounter parent-child structures (categories nested inside categories, employees reporting to managers, comments replying to comments), recursive CTEs are how SQL traverses them.

Performance notes

In PostgreSQL 12 and later, CTEs are usually inlined into the main query plan automatically — the planner treats them like subqueries, free to optimise across the boundary. Performance is generally identical to the subquery equivalent.

Earlier PostgreSQL versions (and some other databases) treat CTEs as optimisation fences: each CTE is materialised separately, then the main query runs against the materialised result. This was sometimes slower than an inlined subquery would have been.

If you ever need to explicitly control inlining behaviour in modern PostgreSQL, the syntax is WITH cte_name AS [NOT] MATERIALIZED (...). NOT MATERIALIZED forces inlining; MATERIALIZED forces a separate materialisation step. The default (no keyword) lets the planner choose, which is usually the right answer.

For analytical work, write CTEs naturally and trust the planner. Reach for the MATERIALIZED hint only if EXPLAIN ANALYZE shows the planner making a measurably wrong choice.

Window functions

Window functions operate on a set of rows related to the current row, producing a value per row without collapsing the result set (unlike GROUP BY, which produces one row per group).

The general syntax:

1
2
3
4
function_name() OVER (
  PARTITION BY column_list
  ORDER BY column_list
)
  • PARTITION BY divides rows into groups (like GROUP BY but without aggregating)
  • ORDER BY orders rows within each partition
  • function_name() operates over the ordered partition

Both PARTITION BY and ORDER BY are optional. With neither, the window covers the entire result set. With just ORDER BY, the window covers the entire result set in the given order. With both, the window covers each partition independently in the given order.

Common window functions

LAG(column [, offset [, default]]) — returns a column’s value from a previous row in the ordered partition.

1
2
3
SELECT id, started_at,
       LAG(started_at) OVER (PARTITION BY user_id ORDER BY started_at) AS previous_start
FROM jobs;

The first argument is the column. The second (optional, defaults to

  1. is how many rows back. The third (optional) is the value to return when there is no previous row (instead of NULL).

LEAD(column [, offset [, default]]) — like LAG but looks forward instead of backward. Symmetric: LEAD(col, 1) from row N returns the same value as LAG(col, 1) from row N+1.

RANK() — assigns a rank to each row within its partition, based on the ORDER BY. Tied rows get the same rank; the next rank skips the tied count.

1
2
3
SELECT name, score,
       RANK() OVER (PARTITION BY category ORDER BY score DESC) AS rank
FROM contestants;

If two contestants tie for first, both get rank 1, and the next contestant gets rank 3.

DENSE_RANK() — like RANK, but doesn’t skip ranks after ties. Two contestants tie for first, both get rank 1, next gets rank 2.

ROW_NUMBER() — assigns a unique sequential number to each row within its partition. Tied rows get different row numbers based on arbitrary internal order. Useful for “give me the first row per group” queries.

1
2
3
4
5
6
WITH ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
  FROM activity_log
)
SELECT * FROM ranked WHERE rn = 1;  -- most recent activity per user

FIRST_VALUE(column) and LAST_VALUE(column) — return the first or last value within the window. Note that LAST_VALUE requires careful frame specification to work intuitively (by default, the window frame ends at the current row, not the partition end).

Aggregate functions as window functionsSUM, AVG, COUNT, MAX, MIN, etc. can all be used with OVER (...). They produce a running aggregate value for each row, rather than collapsing rows into a single result.

1
2
3
SELECT date, amount,
       SUM(amount) OVER (PARTITION BY user_id ORDER BY date) AS running_total
FROM transactions;

Each row gets the running sum of all transactions for that user up to and including that date.

Window functions vs aggregates

A common confusion. The same function can act as a window function or an aggregate, depending on whether it’s followed by OVER (...).

1
2
3
4
5
-- Aggregate: collapses rows, one result per group
SELECT user_id, SUM(amount) FROM transactions GROUP BY user_id;

-- Window function: preserves rows, adds a per-row sum
SELECT user_id, amount, SUM(amount) OVER (PARTITION BY user_id) FROM transactions;

The first returns one row per user with their total. The second returns every transaction with the user’s total alongside it.

Ordering and frames

The ORDER BY inside an OVER clause matters for two reasons:

  1. It defines the order in which LAG, LEAD, RANK, etc. compute their results. Different ORDER BY produces different “previous row,” “next row,” “rank.”

  2. It affects the implicit window frame for aggregate window functions. With ORDER BY, the default frame is “from start of partition to current row” — producing the running aggregate pattern. Without ORDER BY, the frame is the entire partition.

Custom frames are defined with ROWS or RANGE clauses (ROWS BETWEEN 2 PRECEDING AND CURRENT ROW, etc.). Useful for moving averages and similar patterns. Outside the scope of most analytical work; reach for them when needed.

Aggregate FILTER clause

The FILTER clause lets you apply a condition to an aggregate function, counting or summing only the rows that match:

1
2
3
4
5
6
SELECT 
  COUNT(*) AS total_jobs,
  COUNT(*) FILTER (WHERE status = 'complete') AS completed_jobs,
  COUNT(*) FILTER (WHERE priority = 'urgent') AS urgent_jobs,
  AVG(price) FILTER (WHERE status = 'complete') AS avg_completed_price
FROM jobs;

Each aggregate gets its own filter. Without FILTER, you’d need either multiple subqueries or CASE expressions inside the aggregates:

1
2
3
4
5
-- Old-style equivalent without FILTER
SELECT 
  COUNT(*) AS total_jobs,
  COUNT(CASE WHEN status = 'complete' THEN 1 END) AS completed_jobs,
  ...

The FILTER form is cleaner and more explicit. It’s a SQL standard feature supported in PostgreSQL since 9.4.

FILTER works with any aggregate — COUNT, SUM, AVG, MIN, MAX, even custom aggregates. You can also combine it with window functions:

1
COUNT(*) FILTER (WHERE status = 'complete') OVER (PARTITION BY service_area_id)

A useful pattern for calculating multiple metrics from the same underlying rows in a single pass.

CASE expressions

CASE is SQL’s conditional expression — the equivalent of if/else in procedural languages. Two forms.

Simple CASE

Compares one expression against multiple values:

1
2
3
4
5
6
CASE status
  WHEN 'pending'     THEN 'Awaiting'
  WHEN 'in_progress' THEN 'Active'
  WHEN 'complete'    THEN 'Done'
  ELSE 'Other'
END

Returns the first matching THEN value. ELSE is optional; without it, non-matching rows return NULL.

Useful for translating one set of values into another — status codes to labels, priority strings to numeric weights, etc.

Searched CASE

Each WHEN clause is its own boolean expression:

1
2
3
4
5
6
CASE 
  WHEN priority = 'urgent' AND created_at < NOW() - INTERVAL '1 hour' THEN 'Overdue'
  WHEN priority = 'urgent' THEN 'Urgent'
  WHEN created_at < NOW() - INTERVAL '24 hours' THEN 'Stale'
  ELSE 'Normal'
END

More flexible than the simple form. Each condition can involve different columns and operators. Returns the first matching THEN value.

Common uses

Conditional aggregation (largely superseded by FILTER but still common):

1
SUM(CASE WHEN status = 'complete' THEN amount ELSE 0 END) AS completed_amount

Computed columns:

1
2
3
4
5
6
7
SELECT name,
       CASE 
         WHEN salary > 100000 THEN 'high'
         WHEN salary > 50000  THEN 'mid'
         ELSE 'entry'
       END AS salary_band
FROM employees;

Mapping enum-like values in interval contexts (priorities to SLA targets, for example) — the SLA performance query in Module 6 uses this pattern.

COALESCE and NULLIF

Two small but recurring helpers worth knowing.

COALESCE

COALESCE(a, b, c, ...) returns the first non-null argument. If all arguments are null, it returns null.

1
2
SELECT COALESCE(nickname, first_name, 'Unknown') AS display_name
FROM users;

For each row, returns nickname if set; otherwise first_name if set; otherwise the literal 'Unknown'.

Common uses:

  • Default values for nullable columns: COALESCE(notes, '') so string concatenation doesn’t blow up
  • Fallback chains: try the most specific value first, fall back to less specific
  • Aggregate defaults: COALESCE(SUM(amount), 0) so empty result sets return 0 instead of NULL

NULLIF

NULLIF(a, b) returns NULL if a = b, otherwise returns a. The inverse-ish of COALESCE.

1
SELECT NULLIF(description, '') AS description FROM jobs;

Returns description unless it’s an empty string, in which case returns NULL.

Most useful for:

  • Treating empty strings as NULL for cleaner downstream logic
  • Avoiding division by zero: amount / NULLIF(divisor, 0) returns NULL when divisor is 0, instead of erroring
  • Combined with COALESCE for “if blank, use default”: COALESCE(NULLIF(description, ''), 'Service call')

The combination is common enough that you’ll see it often in production code:

1
COALESCE(NULLIF(j.description, ''), 'Service call')

“If description is null or empty, use ‘Service call’.”

When to use what

A small mental cheat sheet:

  • Need to stage a query in steps for readability? → CTE
  • Need to compute per-row values that look at other rows? → Window function
  • Need to aggregate only some rows?FILTER clause
  • Need to map values to other values?CASE
  • Need to handle NULLs gracefully?COALESCE / NULLIF

These constructs compose. A real analytical query might use a CTE to stage a window function calculation, an aggregate with a FILTER to summarise it, and a CASE expression to bucket the results into categories. The combination is what makes SQL powerful for analysis.