Appendix A — SQL Reference
This appendix is reference material for SQL constructs that appear repeatedly in the tutorial’s analytical queries. The lessons use these constructs as needed and link here for the full treatment. You can read this appendix front to back, dip in for a refresher, or treat it as a glossary.
Common Table Expressions
A Common Table Expression — CTE — is a named query result
defined inline using the WITH keyword. The CTE only exists for
the duration of the statement; nothing persists.
The basic shape:
|
|
The WITH clause defines a named query. The name then acts like a
table in the main query that follows. You can read it as: “first,
compute this; then, do this with it.”
Why CTEs matter
The most common use is breaking a complex query into steps that read top-down. Without CTEs, multi-step queries get expressed as nested subqueries — readable up to a point, awkward beyond it.
Compare:
|
|
|
|
Same query, same plan, same performance. The CTE version reads as a narrative; the subquery version reads inside-out. For two-line queries the difference is cosmetic. For non-trivial analytical queries, CTEs are dramatically more readable.
When CTEs are necessary, not just nice
Some operations don’t compose into a single SELECT. The classic case is mixing window functions with aggregates:
- Window functions (
LAG,RANK,ROW_NUMBER) operate on un-aggregated rows - Aggregates (
SUM,COUNT,AVG) collapse rows into groups
You can’t write SUM(LAG(value) OVER (...)) directly in a
meaningful way. A CTE separates the steps:
|
|
The CTE produces one row per source row with the windowed value attached; the outer query aggregates. Without the CTE you’d need a subquery in the same shape, just less readable.
This pattern shows up repeatedly in analytical work: stage the per-row computation in a CTE, aggregate or summarise in the outer query.
Multiple CTEs
A statement can chain multiple CTEs, separated by commas. Later CTEs can reference earlier ones:
|
|
Each CTE is a logical step. The query reads top-down through the pipeline.
Recursive CTEs
A more advanced form, marked WITH RECURSIVE, computes results
that reference themselves. The classic use cases are:
- Tree traversal: walking a parent-child hierarchy (organisation charts, threaded comments, file system trees)
- Graph traversal: finding paths in network data
- Generated sequences: producing a series of dates, numbers, or rows on the fly
A recursive CTE has two parts joined by UNION ALL:
|
|
The anchor produces the initial row(s); the recursive part builds on them; PostgreSQL keeps applying the recursive part until no new rows are produced.
The tutorial doesn’t use recursive CTEs — the data model doesn’t have hierarchies that need them. But if you encounter parent-child structures (categories nested inside categories, employees reporting to managers, comments replying to comments), recursive CTEs are how SQL traverses them.
Performance notes
In PostgreSQL 12 and later, CTEs are usually inlined into the main query plan automatically — the planner treats them like subqueries, free to optimise across the boundary. Performance is generally identical to the subquery equivalent.
Earlier PostgreSQL versions (and some other databases) treat CTEs as optimisation fences: each CTE is materialised separately, then the main query runs against the materialised result. This was sometimes slower than an inlined subquery would have been.
If you ever need to explicitly control inlining behaviour in
modern PostgreSQL, the syntax is WITH cte_name AS [NOT] MATERIALIZED (...). NOT MATERIALIZED forces inlining;
MATERIALIZED forces a separate materialisation step. The default
(no keyword) lets the planner choose, which is usually the right
answer.
For analytical work, write CTEs naturally and trust the planner.
Reach for the MATERIALIZED hint only if EXPLAIN ANALYZE shows
the planner making a measurably wrong choice.
Window functions
Window functions operate on a set of rows related to the current
row, producing a value per row without collapsing the result set
(unlike GROUP BY, which produces one row per group).
The general syntax:
|
|
PARTITION BYdivides rows into groups (like GROUP BY but without aggregating)ORDER BYorders rows within each partitionfunction_name()operates over the ordered partition
Both PARTITION BY and ORDER BY are optional. With neither, the
window covers the entire result set. With just ORDER BY, the
window covers the entire result set in the given order. With both,
the window covers each partition independently in the given order.
Common window functions
LAG(column [, offset [, default]]) — returns a column’s value
from a previous row in the ordered partition.
|
|
The first argument is the column. The second (optional, defaults to
- is how many rows back. The third (optional) is the value to return when there is no previous row (instead of NULL).
LEAD(column [, offset [, default]]) — like LAG but looks
forward instead of backward. Symmetric: LEAD(col, 1) from row N
returns the same value as LAG(col, 1) from row N+1.
RANK() — assigns a rank to each row within its partition,
based on the ORDER BY. Tied rows get the same rank; the next rank
skips the tied count.
|
|
If two contestants tie for first, both get rank 1, and the next contestant gets rank 3.
DENSE_RANK() — like RANK, but doesn’t skip ranks after ties.
Two contestants tie for first, both get rank 1, next gets rank 2.
ROW_NUMBER() — assigns a unique sequential number to each row
within its partition. Tied rows get different row numbers based on
arbitrary internal order. Useful for “give me the first row per
group” queries.
|
|
FIRST_VALUE(column) and LAST_VALUE(column) — return the
first or last value within the window. Note that LAST_VALUE
requires careful frame specification to work intuitively (by
default, the window frame ends at the current row, not the partition
end).
Aggregate functions as window functions — SUM, AVG, COUNT,
MAX, MIN, etc. can all be used with OVER (...). They produce
a running aggregate value for each row, rather than collapsing rows
into a single result.
|
|
Each row gets the running sum of all transactions for that user up to and including that date.
Window functions vs aggregates
A common confusion. The same function can act as a window function
or an aggregate, depending on whether it’s followed by OVER (...).
|
|
The first returns one row per user with their total. The second returns every transaction with the user’s total alongside it.
Ordering and frames
The ORDER BY inside an OVER clause matters for two reasons:
-
It defines the order in which
LAG,LEAD,RANK, etc. compute their results. Different ORDER BY produces different “previous row,” “next row,” “rank.” -
It affects the implicit window frame for aggregate window functions. With ORDER BY, the default frame is “from start of partition to current row” — producing the running aggregate pattern. Without ORDER BY, the frame is the entire partition.
Custom frames are defined with ROWS or RANGE clauses
(ROWS BETWEEN 2 PRECEDING AND CURRENT ROW, etc.). Useful for
moving averages and similar patterns. Outside the scope of most
analytical work; reach for them when needed.
Aggregate FILTER clause
The FILTER clause lets you apply a condition to an aggregate
function, counting or summing only the rows that match:
|
|
Each aggregate gets its own filter. Without FILTER, you’d need
either multiple subqueries or CASE expressions inside the
aggregates:
|
|
The FILTER form is cleaner and more explicit. It’s a SQL standard
feature supported in PostgreSQL since 9.4.
FILTER works with any aggregate — COUNT, SUM, AVG, MIN,
MAX, even custom aggregates. You can also combine it with window
functions:
|
|
A useful pattern for calculating multiple metrics from the same underlying rows in a single pass.
CASE expressions
CASE is SQL’s conditional expression — the equivalent of if/else
in procedural languages. Two forms.
Simple CASE
Compares one expression against multiple values:
|
|
Returns the first matching THEN value. ELSE is optional; without
it, non-matching rows return NULL.
Useful for translating one set of values into another — status codes to labels, priority strings to numeric weights, etc.
Searched CASE
Each WHEN clause is its own boolean expression:
|
|
More flexible than the simple form. Each condition can involve
different columns and operators. Returns the first matching
THEN value.
Common uses
Conditional aggregation (largely superseded by FILTER but
still common):
|
|
Computed columns:
|
|
Mapping enum-like values in interval contexts (priorities to SLA targets, for example) — the SLA performance query in Module 6 uses this pattern.
COALESCE and NULLIF
Two small but recurring helpers worth knowing.
COALESCE
COALESCE(a, b, c, ...) returns the first non-null argument. If
all arguments are null, it returns null.
|
|
For each row, returns nickname if set; otherwise first_name if set;
otherwise the literal 'Unknown'.
Common uses:
- Default values for nullable columns:
COALESCE(notes, '')so string concatenation doesn’t blow up - Fallback chains: try the most specific value first, fall back to less specific
- Aggregate defaults:
COALESCE(SUM(amount), 0)so empty result sets return 0 instead of NULL
NULLIF
NULLIF(a, b) returns NULL if a = b, otherwise returns a. The
inverse-ish of COALESCE.
|
|
Returns description unless it’s an empty string, in which case
returns NULL.
Most useful for:
- Treating empty strings as NULL for cleaner downstream logic
- Avoiding division by zero:
amount / NULLIF(divisor, 0)returns NULL when divisor is 0, instead of erroring - Combined with COALESCE for “if blank, use default”:
COALESCE(NULLIF(description, ''), 'Service call')
The combination is common enough that you’ll see it often in production code:
|
|
“If description is null or empty, use ‘Service call’.”
When to use what
A small mental cheat sheet:
- Need to stage a query in steps for readability? → CTE
- Need to compute per-row values that look at other rows? → Window function
- Need to aggregate only some rows? →
FILTERclause - Need to map values to other values? →
CASE - Need to handle NULLs gracefully? →
COALESCE/NULLIF
These constructs compose. A real analytical query might use a CTE to stage a window function calculation, an aggregate with a FILTER to summarise it, and a CASE expression to bucket the results into categories. The combination is what makes SQL powerful for analysis.