pg_stat_statements: everything it can't

Part one made the core case: pg_stat_statements counts, it doesn't record. It walked through how the queryid jumble fragments one logical query into many rows, how the first-seen text freezes your per-request tags, and how the averages bury the p99 that actually pages you. All of that was about data the extension has and distorts.

This part is about the rest: the entries it silently throws away, the query text that can vanish all at once, the plans and replicas it never records, and the knobs that bite. It ends where part one started, with the question the whole investigation was really about: is this the query store Postgres is missing, or just the floor you'd build one on?

The table fills up and evicts your tail

pg_stat_statements.max defaults to 5000. It's a hard cap on entries, set when the server starts (changing it needs a restart, because the hash table is sized in shared memory up front). When the 5001st distinct shape arrives, Postgres doesn't grow the table. It evicts, throwing out the least-executed entries to make room:

If more distinct statements than that are observed, information about the least-executed statements is discarded.

On a healthy app with a few hundred steady shapes, 5000 is plenty and you never think about it. But remember the row explosion from part one. An ORM that splinters one query into hundreds of shapes, or a pre-18 app building dynamic IN lists, can chew through thousands of entries an hour. Once that starts, the view becomes a sliding window over recent noise. Your steady, important queries get evicted to make room for thousands of one-offs, then rebuilt with fresh counters and a fresh first-seen text when they run again. The stats you were trusting reset themselves, and the view never says a word about it.

One place does say something. The companion view pg_stat_statements_info has exactly two columns, and both matter:

SELECT dealloc, stats_reset FROM pg_stat_statements_info;

 dealloc |          stats_reset
---------+-------------------------------
   18342 | 2026-05-01 03:00:11.482+00

dealloc: Total number of times pg_stat_statements entries about the least-executed statements were deallocated because more distinct statements than pg_stat_statements.max were observed.

When dealloc climbs steadily between resets, your table is too small for how many shapes your workload produces. You're thrashing, and your stats are lossy. The fix is either to raise pg_stat_statements.max and pay the shared memory, or to cut the cardinality at the source by normalizing shapes, moving to PG 18, or switching to array binding. Few people watch dealloc, so a thrashing table quietly drops part of the workload it's meant to report, and nobody notices until the numbers stop adding up.

The second column, stats_reset, is the time of the last global reset of the whole view, and it isn't decoration. It's the validity stamp on every diff you compute. If you snapshot the view hourly to get usable averages, stats_reset tells you whether a given window is even legitimate: if it moved forward between your two snapshots, a reset happened inside the window and every delta across it is meaningless.

Watch dealloc and stats_reset.
A steadily climbing dealloc means the table is too small and your tail is being evicted, so the stats are lossy. A stats_reset that moved between two snapshots means a reset landed inside your window, so any diff across it is meaningless. Almost no monitoring setup checks either.

When the query text vanishes

There's a worse failure mode hiding in the external query-text file. Because representative texts pile up on disk and the file can grow large (long queries, a high max, lots of churn), the extension has a last resort:

pg_stat_statements may choose to discard the query texts, whereupon all existing entries in the pg_stat_statements view will show null query fields, though the statistics associated with each queryid are preserved. If this happens, consider reducing pg_stat_statements.max.

So under pressure it keeps the counters and throws away the text. Your whole view goes query = NULL while the numbers keep marching on, and you're left with a list of anonymous queryids and no idea what they used to be. The docs leave this vague, but the source is specific enough to alert on. Two thresholds are at work.

The first is the garbage-collection trigger. As entries get evicted and rebuilt, their old texts leave dead space in the file. The extension rewrites the file to drop that dead space once it's more than about 50% bloat, specifically when the file on disk grows past mean_query_len × pg_stat_statements.max × 2. With the default max of 5000 and, say, a 1 KB average query, that's a rewrite around the 10 MB mark. That part is normal housekeeping. It's not the failure.

The failure is when that rewrite itself fails, from an I/O error or because it can't allocate room to build the replacement file. At that point the file's contents can't be trusted, so the extension marks every entry's text as invalid (internally, query_len = -1) and the whole query column reads NULL. The counters survive. The texts are gone all at once.

On PostgreSQL 13 and earlier, stats_temp_directory could relocate that file; the GUC was removed in 14.

The practical version: the file lives at $PGDATA/pg_stat_tmp/pgss_query_texts.stat. Alert on its size against mean_query_len × max × 2. If you're routinely near that line you're rewriting the file constantly, and you're one bad write() away from a blank view. Because the trigger scales with your average query length, the cluster most at risk is the one with both lots of entries and very long query texts, which is the giant generated IN-list app all over again. The fix the manual suggests, lowering pg_stat_statements.max, sounds backwards but is right: a smaller table means a smaller file, less bloat to collect, and less to go wrong.

What it never records at all

Everything so far has been about data the extension records but degrades, by normalizing, freezing, averaging, or evicting it. This section is about absence: things pg_stat_statements never captures at all, because they sit outside what it models.

Plans. This is the big one. The extension records that a query ran and how long it took. It doesn't record how it ran. No plan tree, no per-node costs, no join order, no index choice. A query whose plan flipped from an Index Scan to a Seq Scan overnight, the single most common cause of a sudden slowdown, shows up only as a mean_exec_time that went up. The reason is gone. For plans you need EXPLAIN, auto_explain, or a plan-capturing hook. pg_stat_statements sits at the wrong layer to ever see one.

Actual versus estimated rows. With no plan there's no per-node actual_rows to compare against the planner's estimate. The estimation error, which is the leading sign of stale statistics and the thing that predicts a future plan flip, is invisible here. The rows column is total rows returned across all calls, not an estimate measured against reality.

Parameter values. The constants were jumbled out. You can't recover which id was slow, only that the shape WHERE id = $1 was slow on average.

Failed queries. This one surprises people. Stats are updated at the end of a successful phase, and only then. A query that errors out mid-execution isn't counted. The manual lays out the asymmetry exactly:

plans and calls aren't always expected to match because planning and execution statistics are updated at their respective end phase, and only for successful operations. For example, if a statement is successfully planned but fails during the execution phase, only its planning statistics will be updated.

So the query that runs 30 seconds and then dies on a statement timeout, exactly the one you most want to find, contributes nothing to calls or total_exec_time. It planned, so it might bump plans if track_planning is on, and that's the only trace it leaves. Some of your worst queries are invisible precisely because they're your worst queries.

There's no error counter either. A statement that fails leaves no row of its own, so you can't ask pg_stat_statements how often a given shape errors, only how often it succeeded.

Spills, parallel workers, per-node buffers. Sort and hash spills to disk, workers_planned versus workers_launched, the shared/local/temp buffer split per plan node, all of that lives in EXPLAIN (ANALYZE, BUFFERS) output, not in the aggregated shared_blks_* totals. The view gives you a sum of blocks touched. It can't tell you where in the plan they were touched, or whether a sort spilled.

It only sees one node

pg_stat_statements is per-instance. Its hash table lives in this server's shared memory and counts the queries this server ran. It has no idea other nodes exist.

In a single-primary world that was fine. In a modern deployment (a primary with read replicas, an Aurora reader fleet, a Patroni cluster, Citus) it's a serious blind spot. Read traffic is the majority of the work for most apps, and it runs on the replicas. Each replica has its own independent pg_stat_statements, with its own counters, its own eviction, its own first-seen texts. Snapshot only the primary, which is what most monitoring setups do, and you see the write path plus whatever reads happen to land on the primary, while the bulk of the read workload goes unmeasured.

Merging the nodes isn't clean either. You'd want to add up calls for "the same query" across nodes, but the queryid is platform- and version-sensitive, so on a mixed fleet (mid-upgrade, different architectures) the same query can carry different ids on different nodes. Merging a cluster's workload under one canonical fingerprint, with per-node drill-down, is real work that lives above the extension. The extension only ever hands you one node's local tally.

The knobs, and what they cost

A quick tour of the settings, because the defaults hide a few traps.

GUC	Default	What to know
`pg_stat_statements.max`	`5000`	Hard cap on entries; restart to change. Watch `pg_stat_statements_info.dealloc`.
`pg_stat_statements.track`	`top`	`top` tracks client-issued statements only. Queries inside PL/pgSQL functions stay invisible until you set `all`.
`pg_stat_statements.track_utility`	`on`	Counts non-DML (`CREATE`, `VACUUM`, and so on). Can flood the table with one-off DDL on schema-churning systems.
`pg_stat_statements.track_planning`	`off`	Adds planning-time stats, but the manual warns of "a noticeable performance penalty" from contention when many connections run identical shapes and fight over one entry's spinlock. Off for a reason.
`pg_stat_statements.save`	`on`	Persists stats across clean restarts. A crash still loses everything since the last save.

Three that aren't obvious from the table:

track = 'top' hides function internals. If your hot logic lives in PL/pgSQL or stored procedures, the default shows you the SELECT do_the_thing() wrapper and nothing about the queries it runs inside. You have to opt into all and pay for the extra entries.

track_utility counts your transaction verbs. BEGIN, COMMIT, and ROLLBACK are utility statements, so with the default track_utility = on each one gets its own entry. They normalize to a single shape apiece, so it's a handful of rows, not a flood, but an ORM that opens and closes a transaction around every query turns COMMIT into one of your highest-calls rows: a content-free verb sitting at the top of any ranking by call count, ahead of the queries you actually want to see. If your top-by-calls view is mostly transaction control, that's the signal to turn track_utility off.

I/O timings need a second switch. The shared_blk_read_time and shared_blk_write_time columns are only populated when the server-wide track_io_timing is on, and it's off by default. Leave it off and those columns read zero, making every query look like it did no I/O, which is a quietly misleading kind of empty.

Why `track_planning` in particular hurts

The warning on track_planning is worth unpacking, because the obvious objection is the one the manual seems to skip. If a hot query gets executed by thousands of backends billions of times and that's fine, why would planning the same shape be the thing that tips over?

The answer is in how the counters get written. Every entry has its own spinlock (entry->mutex), and updating the entry means grabbing that lock. The update happens once per phase. The extension writes the entry at the end of planning and again at the end of execution, in two separate critical sections. With track_planning off, a query touches its entry's spinlock once per call, at exec-end. With it on, it touches the same lock twice, once at plan-end and once at exec-end. For a shape that lots of connections run at once, that single entry's lock is already the hottest in the system, and doubling the grabs roughly doubles the contention on it.

So it isn't that planning is contended and execution is free. Execution-side contention on that per-entry lock is itself a well-known bottleneck for ultra-hot identical queries. It's that track_planning adds a second grab of the already-hot lock to every call, and piles that extra traffic onto exactly the entries that were most contended to begin with (one shape, run everywhere). That's why a setting that sounds harmless, "also record planning time," ships off by default, and why you should leave it off on any workload dominated by a handful of extremely hot shapes.

Mind the column renames (PG 17)

PostgreSQL 17 renamed the I/O timing columns.
blk_read_time became shared_blk_read_time and blk_write_time became shared_blk_write_time, with new local_blk_* columns added alongside. A query selecting the old names errors after the upgrade; a SELECT * dashboard silently shifts. See the PG 17 release notes.

If you have dashboards or scrapers older than PostgreSQL 17, they may have broken on the upgrade without saying so. The same release also added the stats_since and minmax_stats_since timestamps and a minmax_only argument to pg_stat_statements_reset(). The view's shape isn't a stable contract across majors, so pin your column lists and re-check them on every upgrade.

Who can see what

One security corner to know before you wire this into a shared dashboard. The statistics are visible to everyone, but the SQL text and queryid of other users' queries are not:

For security reasons, only superusers and roles with privileges of the pg_read_all_stats role are allowed to see the SQL text and queryid of queries executed by other users. Other users can see the statistics, however, if the view has been installed in their database.

So a low-privilege monitoring role sees rows full of <insufficient privilege> where the query text should be. The fix is to grant pg_read_all_stats, a built-in role, to the monitoring user. Not superuser. If your "the dashboard shows no query text" mystery has a boring cause, this is usually it.

So what is it good for, and where does it leave you?

None of this is an argument against pg_stat_statements. It's still the first thing to turn on, and for its real job, ranking your query shapes by total time, calls, or I/O on one node since the last reset, nothing else is this cheap or this immediate. Reach for it first, every time. Just be clear about what you're holding:

It counts, it doesn't record. Cumulative totals, no time dimension. Snapshot and diff it yourself if you want history.
It gives you an average, not a spread. Great for "which shape dominates total time," useless for "what's my p99."
It's shape-fragmented. ORMs explode one query into many entries, and pulling them back together is a job for a normalizer on top.
It keeps first-seen text. Static comment tags survive, dynamic per-request tags don't, so it can never be an APM join target.
It's lossy under pressure. A full table evicts your tail. Watch dealloc.
It's plan-blind. It tells you a query is slow, never why. No plan, no estimate error, no spills.
It's single-node. The replica fleet that serves most of your reads is invisible from the primary.

So pg_stat_statements answers one question, what is slow?, and stops. The next questions are outside its model: why it got slow (a plan flip, stats drift), who is calling it, where across the cluster, and what would happen if you changed something. That isn't a missing feature; it's what a pile of aggregated counters fundamentally can't do.

The per-execution detail isn't truly gone, it's in the logs. log_min_duration_statement = 0 or auto_explain records every call with its parameters, tags, and timing, which is where the full distribution and the dynamic tags live. The catch is cost: you can't log every query on a busy system, which is the whole reason the cheap aggregate exists.

That answers the question we opened on. Postgres doesn't ship the query store you came looking for, and pg_stat_statements isn't it. It's the floor you'd build one on, not the thing itself.

So stop reading pg_stat_statements as a picture of your workload. Read it for what it is: a cheap, lossy box of counters that tells you where to point the real instruments next.

The table fills up and evicts your tail

When the query text vanishes

What it never records at all

It only sees one node

The knobs, and what they cost

Why track_planning in particular hurts

Mind the column renames (PG 17)

Who can see what

So what is it good for, and where does it leave you?

Why `track_planning` in particular hurts