The previous article showed that PostgreSQL 18 makes optimizer statistics portable, but left one gap open: it's not worth trying to inject relpages, because the planner checks the actual file size and scales the statistics proportionally.
The planner doesn't trust pg_class.relpages. It calls smgrnblocks() to read the actual number of 8KB pages from disk. Your table is 74 pages on disk but pg_class.relpages says 123,513? The planner uses the ratio to scale reltuples down to match the actual file size. The selectivity ratios stay correct, plan shapes mostly survive, but the absolute cost estimates are off.
For debugging a single query, that's usually fine. For automated regression testing where you compare EXPLAIN costs across runs, it breaks things. A cost threshold of 2× means something different when the baseline was computed from fake-scaled numbers.
As part of my work on RegreSQL, I'm happy to announce the pg_regresql extension, which fixes this by hooking directly into the planner.
Why the planner ignores relpages
When PostgreSQL's planner calls get_relation_info() in plancat.c, it delegates to estimate_rel_size() which ends up in table_block_relation_estimate_size() in tableam.c. There, the actual page count comes from the storage manager:
curpages = RelationGetNumberOfBlocks(rel);
The function then computes a tuple density from pg_class (reltuples / relpages) and multiplies it by curpages to estimate tuples. So pg_class.reltuples isn't ignored, it's scaled to match the real file size. The reasoning is sound for normal operation: the catalog might be stale, but the file system is always current.
The same applies to indexes. The planner reads their actual sizes from disk too.
What pg_regresql does
The extension hooks into get_relation_info_hook, a planner callback that runs after PostgreSQL reads the physical file stats. The hook replaces the file-based numbers with the values stored in pg_class:
rel->pages      ← pg_class.relpages
rel->tuples     ← pg_class.reltuples
rel->allvisfrac ← pg_class.relallvisible / pg_class.relpages
It does the same for every index in rel->indexlist. Pages and tuples for each index are overridden from the index's own pg_class entry.
The guard conditions are simple: skip the override if relpages == 0 (empty or never analyzed) or reltuples == -1 (never analyzed). The hook only activates for tables that have been ANALYZEd or had statistics injected.
Installation
Build from source using PGXS:
cd pg_ext
make
make install
It requires no GUCs, background workers, or shared memory.
Usage
Load the extension in your session:
LOAD 'pg_regresql';
That's it. Every EXPLAIN in this session will now use catalog statistics instead of file sizes. There are no functions to call, no tables to configure.
You can also load it per-database by adding it to session_preload_libraries in postgresql.conf or via ALTER DATABASE:
ALTER DATABASE test_db SET session_preload_libraries = 'pg_regresql';

The difference it makes
Using the same test_orders example from the previous article: 10,000 actual rows, injected with production statistics claiming 50 million rows across 123,513 pages.
Without pg_regresql:
EXPLAIN SELECT * FROM test_orders WHERE created_at > '2024-06-01';

                                             QUERY PLAN
----------------------------------------------------------------------------------------------------
Index Scan using test_orders_created_at_idx on test_orders (cost=0.29..153.21 rows=6340 width=26)
Index Cond: (created_at > '2024-06-01'::date)
The plan shape is correct (index scan thanks to the histogram), but the row estimate is 6,340. For a 50-million-row table where the filter covers roughly 10% of the histogram range, the estimate should be in the millions. The planner saw 74 real pages on disk, scaled reltuples down to ~30,000, then applied selectivity. The ratio is preserved but the absolute number is wrong.
With pg_regresql:
LOAD 'pg_regresql';
EXPLAIN SELECT * FROM test_orders WHERE created_at > '2024-06-01';

                                                 QUERY PLAN
-----------------------------------------------------------------------------------------------------------
Index Scan using test_orders_created_at_idx on test_orders (cost=0.29..153212.27 rows=10791836 width=27)
Index Cond: (created_at > '2024-06-01'::date)
(2 rows)
Cost numbers now reflect the full 50 million rows.
Where this matters
Cost-based regression testing. If you're comparing EXPLAIN costs between schema versions (which is what RegreSQL does), you need the absolute numbers to be stable and realistic. With the scaling behavior, your baseline costs are proportional to your test database size, not production. A migration that doubles a cost in production might show a 1.3× increase in CI because the scaled-down numbers compress the range.
Reproducing production plans on a laptop. Sometimes the plan shape itself changes depending on the absolute numbers. A query with multiple joins might get a different join order when the planner sees 50 million rows vs. 30,000 rows, because the cost crossover between hash join and nested loop depends on the absolute row count, not just the ratio.
Index-only scans. The allvisfrac (fraction of all-visible pages) matters for index-only scan costing. Without the hook, allvisfrac is computed from the real relallvisible catalog value divided by the real page count. With injected stats, relallvisible might be 120,000 but the real page count is 74, so the fraction clamps to 1.0 and the planner overestimates how cheap index-only scans are. The hook fixes this by using the injected relpages as the denominator.
What it doesn't do
Column-level statistics and ANALYZE behavior are unchanged. The extension only affects how the planner reads table and index sizes. One thing worth noting: EXPLAIN ANALYZE will still show actual row counts from the real (small) data. The extension changes the planner's cost estimates, not query execution.
The full workflow
Combining PostgreSQL 18's portable statistics with pg_regresql, the full workflow looks like this:
# 1. dump schema and statistics from production
pg_dump --schema-only -d production_db > schema.sql
pg_dump --statistics-only -d production_db > stats.sql
# 2. create test database
createdb test_db
psql -d test_db -f schema.sql
# 3. load minimal fixture data (optional)
psql -d test_db -f fixtures.sql
# 4. inject production statistics
psql -d test_db -f stats.sql
# 5. install pg_regresql and prevent stats from being overwritten
psql -d test_db <<SQL
ALTER DATABASE test_db SET session_preload_libraries = 'pg_regresql';
ALTER TABLE orders SET (autovacuum_enabled = false);
-- repeat for other tables
SQL
# 6. reconnect and verify (plans now match production)
psql -d test_db -c "EXPLAIN SELECT * FROM orders WHERE status = 'pending'"

Compatibility
The extension works with PostgreSQL 13 through 18. The portable statistics functions (pg_restore_relation_stats, pg_restore_attribute_stats) require PostgreSQL 18, but pg_regresql works with any method of writing to pg_class, including direct catalog updates on older versions.
PostgreSQL 19 will need a small update: the get_relation_info_hook used by pg_regresql has been replaced with build_simple_rel_hook. The new hook runs slightly later with different arguments, but the override logic stays the same.
Try pg_regresql in your dev/test and CI databases.