TOAST: Where PostgreSQL hides big values

In earlier posts in this series we established that every heap tuple lives inside a strict 8KB page. Everything else is built on top of that hard limit: MVCC, HOT updates, and indexes that point at (page, line_pointer). And yet this still works:

CREATE TABLE docs (id int PRIMARY KEY, body jsonb);
INSERT INTO docs VALUES (1, (SELECT jsonb_agg(g) FROM generate_series(1, 100000) g));

That body value is somewhere north of half a megabyte. The heap page is still 8KB. Both statements are true at the same time, and the mechanism that makes them coexist is TOAST: The Oversized-Attribute Storage Technique.

The 2KB threshold

The page is 8KB, but PostgreSQL starts shrinking tuples at around 2KB. The goal is to keep at least four tuples per page, so anything bigger triggers TOAST.

Two constants are at play. TOAST_TUPLE_THRESHOLD decides when the toaster runs; TOAST_TUPLE_TARGET is the size it tries to shrink the tuple down to. Both default to 2032 bytes on a standard 8KB build. The threshold is fixed at compile time; the target is per-table tunable via the toast_tuple_target storage parameter.

When an INSERT or UPDATE produces a tuple wider than the threshold, PostgreSQL doesn't wait to see whether the row would actually fail to fit on a page. It starts compressing and relocating attributes one at a time, and stops the moment the tuple drops below the target. A row only modestly over the line often gets away with one cheap compression pass; a much wider one walks the full procedure.

Only variable-length attributes are candidates. A bigint is always 8 bytes, so there is nothing to compress and nowhere to move it. A text, bytea, jsonb, numeric, or array, on the other hand, is a varlena: a length-prefixed structure PostgreSQL can freely rearrange.

The four storage strategies

Each variable-length column has a storage strategy: a flag in pg_attribute.attstorage that answers two yes/no questions for the toaster. Can this value be compressed? Can it be moved out-of-line? PostgreSQL picks a default based on the column's type, and you can change it with ALTER TABLE ... ALTER COLUMN ... SET STORAGE. There are four valid values:

Strategy	Compress?	Out-of-line?	Default for
`PLAIN`	no	no	fixed-width types
`EXTENDED`	yes	yes	most varlena types
`EXTERNAL`	no	yes	(opt-in)
`MAIN`	yes	last resort only	(opt-in)

EXTENDED is what you get for text, bytea, jsonb, and friends unless you ask for something else. The other three are levers for specific tradeoffs:

EXTERNAL skips compression entirely. Useful when the column is already compressed (images, gzipped blobs), or when you plan to use substring/byte-range reads, which work better on uncompressed out-of-line data.
MAIN tries hard to keep the value in the main heap tuple even after compression. It only moves out-of-line as a last resort.
PLAIN forbids both. Fixed-width types use it because they have no choice; you can set it on a varlena column too, but then any value that doesn't fit will fail the INSERT.

You can inspect and change strategies through \d+ or directly:

ALTER TABLE docs ALTER COLUMN body SET STORAGE EXTERNAL;

The shrinking procedure

Once the toaster is running, PostgreSQL walks the tuple's variable-length attributes in a fixed order, applying the cheapest transformations first:

Compress EXTENDED attributes, largest first. If any single attribute is large enough that it should clearly go out-of-line anyway, push it out immediately in the same pass.
If the tuple still doesn't fit, move remaining EXTENDED or EXTERNAL attributes out-of-line, largest first.
If still too big, compress MAIN attributes.
If still too big, move MAIN attributes out-of-line as a last resort.

Each pass stops as soon as the tuple drops below the target. A row with one chubby JSON column and a handful of small text fields usually only touches the JSON: a single compression pass typically does it.

The whole procedure lives in heap_toast_insert_or_update() in src/backend/access/heap/heaptoast.c, with the per-attribute helpers in src/backend/access/common/toast_helper.c. The four numbered loops in heap_toast_insert_or_update map one-to-one onto the four steps above. It's one of the more readable hot paths in the heap code if you want to see the policy in source form.

Watching it happen

To watch this happen we need a table with one wide column and the pageinspect extension we've been using throughout this series.

CREATE EXTENSION IF NOT EXISTS pageinspect;

CREATE TABLE toast_demo (
    id    integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    label text,
    body  text
);

Insert three rows of very different sizes:

INSERT INTO toast_demo (label, body)
VALUES
('tiny',         'hello'),
('medium',       repeat('compressible-text-', 200)),
('large-random', (SELECT string_agg(md5(g::text), '')
                   FROM generate_series(1, 1000) g));

The middle row is highly compressible: the same 18-byte sequence repeated 200 times. The third is effectively random (MD5 output concatenated), so compression won't help. Both are well over 2KB before TOAST gets involved.

Every table with a toastable column gets a companion TOAST table created for it automatically:

SELECT reltoastrelid::regclass FROM pg_class WHERE relname = 'toast_demo';

       reltoastrelid
---------------------------
 pg_toast.pg_toast_1367271

That pg_toast_<oid> table is where out-of-line chunks live. It has a fixed shape, the same for every TOAST table in the database:

\d pg_toast.pg_toast_1367271

   Column   |  Type
------------+---------
 chunk_id   | oid
 chunk_seq  | integer
 chunk_data | bytea
Owning table: "public.toast_demo"
Indexes:
    "pg_toast_1367271_index" PRIMARY KEY, btree (chunk_id, chunk_seq)

Every out-of-line value is sliced into chunks of up to TOAST_MAX_CHUNK_SIZE (~2000 bytes), each row identified by (chunk_id, chunk_seq). The primary key on those two columns is how PostgreSQL stitches chunks back together in order, on demand.

What the heap tuple looks like

Look at the main heap page now:

SELECT lp, lp_off, lp_len, t_xmin, t_xmax, t_ctid
FROM heap_page_items(get_raw_page('toast_demo', 0));

 lp | lp_off | lp_len | t_xmin | t_xmax | t_ctid
----+--------+--------+--------+--------+--------
  1 |   8152 |     39 | 414236 |      0 | (0,1)
  2 |   8040 |    108 | 414236 |      0 | (0,2)
  3 |   7976 |     59 | 414236 |      0 | (0,3)
(3 rows)

Three rows, three very different lengths on disk. The tiny row at lp 1 is 39 bytes: tuple header, the label, and 'hello' stored inline. The large-random row at lp 3 is 59 bytes despite carrying a 32KB body value, because the data went out-of-line and only an 18-byte TOAST pointer is left behind in the main heap tuple.

The interesting one is lp 2. 108 bytes is way too big for a TOAST pointer, and way too small for a 3.6KB string. That's because the medium value was compressed by pglz and stayed inline. The repeating 18-byte pattern crushed down to roughly 70 bytes; once the tuple was under the target, step 1 of the procedure was done and step 2 (move out-of-line) never ran. This is exactly the compress-first, relocate-only-if-needed policy in action: a highly compressible column can stay on the main heap and never produce a single TOAST chunk.

What sits at lp 3 in place of the value is an 18-byte TOAST pointer: 1-byte varlena header tagged 0x80, 1-byte kind discriminator (VARTAG_ONDISK), and then four fields, the uncompressed size, the on-disk size (with a compression-method tag baked in on PG14+), the chunk_id to look up, and the OID of the TOAST table to look in. The remaining bytes of the 59-byte tuple are the tuple header, the 'large-random' label, and the standard alignment padding.

If you go reading raw page hex, don't look for the usual 4-byte varlena length word in front of a TOAST pointer. There isn't one. External TOAST pointers always use the short 1-byte header form (0x80 marks them as "external, no inline length"), and the actual size is implied by the va_tag byte that follows. There's no alignment padding inside the pointer either; the 18 bytes are 1 + 1 + 16 contiguous.

The toast table itself has the real data:

SELECT chunk_id, chunk_seq, length(chunk_data)
FROM pg_toast.pg_toast_1367271
ORDER BY chunk_id, chunk_seq;

 chunk_id | chunk_seq | length
----------+-----------+--------
  1367279 |         0 |   1996
  1367279 |         1 |   1996
  1367279 |         2 |   1996
  1367279 |         3 |   1996
  1367279 |         4 |   1996
  1367279 |         5 |   1996
  1367279 |         6 |   1996
  1367279 |         7 |   1996
  1367279 |         8 |   1996
  1367279 |         9 |   1996
  1367279 |        10 |   1996
  1367279 |        11 |   1996
  1367279 |        12 |   1996
  1367279 |        13 |   1996
  1367279 |        14 |   1996
  1367279 |        15 |   1996
  1367279 |        16 |     64
(17 rows)

Only one chunk_id, because only the large-random value ended up out-of-line. The medium value never made it here at all: compression alone was enough, and it stayed in the main heap. large-random was effectively incompressible, so pglz sampled the start, gave up early, and the value was stored uncompressed out-of-line. The full 32KB got sliced into 16 chunks of ~2KB plus a 64-byte tail.

The TOAST table is a regular heap table. It has its own pages, its own MVCC, its own VACUUM accounting, and its own indexes. When you update a TOASTed column, the chunks for the old value remain dead until VACUUM removes them, which is why VACUUM (VERBOSE) on a wide table reports separate counts for the main relation and its TOAST relation.

Compression: pglz vs lz4

For years, PostgreSQL had exactly one compression algorithm for TOAST: pglz, a built-in LZ-family compressor. Since PostgreSQL 14, lz4 is also available. It's usually faster on both compress and decompress, with similar or slightly worse ratios on text-like data. The choice is per-column, with a cluster-wide default:

SHOW default_toast_compression;   -- 'pglz' or 'lz4'

ALTER TABLE docs ALTER COLUMN body SET COMPRESSION lz4;

A column's current setting is visible in pg_attribute.attcompression:

SELECT attname, attstorage, attcompression
FROM pg_attribute
WHERE attrelid = 'toast_demo'::regclass AND attnum > 0;

 attname | attstorage | attcompression
---------+------------+----------------
 id      | p          |
 label   | x          |
 body    | x          |

attstorage mirrors the four strategies (p=PLAIN, x=EXTENDED, e=EXTERNAL, m=MAIN). attcompression is a single char: p for pglz, l for lz4, and blank when the column uses the cluster default. Changing the setting only affects values written after the change. Existing rows keep whatever they were compressed with, because the compression method is recorded inside the TOAST pointer itself.

If you flip SET COMPRESSION on a table full of pglz data, nothing changes on disk until those rows are updated or rewritten by something like VACUUM FULL or CLUSTER. Mixed-compression columns are completely normal and supported.

Because the compression method is recorded inside each TOAST pointer, you can ask any individual row how it was compressed without trusting the column-level setting:

SELECT id,
       pg_column_compression(body) AS method,
       pg_column_size(body)        AS on_disk_bytes,
       octet_length(body)          AS logical_bytes
FROM toast_demo;

 id | method | on_disk_bytes | logical_bytes
----+--------+---------------+---------------
  1 |        |             6 |             5
  2 | pglz   |            72 |          3600
  3 |        |         32000 |         32000

pg_column_compression returns the compression method (pglz or lz4) when the on-disk bytes are actually compressed, and NULL otherwise. pg_column_size reads the on-disk size out of the TOAST pointer (or just reports the inline size). Both are cheap, neither has to fetch a chunk or decompress anything. octet_length (or length on a text value) is the expensive one: it has to detoast and decompress the value to count.

The three rows show three different outcomes:

Row 1: value is inline, never compressed (tiny string). method is NULL.
Row 2: value is inline and compressed. method is pglz; on_disk_bytes is far smaller than logical_bytes.
Row 3: value is out-of-line and uncompressed. method is NULL because pglz sampled the data, saw it wasn't compressing, and gave up. on_disk_bytes equals logical_bytes; the raw bytes live in the TOAST relation untouched.

NULL in method means "the on-disk bytes are not compressed" and tells you nothing about whether the value is inline or out-of-line. To distinguish those, compare on_disk_bytes to logical_bytes, or use PG17's pg_column_toast_chunk_id() which returns NULL for inline values and the actual chunk_id for out-of-line ones.

What it costs you

TOAST is transparent in the same way virtual memory is: a seamless abstraction until performance characteristics force you to look at the underlying plumbing.

A SELECT body FROM docs WHERE id = 1 for a 1MB out-of-line value does a lot more than read one heap tuple. PostgreSQL fetches the tuple, follows the TOAST pointer, looks up chunk_id in the TOAST index, reads every chunk in order from the TOAST relation, reassembles them, decompresses the result, and returns it. For a 1MB value with default chunk size, that's around 500 toast-table tuples plus their index lookups, on top of the main heap read.

The implications are practical:

SELECT * is more expensive than it looks. Every wide column gets detoasted even if your application throws the value away. Asking for only the columns you need is not just a style preference; on TOAST-heavy tables it's an I/O decision.
pg_column_size(body) is cheap; length(body) is not. pg_column_size reads the on-disk size right out of the TOAST pointer without fetching a single chunk. length and most other functions have to detoast and decompress the whole value first. substring(body, n, m) is a useful middle case: when the value is stored out-of-line and uncompressed (EXTERNAL), PostgreSQL can fetch only the chunks the substring actually touches.
WHERE body LIKE '%foo%' reads everything. A sequential scan that filters on a TOASTed column has to detoast every row to evaluate the predicate. This is one of the strongest arguments for an expression index, a GIN trigram index, or just storing a separate searchable summary column.
Updates to a TOASTed value rewrite the chunks. PostgreSQL does not patch individual chunks. Update one key inside a 1MB jsonb, and you get a brand-new value, a brand-new chunk_id, and a full new set of chunks. The old chunks become dead and wait for VACUUM. This is the operational cost behind the common advice "don't store frequently-updated big documents in a single column."

Limits

The varlena length field gives every variable-length value a 1GB ceiling per column per row. That's the hard limit. Anything larger has to live outside the database, typically in object storage, with the database keeping a reference. There is no setting that raises this limit.

The toast pointer itself is small (18 bytes) and adds almost nothing to the main tuple. So while a TOASTed row is logically huge, its footprint on the main heap page is tiny, which is the whole point. A row whose wide column was pushed out-of-line is around 60 bytes on the heap, the same density as any narrow table. The expense only materializes when you actually ask for the wide column.

Following a tuple through the toaster

Every wide tuple takes the same path, but exits at a different door. The diagram below traces it end to end: threshold check, compress EXTENDED, move EXTENDED/EXTERNAL out-of-line, compress MAIN, move MAIN out-of-line. Each step is followed by a size check that lets the tuple leave early.

Our demo rows mapped neatly onto this flow. The medium value compressed under the target on the first step and never left the main heap. The large-random value failed the first size check, got pushed out-of-line by step two, and only the small TOAST pointer made it back to the heap page.

Keeping the page invariant safe

Whenever a heap tuple would exceed about 2KB, PostgreSQL shrinks it: first by compressing variable-length columns, then by relocating them, largest first, into the table's companion TOAST relation. The main-heap tuple is left holding a small TOAST pointer in place of the value. The out-of-line bytes live as (chunk_id, chunk_seq, chunk_data) rows, indexed for ordered reassembly. Reads transparently glue the chunks back together and decompress on the fly. The page-level invariants, so one tuple per page slot, indexes pointing at (page, lp), MVCC and HOT operating on heap tuples, they all hold, because as far as the page is concerned, every value is small. TOAST is the trick that makes "every tuple fits in one page" still true in a world of megabyte JSON documents.