In the previous article we watched every UPDATE leave dead tuple behind. The same copy-on-write behaviour shows up from the operational angle in DELETEs are difficult. That's the tradeoff of MVCC and on the heap alone it's tolerable. The problem is the indexes.
Every UPDATE in PostgreSQL potentially writes to every index on the table, even when the indexed columns didn't change. Five indexes, one updated column? Five extra index writes, five new entries to vacuum, five times the WAL traffic. At thousands of updates per second this becomes the dominant cost of running a write-heavy table.
Heap-Only Tuple (HOT) updates are PostgreSQL's escape hatch from this problem. They are, in my opinion, the single cleverest optimization in the storage engine. Let's trace exactly how they work.
Cost of a normal UPDATE
Without HOT, index maintenance scales poorly. Here's a table with multiple indexes:
pageinspect ships with the contrib modules and is available on most installations. It exposes raw page contents, useful for understanding storage, but never expose it to application users.CREATE EXTENSION IF NOT EXISTS pageinspect;
CREATE TABLE hot_demo (
id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
name text NOT NULL,
status text NOT NULL DEFAULT 'active',
score numeric(10,2)
);
CREATE INDEX idx_hot_name ON hot_demo (name);
CREATE INDEX idx_hot_status ON hot_demo (status);
INSERT INTO hot_demo (name, status, score) VALUES
('alice', 'active', 95.00),
('bob', 'active', 82.50),
('carol', 'active', 77.25);
Updating the indexed name column requires substantial background work.
UPDATE hot_demo SET name = 'BOB' WHERE id = 2;
PostgreSQL has to:
- Set
t_xmaxon the old tuple to the current transaction ID, marking it dead - Create a new tuple
- Insert a primary key index entry pointing to the new tuple's
ctid - Insert a new entry into
idx_hot_namepointing to the new tuple'sctid - Insert a new entry into
idx_hot_statuspointing to the new tuple'sctid
The id and status values remain identical, yet their respective indexes require new entries because the physical tuple location changed. The old index pointers become dead weight, pending removal by VACUUM.
Bypassing index updates
Index entries map key values to a physical tuple location (the ctid). If the new tuple is reachable through the old ctid, the index references remain valid and require no updates.
A ctid is a tuple's physical address: (page_number, line_pointer). Indexes store ctids, not row contents.
PostgreSQL uses a HOT update when two specific conditions are met. The new tuple must fit on the same page as the old tuple. And none of the updated columns are indexed.
If either condition fails, it's cold update with the full index dance. If both hold, the update goes HOT.
Demonstrating a HOT update
Let's see it. We update score, a column that isn't in any index:
UPDATE hot_demo SET score = 99.00 WHERE name = 'alice';
The pageinspect extension shows the HOT flags which are stored in t_infomask2.
SELECT
lp, lp_off, lp_flags, lp_len,
t_xmin, t_xmax, t_ctid,
CASE WHEN (t_infomask2 & x'4000'::int) > 0 THEN 'HOT_UPDATED' END AS hot_old,
CASE WHEN (t_infomask2 & x'8000'::int) > 0 THEN 'HEAP_ONLY' END AS hot_new
FROM heap_page_items(get_raw_page('hot_demo', 0));
The page also still carries the leftovers from the earlier name = 'BOB' cold update (dead bob at lp 2, new BOB at lp 4). Focus on lp 1 and lp 5, the alice update we just ran.
lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_ctid | hot_old | hot_new
----+--------+----------+--------+--------+--------+--------+-------------+-----------
1 | 8144 | 1 | 46 | 411759 | 411761 | (0,5) | HOT_UPDATED |
2 | 8096 | 1 | 46 | 411759 | 411760 | (0,4) | |
3 | 8048 | 1 | 48 | 411759 | 0 | (0,3) | |
4 | 8000 | 1 | 46 | 411760 | 0 | (0,4) | |
5 | 7952 | 1 | 46 | 411761 | 0 | (0,5) | | HEAP_ONLY
(5 rows)
Line pointer 1 marks the old tuple as dead with t_xmax = 411761. Its t_ctid points forward to (0,5). The HOT_UPDATED flag confirms the successor is a heap-only tuple.
Line pointer 5 holds the new tuple. The HEAP_ONLY flag means no index entries point directly to this tuple. It is only reachable by following the chain from line pointer 1.
The indexes remain untouched. They still point to (0,1). During an index scan, PostgreSQL reads the HOT_UPDATED flag at line pointer 1, follows t_ctid to (0,5), and returns the current version.
The ctid chain
Successive HOT updates form confined chains in a single page.
UPDATE hot_demo SET score = 98.00 WHERE name = 'alice';
UPDATE hot_demo SET score = 97.00 WHERE name = 'alice'; lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_ctid | hot_old | hot_new
----+--------+----------+--------+--------+--------+--------+-------------+-----------
1 | 8144 | 1 | 46 | 411759 | 411761 | (0,5) | HOT_UPDATED |
2 | 8096 | 1 | 46 | 411759 | 411760 | (0,4) | |
3 | 8048 | 1 | 48 | 411759 | 0 | (0,3) | |
4 | 8000 | 1 | 46 | 411760 | 0 | (0,4) | |
5 | 7952 | 1 | 46 | 411761 | 411762 | (0,6) | HOT_UPDATED | HEAP_ONLY
6 | 7904 | 1 | 46 | 411762 | 411763 | (0,7) | HOT_UPDATED | HEAP_ONLY
7 | 7856 | 1 | 46 | 411763 | 0 | (0,7) | | HEAP_ONLY
(7 rows)
The chain routes through lp 1 -> lp 5 -> lp 6 -> lp 7. Intermediate versions carry both HOT_UPDATED and HEAP_ONLY flags. They are dead links in the middle of the chain.
And here's the part that makes this work: the indexes still have exactly one entry for alice, pointing to (0,1). No matter how many times we HOT-update her score, the indexes never grow. PostgreSQL walks the chain from (0,1) through the dead intermediates until it lands on the live tuple at (0,7).
Page pruning
When PostgreSQL accesses a page with dead HOT chain members, it can clean up opportunistically. No VACUUM, no background worker, no special scheduling. Just a regular query that happens to touch the page and decides there's work worth doing.
But "can" is doing real work in that sentence. Pruning is gated by two checks, not one. The first is the visibility check: pd_prune_xid < RecentGlobalXmin, meaning the dead tuples are invisible to every running transaction. The second is a cost check: the page has to be full enough that compacting it is worth the cycles. The threshold is roughly 10% free space. Above that, PostgreSQL leaves the page alone even when there's pruneable garbage on it.
Our demo page has seven tiny tuples and most of the page is empty. A plain SELECT against it does nothing. The flags don't move, t_xmax stays set on the dead intermediates, the chain stays long. On a busy table where pages actually fill up, pruning fires constantly during regular SELECTs and UPDATEs. On a quiet demo table, you can stare at it all day.
To force the issue on our demo table, run VACUUM. VACUUM always prunes, regardless of free space:
VACUUM hot_demo;
Re-inspecting the page reveals the structural changes.
lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_ctid | hot_old | hot_new
----+--------+----------+--------+--------+--------+--------+---------+-----------
1 | 7 | 2 | 0 | | | | |
2 | 0 | 0 | 0 | | | | |
3 | 8144 | 1 | 48 | 411759 | 0 | (0,3) | |
4 | 8096 | 1 | 46 | 411760 | 0 | (0,4) | |
5 | 0 | 0 | 0 | | | | |
6 | 0 | 0 | 0 | | | | |
7 | 8048 | 1 | 46 | 411763 | 0 | (0,7) | | HEAP_ONLY
(7 rows)
- Line pointer 1 transitions to
LP_REDIRECT(lp_flags = 2). It abandons its tuple data and points directly to line pointer 7. - Line pointers 2, 5, and 6 become
LP_UNUSED(lp_flags = 0). Their tuple space is reclaimed for future inserts. (Line pointer 2 was the oldbobtuple from the cold update earlier, finally cleaned up by VACUUM as well.) - Line pointer 7 is compacted. The page is defragmented, packing surviving tuples tightly at the end of the page.
LP_REDIRECT
LP_UNUSED, LP_NORMAL, LP_REDIRECT, LP_DEAD) were introduced in Inside the 8KB Page. HOT pruning is where LP_REDIRECT earns its keep.
PostgreSQL retains the redirect at line pointer 1 because the indexes still reference (0,1). Freeing it would leave dangling pointers. The redirect only occupies the line pointer slot itself (four bytes), with no tuple data attached, and persists until a future VACUUM rewrites the index entries to point at (0,7) directly. Only then can line pointer 1 finally become LP_UNUSED.
When HOT fails
It helps to see the opposite case. Let's update name, which is indexed:
UPDATE hot_demo SET name = 'ALICE' WHERE name = 'alice';SELECT
lp, lp_flags, t_xmin, t_xmax, t_ctid,
CASE WHEN (t_infomask2 & x'4000'::int) > 0 THEN 'HOT_UPDATED' END AS hot_old,
CASE WHEN (t_infomask2 & x'8000'::int) > 0 THEN 'HEAP_ONLY' END AS hot_new
FROM heap_page_items(get_raw_page('hot_demo', 0)); lp | lp_flags | t_xmin | t_xmax | t_ctid | hot_old | hot_new
----+----------+--------+--------+--------+---------+-----------
1 | 2 | | | | |
2 | 1 | 411764 | 0 | (0,2) | |
3 | 1 | 411759 | 0 | (0,3) | |
4 | 1 | 411760 | 0 | (0,4) | |
5 | 0 | | | | |
6 | 0 | | | | |
7 | 1 | 411763 | 411764 | (0,2) | | HEAP_ONLY
(7 rows)
No HOT_UPDATED on lp 7, no HEAP_ONLY on lp 2. This is cold update. The new tuple landed at line pointer 2 (the slot freed by VACUUM a moment ago) and lp 7 was stamped dead with t_ctid pointing forward. PostgreSQL also had to insert a new entry into idx_hot_name (the indexed value changed), and because it's cold, it added new entries into idx_hot_status and the primary key index even though neither of those values changed.
The second common failure mode is space. Even if you only touch unindexed columns, completely full page leaves PostgreSQL no choice but to put the new tuple on a different page. And a cross-page update always means index maintenance, because the ctid changes.
fillfactor: making room for HOT
HOT updates need free space on the actual page. By default, PostgreSQL writes pages with a fillfactor of 100, packing them full during inserts. The first UPDATE on a full page has to allocate the new tuple elsewhere, and HOT is off the table.
Lowering the fillfactor reserves space specifically for updates.
CREATE TABLE hot_ff_demo (
id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
val text,
counter integer DEFAULT 0
) WITH (fillfactor = 80);
A fillfactor of 80 leaves 20% of the page empty during inserts, giving HOT chains room to grow.
fillfactor depends on your workload. Read-heavy or insert-only tables: leave it at 100. Wasting 20% of storage for updates that never happen is pointless. Frequent updates on non-indexed columns: 80-90 is a common choice. Wide rows with heavy churn: 50-70. The trade-off is storage. A lower fillfactor means more pages, which means more I/O for sequential scans.
You can measure HOT success rates through pg_stat_user_tables.
SELECT
relname,
n_tup_upd,
n_tup_hot_upd,
round(100.0 * n_tup_hot_upd / NULLIF(n_tup_upd, 0), 1) AS hot_pct
FROM pg_stat_user_tables
WHERE n_tup_upd > 0
ORDER BY n_tup_upd DESC;
A dropping hot_pct on an update-heavy table is signal worth investigating. The usual suspects:
- Pages are full. Lower the fillfactor or make sure VACUUM runs frequently enough to reclaim space before updates need it.
- Indexed columns are being updated. ORMs love to send
UPDATE users SET name = 'alice', email = 'alice@example.com', updated_at = now()when onlyupdated_atchanged. Ifnameoremailare indexed, that's a cold update every time. Look for "dirty tracking" features in your ORM. - Too many indexes. Every index adds a column to the "no indexed columns can change" condition. Review whether all of them are actually used. An unused index doesn't just waste space, it prevents HOT updates on every column it covers.
pd_prune_xid revisited
Back to the page header from Inside the 8KB Page. After a HOT update, pd_prune_xid gets stamped with the t_xmax of the dead tuple:
UPDATE hot_demo SET name = 'Alice' WHERE name = 'ALICE';
SELECT prune_xid FROM page_header(get_raw_page('hot_demo', 0)); prune_xid
-----------
411764
(1 row)
The name update here is cold because name is indexed. Cold updates seed pd_prune_xid too, since dead tuples need pruning either way. Once set, the field tracks the oldest unpruned t_xmax on the page, so newer dead tuples don't bump it forward.
That prune_xid of 411764 is the t_xmax of the oldest dead tuple still waiting to be pruned (the ALICE tuple at lp 7, killed by this latest update). The next time any backend touches this page, it compares pd_prune_xid against RecentGlobalXmin. If the dead tuples are guaranteed invisible to everyone and the page is full enough to be worth compacting, pruning fires.
RecentGlobalXmin is the oldest snapshot xmin across every active transaction, replication slot, and prepared transaction on the cluster. It's the horizon below which "definitely invisible to everyone" is safe to assume.
This is why pruning is described as "opportunistic" or "lazy". PostgreSQL doesn't schedule it, doesn't run it in the background, and doesn't bother with pages that have plenty of room. It checks two cheap conditions on every page access and cleans up on the spot when both hold. On a busy table, pages get pruned constantly during regular SELECTs and UPDATEs. On a half-empty page, dead tuples linger until either the page fills up or VACUUM runs.
RecentGlobalXmin stays pinned to its snapshot. No tuples created after that point can be pruned, no matter how dead they are. Same root cause that blocks VACUUM, and one of the reasons long-running transactions in PostgreSQL are so dangerous.
HOT in one paragraph
When PostgreSQL gets an UPDATE, it checks two things: is there room on the same page, and are all the changed columns unindexed? If yes to both, the update goes HOT:
- The new tuple lands on the same page with the
HEAP_ONLYflag. - The old version is stamped
HOT_UPDATED, witht_ctidpointing forward to the new one. pd_prune_xidis set, flagging the page as having dead versions to clean up.- No index entry is created or touched.
Later, when any backend reads the page, it checks two things: are the dead versions invisible to everyone, and is the page full enough to be worth compacting? When both hold, pruning fires: intermediate chain members become LP_UNUSED, the original line pointer turns into an LP_REDIRECT that still satisfies every index reference, and the page is defragmented. The reclaimed space is ready for the next HOT update.
With the right fillfactor, this loop runs forever, and VACUUM stays largely out of the picture.
What comes next
Page pruning is lightweight cleanup that happens during normal reads and writes. It handles dead tuples in HOT chains and reclaims their space on the page. But it has limits. It cannot touch index entries. It can flag dead tuples from cold updates but can't fully reclaim them, since the index entries still need to come out. It cannot set visibility map bits or update the free space map. And it cannot reach across pages.
For all of that, you need VACUUM. And as we covered in VACUUM Is a Lie (About Your Indexes), even VACUUM has its own blind spots once indexes enter the picture.