The work around RegreSQL led me to focus a lot on buffers. If you are a casual PostgreSQL user, you have probably heard about adjusting shared_buffers and followed the good old advice to set it to 1/4 of available RAM. But after we got a little too enthusiastic about them on a recent Postgres FM episode, I've been asked what that's all about.

Buffers are one of those topics that easily get forgotten. And while they are a foundational piece of PostgreSQL's performance architecture, most of us treat them as a black box. This article is going to attempt to change that.

The 8KB page

There's one concept we need to cover before diving into buffers, and that's the 8KB page. Everything in PostgreSQL is stored in blocks that are 8KB in size.

When PostgreSQL reads data, it does not read individual rows. It reads the entire page. When it writes something, same thing - the whole page. If you want to retrieve one small row, you will always retrieve much more data along with it. And if you followed carefully, the same applies to writes.

-- you can check the block size (which is almost always 8192 bytes)
show block_size;

 block_size
------------
 8192
(1 row)

Every table and index is a collection of these pages. A single row cannot span pages - oversized values are moved out of line (into TOAST storage) - but the page remains the atomic unit of I/O.
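
You can see this layout for yourself: the hidden ctid column exposes each row's physical location as (page number, item number). The table name below is just a placeholder for any table you already have.

-- ctid shows which 8KB page each row lives on
SELECT ctid, * FROM orders LIMIT 5;
-- (0,1), (0,2), ... until page 0 fills up, then (1,1) and so on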

PostgreSQL vs OS

The interesting part is understanding why PostgreSQL needs to maintain its own buffer cache when the operating system can already cache disk pages.

The answer is quite simple: PostgreSQL understands the data it reads, whilst the operating system only sees files and bytes. PostgreSQL sees tables, indexes and query plans, and has the semantic knowledge to cache more intelligently.

Consider this example: a query needs to perform a sequential scan of a large table. The OS might happily cache all those pages, but PostgreSQL knows this is a one-time operation and uses a special strategy (a ring buffer) to avoid evicting the contents of its main cache.

The second important aspect is ACID - or more precisely, the guarantee that the write-ahead log (WAL) reaches stable storage before the modified data page does. The OS does not differentiate between these writes and can't enforce this ordering without a significant performance cost.

shared_buffers

Now we can move on to what we all know as the main PostgreSQL "cache". The shared_buffers parameter controls the size of the shared memory area accessible to all backend processes. If any backend needs to retrieve a page, it first checks the shared buffers. If the page is present - it's a hit, and no disk I/O is needed. Miss? Read it from disk (or the OS cache) and store it in shared buffers for next time.
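
You can get a rough sense of how often that hit happens from the statistics views. A small sketch (note that blocks served by the OS cache count as "reads" here, since PostgreSQL can't tell them apart from disk reads):

-- share of page requests served from shared buffers for the current database
SELECT datname,
       round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) AS hit_ratio_pct
FROM pg_stat_database
WHERE datname = current_database();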

WAL has its own buffer area (wal_buffers). This separation exists because WAL writes are sequential and must be persisted before the corresponding data change can be considered committed. The default is 3% of shared_buffers, capped at 16MB (one WAL segment).
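
With the stock 128MB shared_buffers, that automatic sizing works out to 4MB:

show wal_buffers;

 wal_buffers
-------------
 4MB
(1 row)
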
show shared_buffers;

 shared_buffers
----------------
 128MB
(1 row)

The default value of 128MB is very conservative; it's part of making sure a default PostgreSQL installation will work on pretty much any system (including those with limited RAM). But in a regular production environment this value should typically be much higher.

The 128MB refers to the actual buffered content. With a single page being 8KB, you can imagine this as 16,384 individual slots for storing data.
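
You can see that number directly - pg_settings reports shared_buffers in units of 8KB blocks:

SELECT setting, unit FROM pg_settings WHERE name = 'shared_buffers';

 setting | unit
---------+------
 16384   | 8kB
(1 row)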

However, there's more to the buffer pool than just the pages themselves. PostgreSQL needs to track metadata and provide fast lookups, so the shared memory area is organized into three components:

  • buffer blocks - the actual 8KB pages where the data lives
  • buffer descriptors - a parallel array of ~64-byte structures, one per slot, holding metadata
  • buffer mapping hash table - maps page identifiers to individual buffer slots

Each descriptor then tracks which page is cached in the slot (the tag), flags describing its state (dirty, valid, I/O in progress), and the pin and usage counters.

The hash table enables fast lookups. When a backend needs a specific page, it hashes the page identifier and jumps directly to the right bucket - no need to scan all 16,384 slots. This keeps buffer lookups at O(1) (constant time), regardless of the pool size.

When a backend needs page N of the orders table, it hashes the page identifier and looks it up in the hash table - that lookup drives the hit/miss logic.

Pin and usage counts

What happens to each buffer slot is governed by the two counters mentioned above: the pin count and the usage count.

Pin count tracks active references. When a backend (for example a running query) is actively reading or modifying a page, it pins the buffer so it can't be evicted. When the backend finishes, it unpins the buffer.

Usage count tracks how recently and frequently a buffer was accessed. Each access increments the count (capped at 5). During eviction, the clock sweep decrements this value - buffers with higher counts survive longer, while those at 0 get evicted.

The usage counter is important to avoid the behaviour where a single sequential scan would flush the entire buffer pool. Imagine reading a full 1GB table: without this protection, it would evict everything in shared buffers, ignoring frequently accessed data. We will touch on this specific behaviour later, with ring buffers.

PostgreSQL gives you the ability to examine what's happening in the shared buffer cache in real time using the pg_buffercache extension.
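
For example (the relation-level query below follows the pattern from the pg_buffercache documentation):

-- install the extension once per database
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- distribution of usage counts across all buffer slots
SELECT usagecount, count(*)
FROM pg_buffercache
GROUP BY usagecount
ORDER BY usagecount;

-- relations in the current database occupying the most buffers
SELECT c.relname,
       count(*) AS buffers,
       count(*) FILTER (WHERE b.isdirty) AS dirty
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE b.reldatabase = (SELECT oid FROM pg_database WHERE datname = current_database())
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;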

The clock sweep algorithm

Now that we have covered usage tracking, what happens when PostgreSQL needs to load a page and all slots are taken? It has to make some space.

A simple LRU scheme would require much more maintenance; updating a linked list on every buffer access would dramatically increase the complexity.
That's where the clock sweep algorithm comes in. Why clock? Imagine the buffer pool as a circular clock face, with the algorithm always moving forward, sweeping along the way. As it passes each slot, it asks:
  1. Is the buffer pinned? Skip it, it's being used.
  2. Is the usage counter > 0? Reduce it by one and continue to the next slot (i.e. the buffer survives until the next round).
  3. Has the usage counter hit 0 while the buffer is not in use? It's going to be evicted.

The clock sweep is a simple way to ensure cold pages get evicted quickly, while hot pages tend to stick and survive multiple sweeps.

Dirty buffers and the background writer

Don't forget that a change is always first written to WAL.
Up until now we have treated a buffer as an identical copy of the page on disk. When a backend modifies the page, the buffer becomes "dirty", but it's not immediately written to storage. Dirty buffers represent I/O work that has not yet happened. Writing immediately would be inefficient: the same page can be modified multiple times in short bursts, which would make the earlier I/O operations unnecessary.

Instead, dirty pages accumulate until one of the following events happens.

PostgreSQL writes all dirty buffers to disk during a checkpoint. This is a periodic process (which you can also force using the CHECKPOINT command) and the point at which data on disk becomes consistent. After its successful completion, crash recovery only needs to replay write-ahead logs from that moment forward.
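
A minimal illustration - these settings control how often checkpoints happen automatically:

-- checkpoints normally run on a schedule or when enough WAL accumulates
SHOW checkpoint_timeout;
SHOW max_wal_size;

-- you can also force one manually
CHECKPOINT;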

The second mechanism is the background writer. It continuously scans for dirty buffers and writes them out before anyone needs to evict them. This way, when a backend runs the clock sweep, it finds clean buffers ready to evict without having to wait for I/O. It also spreads writes over time instead of producing bursty spikes under eviction pressure.
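
You can see how much work the background writer has been doing from the cumulative statistics (a sketch; the exact columns vary by PostgreSQL version, and recent releases move some counters to pg_stat_checkpointer and pg_stat_io):

-- background writer activity since the last statistics reset
SELECT * FROM pg_stat_bgwriter;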

And as suggested, the last chance to write a dirty page to disk is when the clock sweep picks its buffer for eviction. The data must be written out before the slot can be reused for the new page. This is the worst case, as it resorts to a synchronous I/O operation.

The ultimate goal is to keep enough clean buffers available, and therefore prevent backends from being blocked by synchronous writes during eviction. If you followed carefully, you can see how a bad workload or bad settings can cause a problem: eviction happens when new buffers are loaded (an I/O operation), and if the evicted buffer is dirty you force yet another I/O operation.

Ring buffers

As mentioned above, there are also special types of buffers. We already touched on the scenario where a query performs a sequential scan of a large table.

In a naive implementation, the sequential scan would load all data into shared buffers, evicting everything else along the way. Your warmed cache would vanish and subsequent queries would suffer for minutes to come.

PostgreSQL's solution for this case is ring buffers: small, private buffer pools for bulk operations. Instead of using the shared buffer pool, certain operations get their own limited ring.

The individual cases are:

Sequential scans on tables larger than 1/4 of shared_buffers use a dedicated 256KB ring buffer. Pages cycle through this tiny ring and never touch the main cache.
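
If you want to reproduce this, any table comfortably larger than 1/4 of your shared_buffers will do - the row count below is an arbitrary choice, so your exact buffer numbers will differ:

-- build a throwaway table large enough to trigger the ring buffer strategy
CREATE TABLE ring_buffer_test AS
SELECT g AS id, md5(g::text) AS payload
FROM generate_series(1, 5000000) g;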

-- perform a sequential scan over a large table
EXPLAIN (ANALYZE, BUFFERS) SELECT count(1) FROM ring_buffer_test;

This would show you Buffers: shared read=127285 instead of hit. I will follow up with a separate article on how to read buffers in explain plans.

Bulk writes (COPY, CREATE TABLE AS) use a 16MB capped ring buffer, large enough for efficient batching, yet small enough not to pollute the shared buffer pool.

As VACUUM touches every page and shouldn't evict the hot data, it uses its own dedicated ring buffer. Historically it was fixed at 256KB, but since PostgreSQL 16 it can be configured using vacuum_buffer_usage_limit (the previous two sizes remain fixed).
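
A quick way to inspect and override it (the values here are just examples):

-- current ring size used by VACUUM and ANALYZE
SHOW vacuum_buffer_usage_limit;

-- or override it for a single run
VACUUM (BUFFER_USAGE_LIMIT '256kB', VERBOSE) ring_buffer_test;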

Local buffers

The second kind of exception to the shared buffer pool is session-based temporary tables. Since concurrency is out of the question here, each backend has its own local buffer pool for them, controlled by temp_buffers (8MB by default).

Local buffers are faster than shared buffers because they have simpler locking. There is no need for the heavy cross-process coordination required in the main cache.

While this might seem like a minor implementation detail, it can translate into a powerful optimisation strategy. Many developers tend to default to complex CTE logic for intermediate data, but using temporary tables offers a distinct advantage in the form of lower I/O: changes to temporary tables are not WAL-logged, and by their nature they reduce pollution of the shared buffer pool.

If your workload involves large temporary tables, increasing temp_buffers can help keep those operations purely in RAM. However, remember that this memory is per-connection, so it multiplies across all backends.
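A minimal sketch - the table and column names are illustrative, and note that temp_buffers can only be changed before the first use of temporary tables in a session:

-- give this session a bigger local buffer pool, then stage the
-- intermediate result in a temporary table instead of a complex CTE
SET temp_buffers = '256MB';
CREATE TEMPORARY TABLE tmp_recent_orders AS
SELECT * FROM orders WHERE created_at > now() - interval '7 days';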

The OS Page Cache

PostgreSQL doesn't bypass the operating system. Every read and write goes through the kernel, which maintains its own page cache. This creates double buffering - the same 8KB page can exist in both PostgreSQL's shared buffers and the OS cache simultaneously.

The OS cache acts as a "Level 2 cache" - when PostgreSQL evicts a page, it often still lives in OS memory.

This sounds wasteful, but it's actually a feature. When PostgreSQL evicts a clean page to make room, that page drops into the OS cache rather than disappearing. A moment later, if you need it back, the OS serves it from RAM - no disk I/O required.

The OS also provides read-ahead - detecting sequential access patterns and pre-loading pages before PostgreSQL requests them.

This relationship explains the classic advice: set shared_buffers to 25% of RAM. You're deliberately leaving space for the OS to act as a safety net.

-- on dedicated servers with large RAM, 40% can work well,
-- but always leave room for the OS cache and other processes
-- (note that changing shared_buffers requires a server restart)
ALTER SYSTEM SET shared_buffers = '8GB';

PostgreSQL needs to know about this combined cache for query planning. That's where effective_cache_size comes in - it tells the planner how much total cache (shared buffers + OS) to assume when estimating costs. Unlike shared_buffers, this parameter allocates no memory - it's purely a hint for cost estimation.

-- estimate of total cache available (shared + OS)
SHOW effective_cache_size;

A higher value encourages the planner to favor index scans, assuming data is likely cached somewhere even if not in shared buffers.
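
If you want to adjust it, a commonly used starting point is roughly 50-75% of total RAM on a dedicated server (the value below is just an example):

-- effective_cache_size is only a planner hint, so no restart is required
ALTER SYSTEM SET effective_cache_size = '24GB';
SELECT pg_reload_conf();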

Conclusion

Buffers are one of the keystones of PostgreSQL internals. They control whether the query will hit fast RAM or slow disk; and at the same time they play a crucial role in the fragile balance of dirty pages, WAL and durability.

The shared buffer pool isn't just a cache - it's a sophisticated memory manager with clock sweep eviction, usage count decay, ring buffer isolation, and background maintenance. Understanding these mechanisms helps you tune effectively and diagnose problems when they start.