Starting with arrays in PostgreSQL is as simple as declaring a column as integer[], inserting some values, and you are done.
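
A minimal sketch (the inventory table is just an illustration):

CREATE TABLE inventory (
    id    serial PRIMARY KEY,
    sizes integer[]           -- a plain array of integers
);

INSERT INTO inventory (sizes)
VALUES ('{36,38,40}'), (ARRAY[42, 44]);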

Or building the array on the fly.

SELECT '{1,2,3}'::int[];
  int4
---------
 {1,2,3}
(1 row)

SELECT array[1,2,3];
  array
---------
 {1,2,3}
(1 row)

The official documentation provides a good introduction. But beneath this straightforward interface lies more complexity than most of us realise. Arrays in PostgreSQL are not just "lists" in a field. They have their own storage strategy, their own indexing logic, and plenty of edge cases.

As it goes with boringSQL deep-dives, this article will explore the corners of array functionality that might break your production.

The document model temptation

Wait, are we going to talk about JSONB arrays? Not at all. The whole concept of arrays in an RDBMS is actually document storage in disguise.

In database design, locality ensures faster retrieval times by keeping related data close together on physical storage. Whether you use a distinct integer[] type or a JSON list [1, 2, 3], you are making the exact same architectural decision: you are prioritising locality over normalisation.

When you store tag_ids in an array, you are embedding related data directly into a row - just like a NoSQL database might embed subdocuments. This is not inherently wrong. Document databases exist for good reasons: they eliminate joins, simplify reads, and map naturally to application objects.

But PostgreSQL is a relational database. It was designed around the relational model, where:

  • foreign keys enforce referential integrity
  • joins connect normalised tables
  • updates modify individual rows, not entire lists

Arrays give you document-model convenience, but you lose relational promises. There are no foreign keys and no ON DELETE referential_action (like CASCADE) for array elements. If you delete a tags entry, the orphaned ID will remain in your array forever.
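
To make this concrete, here is a throwaway sketch with hypothetical posts and tags tables; deleting a tag leaves its ID stranded inside every array that referenced it:

CREATE TABLE tags  (id int PRIMARY KEY, name text);
CREATE TABLE posts (id int PRIMARY KEY, tag_ids int[]);

INSERT INTO tags  VALUES (1, 'urgent'), (2, 'bug');
INSERT INTO posts VALUES (100, '{1,2}');

-- nothing stops this, and nothing cleans up after it
DELETE FROM tags WHERE id = 2;

SELECT tag_ids FROM posts WHERE id = 100;   -- still {1,2}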

The rule of thumb is that if you find yourself in need of referential integrity - you most likely want a link table, not an array. Arrays are for data that shares the same lifecycle as the parent row. Not for relationships spanning across different tables.

A practical example: the author of a blog post belongs in its own table (one author might write many other posts), whereas a whitelist of IP addresses for a service account applies only to that one entity and shares its lifecycle.

With JSONB being so flexible, you might wonder why we still bother with quirkier native types. The answer lies in the 'boring' part of the database: predictability and efficiency. An integer[] column guarantees that every element is an integer.

Arrays are also more storage-efficient for primitives because they don't carry the metadata overhead of JSON objects.
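
For a rough, hedged feel of that difference, compare the size of the same 1,000 integers stored as an int[] versus a JSONB array (exact numbers will vary):

SELECT
    pg_column_size(array_agg(i)) AS int_array_bytes,
    pg_column_size(jsonb_agg(i)) AS jsonb_array_bytes
FROM generate_series(1, 1000) AS g(i);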

The syntax gotchas

This post assumes basic familiarity with arrays, so we won't rehash how to construct and read them.

Arrays don't have to start at 1

By default, SQL arrays start at 1, and there is seemingly nothing wrong with iterating through them in a fashion similar to:

FOR i IN 1 .. array_length(fruits, 1) LOOP
    RAISE NOTICE 'Index % contains: %', i, fruits[i];
END LOOP;

that is, until you find an array with arbitrary bounds, which PostgreSQL allows.

SELECT '[-5:-3]={10,20,30}'::int[];

To make sure you iterate correctly through any given array, always use array_lower() and array_upper() in PL/pgSQL:

SELECT array_lower('[-5:-3]={10,20,30}'::int[], 1);
 array_lower
-------------
          -5
(1 row)

or generate_subscripts() in SQL.

SELECT generate_subscripts('[-5:-3]={10,20,30}'::int[], 1);
 generate_subscripts
---------------------
                  -5
                  -4
                  -3
(3 rows)
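
Putting the two together, a minimal sketch of a bounds-safe PL/pgSQL loop (the fruits value is just an illustration):

DO $$
DECLARE
    fruits text[] := '[-2:0]={apple,banana,cherry}'::text[];
BEGIN
    -- array_lower/array_upper respect whatever bounds the array carries
    FOR i IN array_lower(fruits, 1) .. array_upper(fruits, 1) LOOP
        RAISE NOTICE 'Index % contains: %', i, fruits[i];
    END LOOP;
END $$;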

Missing dimensions

When creating a table, you might expect strict typing. That's true for everything except the array dimensions. You might think integer[][] enforces a 2D matrix. It does not. The [] syntax is effectively syntactic sugar: by default, PostgreSQL does not enforce the number of dimensions at the schema level at all.

CREATE TABLE dimension_test (
    matrix integer[][] 
);

INSERT INTO dimension_test VALUES ('{{1,2}, {3,4}}');

-- this is not going to fail
INSERT INTO dimension_test VALUES ('{1,2,3}');

-- 3D matrix works too
INSERT INTO dimension_test VALUES ('{{{1,2},{3,4}}, {{5,6},{7,8}}}');

If you want to enforce a specific array dimension, you cannot rely on the type definition. Instead, you must use a CHECK constraint.

CREATE TABLE strict_matrix (
    -- array_ndims: ensure it's 2-dimensional
    -- array_length on both dimensions: make sure it's exactly 3x3
    board integer[] CHECK (
        array_ndims(board) = 2
        AND array_length(board, 1) = 3
        AND array_length(board, 2) = 3
    )
);
INSERT INTO strict_matrix VALUES (
    ARRAY[
        [0, 1, 0],
        [1, 1, 0],
        [0, 0, 1]
    ]
);
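
With the constraint in place, the 1-D literal that slipped into dimension_test earlier is now rejected:

-- fails the CHECK constraint: array_ndims returns 1, not 2
INSERT INTO strict_matrix VALUES ('{1,2,3}');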

The one thing PostgreSQL does enforce is uniformity at every nesting level, which means it rejects sub-arrays of different sizes.

INSERT INTO dimension_test VALUES ('{{1,2}, {3}}');
ERROR:  malformed array literal: "{{1,2}, {3}}"
LINE 1: INSERT INTO dimension_test VALUES ('{{1,2}, {3}}');
                                           ^
DETAIL:  Multidimensional arrays must have sub-arrays with matching dimensions.

Slicing arrays

When accessing array values, it's important to remember that [1] and [1:1] are not the same thing. The first is an element accessor; the second is a slice, which always produces an array.

SELECT matrix[1][1] FROM dimension_test;
 matrix
--------
      1
(1 row)

When slicing an array, even if the slice contains a single element, it will be returned as a single-element array (of the same dimensionality), not a scalar value.

SELECT matrix[1:1][1:1], matrix[1][1:1], matrix[1:1][1] FROM dimension_test;
 matrix | matrix | matrix
--------+--------+--------
 {{1}}  | {{1}}  | {{1}}
(1 row)

The ugly

Array access is forgiving, which makes the underlying bugs harder to find:

-- out-of-bound access returns NULL
SELECT (ARRAY[1,2,3])[10];
 array
--------
 (null)
(1 row)
-- out of bounds slicing returns an empty array
SELECT (ARRAY[1,2,3])[5:10];
 array
-------
 {}
(1 row)

But the single most confusing aspect that might trip you up coming from other programming languages is the fact that PostgreSQL treats multi-dimensional arrays as a single matrix, not an array of arrays.

-- wrong dimensionality 
SELECT (ARRAY[[1,2],[3,4]])[1];
 array
--------
 (null)
(1 row)  

While in other languages you would expect {1,2} as a result, PostgreSQL can't give it to you. It tries to return a single cell and, because the subscript is incomplete, returns NULL.

And to paraphrase Fletcher's "Double Bubble" analogy (which also went very wrong — extra points for getting the reference), you can't fix this by using slice notation.

SELECT ('{{1,2},{3,4}}'::int[])[1:1];
  int4
---------
 {{1,2}}
(1 row)

The only way to solve this puzzle is to unnest the slice and re-aggregate the results.

SELECT array_agg(val) FROM unnest(('{{1,2},{3,4}}'::int[])[1:1]) val;
 array_agg
-----------
 {1,2}
(1 row)

Just be aware that array_agg does not guarantee the order of the aggregated elements unless you use an ORDER BY clause. While it usually works with simple unnest queries, relying on implicit ordering can be risky.
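
If you prefer the ordering to be explicit rather than implicit, unnest supports WITH ORDINALITY (available since PostgreSQL 9.4):

SELECT array_agg(val ORDER BY ord)
FROM unnest(('{{1,2},{3,4}}'::int[])[1:1]) WITH ORDINALITY AS t(val, ord);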

The other alternative is to cast it to JSONB and back. Not pretty no matter how you look at it. If you need to work with complex multi-dimensional structures where each sub-array has independent meaning, just use JSONB. It will do exactly what you expect.

Indexing arrays

While you can put a B-tree index on an array column, it won't help you unless you are looking for whole-array equality or sorting whole arrays by their dictionary-style ordering.

-- which is bigger? B is correct
-- A: {1, 1000, 1000}
-- B: {2, 0}

That renders B-tree indexes useless for any real-world, element-level operations.

When working with arrays, you actually need GIN (Generalized Inverted Index). If a B-tree index is a phone book, GIN is the index at the back of the book: to query for one or more elements, PostgreSQL looks up every location where each element appears and then intersects those lists to find the rows that match.

CREATE INDEX posts_by_tags ON posts USING GIN (tags);

GIN indexes are designed for set operations, making presence the key feature while ignoring order. These are the array operators the default GIN operator class (array_ops) supports:

Containment @> - matches rows that include ALL of the selected items.

-- match ALL tags
tags @> '{urgent, bug}'

Overlap && - does the row include ANY of the selected items.

-- match bug or feature
tags && '{bug, feature}'

There's also <@ (is contained by) and = (equality), which should be easy to understand.
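
For completeness, a quick sketch of all three against the posts table indexed above (assuming tags is a text[] column):

-- contains BOTH urgent AND bug
SELECT id FROM posts WHERE tags @> '{urgent,bug}';
-- contains bug OR feature (or both)
SELECT id FROM posts WHERE tags && '{bug,feature}';
-- contains ONLY values from this set
SELECT id FROM posts WHERE tags <@ '{bug,feature,docs}';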

Two sides of ANY

This is where it gets interesting. You will often hear (from me included) that you shouldn't build dynamic SQL for IN lists and should use ANY instead, but there are some dangers you need to be aware of.

The ANY operator behaves very differently depending on which side of the comparison the array sits.

The advice to use ANY holds true when you are passing lists into the database. Instead of generating a query with 100 distinct parameters (WHERE id IN ($1, $2, ... $100)), which bloats your query cache and forces hard-parses, you should pass a single array parameter.

-- good: one parameter, one query plan
SELECT * FROM users WHERE id = ANY($1::int[]);

The trap is assuming this syntax is equally efficient when querying array columns. It is not. If you use ANY to check if a value exists inside a table column, you are effectively asking the database to loop.

-- bad: GIN does not support ANY; turning this into a seq scan
SELECT * FROM tickets WHERE 'feature' = ANY(tags);

When you write WHERE 'feature' = ANY(tags), you are not actually using an array operator. What you wrote is a scalar equality operator = applied inside a loop construct. Since the scalar = is not part of array_ops, the planner assumes the index cannot help and falls back to a sequential scan.

The correct way to rewrite the query is:

-- good again
SELECT * FROM tickets WHERE tags @> ARRAY['feature'];
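
If in doubt, let the planner tell you (the exact plan depends on table size and statistics):

EXPLAIN SELECT * FROM tickets WHERE 'feature' = ANY(tags);     -- typically a Seq Scan
EXPLAIN SELECT * FROM tickets WHERE tags @> ARRAY['feature'];  -- can use the GIN index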

Fast updates and the trade-off

Because a GIN index is built to work with sets, it is expensive to maintain. With a B-tree index, one row equals one index entry. In a GIN index, one row equals N index entries, where N is the number of elements in your array.

This leads to write amplification. To mitigate it, PostgreSQL defaults to a "fast update" mechanism: new entries are added to a pending list (an unsorted temporary buffer) and only merged into the main index structure later (typically during VACUUM).

Given the unsorted nature of that pending list, GIN is an exception to "VACUUM is a Lie": here VACUUM really does perform structural maintenance. While this keeps INSERT operations manageable, it can slow down SELECTs. Every time you query the index, PostgreSQL must scan the organised main index plus the entire messy pending list. If that list grows large, your query performance degrades.

If you operate a read-heavy workload with infrequent writes, you can disable fast updates to guarantee consistent read performance.

CREATE INDEX posts_by_tags ON posts USING GIN (tags) WITH (fastupdate = off);
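
If you would rather keep fastupdate enabled, you can still flush the pending list on demand instead of waiting for VACUUM (gin_clean_pending_list() is available since PostgreSQL 9.6):

-- merge the pending list of the index into the main GIN structure
SELECT gin_clean_pending_list('posts_by_tags'::regclass);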

Storage and modification

Now it's time to return to the document model. In PostgreSQL, rows are immutable (MVCC); there is no such thing as an "in-place update", and arrays are stored as atomic values. This results in a very uncomfortable truth — to modify a single element of an array, PostgreSQL must copy and rewrite the entire row.

UPDATE user_activity
SET event_ids = event_ids || 10001
WHERE user_id = 50;

Every single append rewrites the entire array, which in effect results in a rewrite of the entire row.

TOASTed

When an array grows large enough (over roughly 2 KB, see below), PostgreSQL automatically compresses it and/or moves it out of the row into a separate storage area using TOAST. While this keeps the data row lean, it turns array updates into a severe performance bottleneck.

The difference this introduces is subtle but comes with a big impact. While a standard MVCC update simply copies the row version on the main heap, updating a TOASTed array forces PostgreSQL to fetch all external chunks, decompress the entire object into memory, apply the change, and then recompress and write the new full-size blob back to the TOAST table. This turns a simple modification into a CPU- and I/O-intensive operation that rewrites the entire value rather than just the delta.

The threshold is derived from TOAST_TUPLES_PER_PAGE (default: 4).

Threshold: ~2 KB
Where does the ~2 KB value come from? The TOAST threshold is calculated so that at least four tuples can fit on a single 8 KB heap page; PostgreSQL uses this to balance row-level efficiency against the overhead of TOAST indirection.
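
To check whether TOAST is actually in play for a table, you can look at the size of its TOAST relation; a sketch using the user_activity table from the earlier example:

SELECT pg_size_pretty(pg_relation_size(reltoastrelid::regclass)) AS toast_size
FROM pg_class
WHERE relname = 'user_activity' AND reltoastrelid <> 0;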

The compression options

Before version 14, PostgreSQL relied on pglz — an algorithm prioritising compression ratio over speed. This made the "decompress-modify-compress" cycle of TOAST painful.

PostgreSQL 14 introduced LZ4 as an alternative:

ALTER TABLE articles ALTER COLUMN tags SET COMPRESSION lz4;

LZ4 is significantly faster for both compression and decompression, with only slightly lower compression ratios. If you are working with large arrays, switching to LZ4 is one of the easiest ways to reduce the CPU penalty of TOAST. Keep in mind that SET COMPRESSION only affects newly written values; existing data keeps its old compression until it is rewritten.
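
On PostgreSQL 14 or newer, you can check which algorithm actually compressed the stored values with pg_column_compression():

-- NULL means the value was not compressed at all (e.g. it stayed inline)
SELECT pg_column_compression(tags) AS algorithm, count(*)
FROM articles
GROUP BY 1;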

When a large array might make sense

You might have gained the impression that arrays are bad. When evaluating the use of arrays, the real question isn't how big the array is but how often you modify it. An array of 10,000 elements that you write once and only read for the rest of its lifecycle is a completely valid use case. An array of 50 elements that you append to on every incoming request is the real villain here.

If you combine this with compression, you might get an interesting mix.

DROP TABLE IF EXISTS compression_test;
CREATE TABLE compression_test (
    id serial PRIMARY KEY,
    compressed_floats float4[], 
    raw_floats float4[]
);

-- do not compress raw_floats
ALTER TABLE compression_test ALTER COLUMN raw_floats SET STORAGE EXTERNAL;

-- insert semi-random data with low cardinality
INSERT INTO compression_test (compressed_floats, raw_floats)
SELECT semi_random_arr, semi_random_arr
FROM (
    SELECT ARRAY(
        SELECT floor(random() * 50)::float4 
        FROM generate_series(1, 10000)
    ) as semi_random_arr
) as generator;

SELECT 
    'Compressed (EXTENDED)' as strategy,
    pg_size_pretty(pg_column_size(compressed_floats)::bigint) as size_on_disk
FROM compression_test
UNION ALL
SELECT 
    'Raw (EXTERNAL)',
    pg_size_pretty(pg_column_size(raw_floats)::bigint)
FROM compression_test;
       strategy        | size_on_disk
-----------------------+--------------
 Compressed (EXTENDED) | 15 kB
 Raw (EXTERNAL)        | 39 kB
(2 rows)  

Bulk loading with arrays

Up until now, it might seem that arrays don't actually bring many benefits. While they can have rough edges around storage, they are incredibly useful for transport.

The fastest way to insert 5,000 rows isn't a loop in your application, and it's definitely not a massive VALUES (...), (...) string. It's unnest.

INSERT INTO measurements (sensor_id, value, captured_at)
SELECT * FROM unnest(
    $1::int[],        -- array of sensor IDs
    $2::float[],      -- array of values
    $3::timestamptz[] -- array of timestamps
);

All that is needed is one network round trip and one query to parse and plan; PostgreSQL expands the arrays into rows internally for you. The same pattern works for UPSERTs (INSERT ... ON CONFLICT) and MERGE, as sketched below.
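
A sketch of the same pattern as an UPSERT, assuming a unique constraint on (sensor_id, captured_at):

INSERT INTO measurements (sensor_id, value, captured_at)
SELECT * FROM unnest(
    $1::int[],
    $2::float[],
    $3::timestamptz[]
)
ON CONFLICT (sensor_id, captured_at) DO UPDATE
    SET value = EXCLUDED.value;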

The cases for special arrays

Standard PostgreSQL arrays are built on polymorphic types (anyarray). This is a powerful feature: a single function definition can operate on many different data types, handling integers, strings, timestamps, and custom types equally well. That generality has a cost, though, and if your data fits a specific type, you can unlock significant performance gains by using specialised extensions.

The intarray extension

If you are dealing exclusively with 4-byte integers (int4/integer), the built-in array operations are leaving performance on the table. The intarray extension provides specialised functions and index operators that are significantly faster than the generic implementation.

To use it, you must explicitly enable it:

CREATE EXTENSION IF NOT EXISTS intarray;

The difference in developer ergonomics is immediate. To sort an array in standard SQL, you are forced to unnest, order, and re-aggregate. With intarray, you get native functions like sort() and uniq().

-- standard arrays
SELECT array_agg(val ORDER BY val)
FROM unnest('{3, 1, 2}'::int[]) val;

-- intarray
SELECT sort('{3, 1, 2}'::int[]);

Beyond raw management functions, it introduces a specialised query syntax that simplifies complex boolean logic. Instead of chaining multiple overlap (&&) and containment (@>) checks, you can express your requirements in a single "query string" using the @@ operator.

-- standard arrays
SELECT * FROM staff
WHERE available_days @> '{1}'          -- must include Mon
  AND available_days && '{6, 7}'       -- must include Sat OR Sun
  AND NOT (available_days @> '{2}');   -- must NOT include Tue

-- intarray
SELECT * FROM staff
WHERE available_days @@ '1 & (6 | 7) & !2';
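
To get the same speed-up on the index side, intarray ships its own GIN operator class, gin__int_ops, which also accelerates the @@ query syntax (the index name below is just an illustration):

CREATE INDEX staff_by_days ON staff USING GIN (available_days gin__int_ops);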

The only catch is the type restriction: intarray is strictly limited to signed 32-bit integers (and its functions reject arrays containing NULLs). If your values exceed roughly 2.1 billion, you are back where you started.

AI with pgvector

You cannot talk about arrays in 2024 without mentioning pgvector. While it markets itself as a "vector store", internally it is simply an array of floats with a different mathematical focus.

Standard array operators are binary yes/no checks: they care about exact matches (overlap &&, containment @>). Vectors are all about fuzzy distance (cosine <=>, Euclidean <->).

If you are building search or recommendation features, pgvector allows you to treat your array column not as a list of "facts" (tag A, tag B), but as coordinates in semantic space.

Nevertheless, the architectural decision is exactly the same as using a standard array: you are trading strict structure for convenience. Since there is no way to "join" two rows based on how similar they are, you store the vector directly on the row. You accept a larger table size in exchange for the ability to ask, "What is close to this?".
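
A minimal sketch, assuming the pgvector extension is installed and a hypothetical items table with 3-dimensional embeddings:

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE items (
    id        bigserial PRIMARY KEY,
    embedding vector(3)
);

INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[2,3,4]');

-- "what is close to this?": nearest neighbours by cosine distance
SELECT id FROM items
ORDER BY embedding <=> '[0.9,2.1,3.0]'
LIMIT 5;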