Postgres 100M+ row performance: Seven principles
Past 100M rows, 'add an index' stops working. Seven field-tested Postgres principles from three of our products: indexing, partitioning, vacuum, pooling, MVs, statistics, monitoring.
If one of your Postgres tables is approaching 100 million rows and the classic advice has run out, this post shares the seven principles we distilled from three of our own products. The “Postgres just scales” line started cracking the moment three of our products crossed 100 million rows in a single table. Postgres alone does scale — but without the right index choice, partitioning, vacuum tuning, and connection management, the scaling stops at a wall around 50-80 million rows. The world on the other side of that wall is the place where “read your EXPLAIN ANALYZE output and add the missing index” stops being enough.
This post shares seven principles distilled from our customer engagements and our own product work. Each one is well known on its own — but unless they run together, they fall short of what 100M+ rows demand. Throughout the post we share concrete Postgres settings, real production stories, and practical thresholds for when each principle starts mattering. Each principle could be its own deep post; what we leave here is the 20-30 sentence summary we keep coming back to in the field.
We touched on some of these principles in our multi-tenant SaaS isolation post; this one goes deeper at the single-table scale.
Principle 1: Index choice — going beyond B-tree
Past 100M rows, B-tree stops being the right index by default. Three alternatives need disciplined evaluation.
B-tree. The default for equality and range queries. Right for primary keys, foreign keys, and frequent WHERE columns. But B-tree disk cost grows linearly with row count; a 100M-row B-tree index sits at 5-15 GB. That means it does not fit in memory; lookups fall to disk I/O.
BRIN (Block Range INdex). Excellent for ordered data — especially time-series tables. When the table’s physical order matches the column you query, a BRIN index runs at 1-2% of B-tree size. On a 100M-row log table, B-tree at 12 GB versus BRIN at 80-150 MB; a giant gain rarely talked about. Single condition: the table is physically ordered by the column (time-series gets this for free).
GIN. For full-text search, jsonb, and array columns. With Postgres tsvector and the jsonb_path_ops operator class, GIN delivers sub-second lookups at 100M+ rows. B-tree does not work for these column types; there is no real alternative to GIN.
Partial index. If queries hit 5-15% of the table, a partial index (WHERE status = 'active') is 5-10x faster than a full B-tree. Disk cost drops proportionally. Across half our customer audits, a single “why are we indexing every row” question saved 70-80% disk on the index.
Field rule: past 100M rows, half your existing B-tree indexes are the wrong choice. In the first week of an audit, drop indexes with low idx_scan counts in pg_stat_user_indexes — unused indexes slow down INSERTs and UPDATEs.
Principle 2: Partitioning strategy
Past 100M rows, managing a single table becomes impractical. VACUUM times stretch, index maintenance gets expensive, the query planner becomes unstable. The fix is partitioning — but the right strategy depends on the table.
Range partitioning. The canonical pick for time-series tables. Monthly (or weekly) partitions make hot/cold separation physical. Old monthly partitions can run BRIN plus read-only flag, while the hot partition runs B-tree plus aggressive autovacuum tuning. The pg_partman extension automates partition creation and drop; install it instead of doing it manually.
List partitioning. For separation by tenant ID, region, or category. In a SaaS with 50-200 tenants, list partitioning by tenant ensures one tenant’s queries do not affect the others. Past 200 tenants the partition count becomes unmanageable; that is when you switch to hash partitioning.
Hash partitioning. For evenly distributed write load. Postgres hash partitions perform best at 8, 16, or 32 partitions. Lock contention on a single table dissipates under high-concurrency INSERT load. Trade-off: dropping a single partition (to delete old data) is not as clean as range; every partition needs review.
Practical decision: if you have a timestamp column, use range. If you need tenant or region isolation under 200, use list. If you need even write distribution, use hash. If you need both, range plus hash sub-partitioning is supported from Postgres 11 onwards.
Principle 3: VACUUM and autovacuum tuning
For 100M-row tables, default autovacuum settings are not enough. autovacuum_vacuum_scale_factor (default 0.2) means “do not run VACUUM until 20% of the table is dead tuples” — at 100M rows, that is a 20M dead-tuple threshold. The table is already slow well before that threshold.
Field tuning:
autovacuum_vacuum_scale_factor = 0.05(table-level override for large tables)autovacuum_vacuum_threshold = 10000(default 50)autovacuum_vacuum_cost_limit = 2000(default 200, more aggressive vacuum)autovacuum_vacuum_cost_delay = 10ms(default 20ms — Postgres 12+)autovacuum_max_workers = 6(default 3, for parallel partitions)
Per-table override:
ALTER TABLE events SET (
autovacuum_vacuum_scale_factor = 0.05,
autovacuum_vacuum_cost_limit = 2000,
autovacuum_analyze_scale_factor = 0.02
);
Also: monitor the HOT (Heap-Only Tuple) update ratio in pg_stat_user_tables. If HOT ratio sits below 80%, lower the fillfactor (ALTER TABLE x SET (fillfactor = 80)); updates stay on the same page, and index bloat drops.
VACUUM FULL is the tool to avoid, but sometimes it is necessary: if the table passes 30% bloat, run an online repack with the pg_repack extension. VACUUM FULL takes an exclusive lock; in production that is 10-20 minutes of downtime.
Principle 4: Connection pooling — PgBouncer in transaction mode
Postgres opens a process for every connection; default max_connections = 100. Once the application tier opens 200-500 connections, process explosion begins; each process eats 5-10 MB of RAM, context switches climb.
Solution: PgBouncer (or PgCat) in front, in transaction mode. PgBouncer accepts clients fast and only holds a Postgres connection for the duration of a transaction. Typical gain: 500 application connections served by 25-40 Postgres connections.
Transaction mode rules:
- No session-level state (
SET LOCALyes,SETno). - Watch out for prepared statement support (PgBouncer 1.21+ supports it; older versions are flaky).
LISTEN/NOTIFYdoes not work; if you need it, the application tier needs a separate channel.
Practical config:
pool_mode = transaction
default_pool_size = 25
reserve_pool_size = 5
max_client_conn = 1000
server_reset_query = DISCARD ALL
In three customer engagements, switching to PgBouncer dropped p95 latency by 30-45%; the cause was the Postgres CPU shifting from open/close handshakes to actually running queries. This is the reason “current connection count” is one of the questions on our discovery call prep list.
Principle 5: Materialized views vs CDC + downstream
Pulling heavy analytical queries off the OLTP table at 100M+ rows comes down to two options: a materialized view or change data capture (CDC) into a separate analytical store downstream.
Materialized view. Inside the same Postgres instance, refreshed hourly or daily with REFRESH MATERIALIZED VIEW CONCURRENTLY. Advantage: single system, no extra infrastructure. Disadvantage: refresh is expensive (a view over 100M rows takes 5-15 minutes); during refresh, query plans become bumpy. CONCURRENTLY does not block reads, but it is slower.
When to pick: refresh interval 1+ hour, downstream query frequency low, no team capacity for a separate analytical store.
CDC + downstream. Debezium captures changes through Postgres logical replication slots; events flow through Kafka or Pulsar into Snowflake, BigQuery, or ClickHouse. The OLTP table stays untouched. Advantage: real-time analytical, infinite scale, zero OLTP impact. Disadvantage: separate stack (Debezium + Kafka + warehouse), team capacity required.
When to pick: real-time dashboard requirement, high downstream query frequency, engineering team of 2+ people.
Field observation: in tables between 100M and 500M rows, materialized views are usually enough; past 500M, CDC + downstream becomes necessary. Some teams run both: materialized view for the hot path, warehouse for deep analytical.
Principle 6: Statistics — feeding the query planner correctly
The Postgres query planner makes plans by reading table statistics. The default statistics granularity (default_statistics_target = 100) is insufficient at 100M rows. Skewed distributions (e.g., the tenant ID column heavily concentrated on the same 5-10 tenants) cause frequent wrong-index choices.
Tuning:
-- Raise statistics target for frequently queried columns
ALTER TABLE events ALTER COLUMN tenant_id SET STATISTICS 1000;
ALTER TABLE events ALTER COLUMN created_at SET STATISTICS 500;
-- Refresh statistics
ANALYZE events;
Values of 500-1000 instead of the default 100:
- The histogram bucket count rises; the planner reads skew more accurately.
- Multi-column statistics (
CREATE STATISTICS) capture correlations between columns. - ANALYZE takes longer but at 100M rows it is in the seconds range, no problem.
Multi-column statistics example:
CREATE STATISTICS events_tenant_date
ON tenant_id, created_at
FROM events;
This exposes correlations like “tenant X is always queried within the last 30 days” to the planner. In three customer engagements, this single line flipped the planner from the wrong index to the right partition; p99 latency dropped from 4-8 seconds to 200-400 ms.
ANALYZE cadence: when statistics target rises, manual ANALYZE deserves a thought too. The autoanalyze threshold (default 10%) at 100M rows means 10M changes — too late. Either drop the autoanalyze threshold to 2% per-table, or run a daily ANALYZE through cron.
Principle 7: Monitoring — knowing what you watch
Tuning without measurement does not work. Postgres built-in extensions give visibility on three axes:
pg_stat_statements. Which queries take what time. p95/p99 outliers, query patterns, average row counts, cache hit ratio. On a 100M-row table, the slowest 20 queries should account for 80% of the watch list. Reset weekly with pg_stat_statements_reset() to keep a clean baseline.
pg_stat_user_indexes and pg_stat_user_tables. Which index is used, how many seq scans on each table, dead tuple ratios, last_vacuum and last_analyze timestamps. In a monthly audit, indexes with idx_scan = 0 are drop candidates.
pg_stat_io (Postgres 16+). Where I/O actually goes — relation reads, WAL writes, autovacuum activity? Visibility at the I/O pattern level, not just the table level. A goldmine for diagnosing disk I/O bottlenecks.
Also: auto_explain automatically logs slow query plans. With auto_explain.log_min_duration = 500ms, every query past 500 ms gets its plan logged for later analysis; gold for post-hoc investigation.
External monitoring (Datadog, Grafana, pganalyze) turns these metrics into time series; required for alerting. p95 latency, replication lag, dead tuple ratio, and cache hit ratio are the four core alarms. We described how this connects to a broader post-launch monitoring discipline elsewhere.
Putting it all together
Each of the seven principles sounds like a cliché alone; together they make the difference for teams living past 100M rows. The practical order:
- Stand up monitoring first (Principle 7). Without knowing what is slow, you cannot tune.
- Run an index audit (Principle 1). Drop unused indexes, surface partial and BRIN opportunities.
- Tune statistics (Principle 6). Statistics target plus multi-column statistics. Without feeding the planner correctly, index optimization is half done.
- Connection pooling (Principle 4). PgBouncer in transaction mode. Solve the process explosion.
- Vacuum tuning (Principle 3). Per-table scale factor and cost limit overrides.
- Partitioning (Principle 2). Inevitable as you approach the 200M row mark.
- CDC or materialized view (Principle 5). When analytical workload starts disturbing OLTP.
The first three principles take 1-2 weeks; the next four take 4-12 weeks. We use this order as our playbook with customers living past 100M rows; for teams with thinner engineering capacity we lean on the operational patterns described on our enterprise systems pillar page.
Closing
Postgres “scales” is true — but the scaling requires seven tuning calls. Each call is well known, none is rocket science. The hard part is knowing which one to apply when, and in what order; that is the experience question. If you are stuck past 100M rows where the standard advice no longer works, reach out for a discovery call — [email protected]. We share our audit checklist and decide together which three or four calls to make in the first week.