# Database Reliability Engineering: The Patterns That Actually Prevent 3 AM Pages
Here's a number that should terrify you: 87% of production outages involving databases aren't caused by hardware failures or data corruption. They're caused by connection exhaustion, unmonitored replica lag, and runaway queries that nobody thought to circuit-break. We learned this the hard way after our PostgreSQL primary hit max_connections during a traffic spike that was barely 2x normal load.
The fix wasn't more hardware. It was better patterns. Here's every database reliability pattern we run in production today.
Connection Pooling with PgBouncer: Stop Wasting Connections
PostgreSQL forks a backend process for every connection. At 500 concurrent connections, you're burning 500 processes' worth of memory on connection overhead alone, and most of those connections sit idle 95% of the time.
PgBouncer sits between your application and PostgreSQL, multiplexing hundreds of application connections into a handful of actual database connections.
Production PgBouncer Configuration
; /etc/pgbouncer/pgbouncer.ini
[databases]
myapp = host=127.0.0.1 port=5432 dbname=myapp_production
myapp_readonly = host=replica01 port=5432 dbname=myapp_production
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
; Transaction pooling — the sweet spot for most apps
pool_mode = transaction
; Pool sizing — this is where most people get it wrong
default_pool_size = 25
min_pool_size = 5
reserve_pool_size = 5
reserve_pool_timeout = 3
; Connection limits
max_client_conn = 1000
max_db_connections = 50
; Timeouts that actually matter
server_idle_timeout = 300
server_lifetime = 3600
client_idle_timeout = 600
query_timeout = 30
query_wait_timeout = 120
; Logging for observability
log_connections = 1
log_disconnections = 1
log_pooler_errors = 1
stats_period = 30
Key Sizing Decision
The default_pool_size should be roughly max_connections / number_of_pools. If PostgreSQL allows 100 connections and you have 3 application pools plus 1 reserve, set default_pool_size = 25. We've seen teams set this to 100+ and wonder why they're still hitting connection limits.
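The application side of this equation is simpler: point your connection string at PgBouncer on port 6432 and let the pool do the multiplexing. A minimal sketch, assuming psycopg2 and the pool names from the config above (the user, password, and query are illustrative):
import psycopg2
# Connect to PgBouncer on 6432, not PostgreSQL on 5432. "myapp" is the pool name
# from the [databases] section above; the credentials here are placeholders.
conn = psycopg2.connect("host=127.0.0.1 port=6432 dbname=myapp user=app_user password=example")
with conn.cursor() as cur:
    cur.execute("UPDATE users SET last_seen = now() WHERE id = %s", (42,))
conn.commit()
conn.close()
One caveat with pool_mode = transaction: session state such as SET, prepared statements, advisory locks, and LISTEN/NOTIFY doesn't carry across transactions, because each transaction can be served by a different server connection.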
Monitoring PgBouncer Health
#!/bin/bash
# pgbouncer-health.sh — run every 30s via cron or systemd timer
STATS=$(psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -At -c "SHOW POOLS;")
# Extract active and waiting client connections (cl_active and cl_waiting are columns 3 and 4)
ACTIVE=$(echo "$STATS" | awk -F'|' '{sum += $3} END {print sum+0}')
WAITING=$(echo "$STATS" | awk -F'|' '{sum += $4} END {print sum+0}')
# Alert if waiting connections exceed threshold
if [ "$WAITING" -gt 10 ]; then
curl -s -X POST "https://ntfy.example.com/db-alerts" \
-H "Priority: 4" \
-d "PgBouncer: $WAITING clients waiting for connections (active: $ACTIVE)"
fi
# Push to Prometheus via pushgateway
cat <<PROM | curl -s --data-binary @- http://pushgateway:9091/metrics/job/pgbouncer
pgbouncer_active_connections $ACTIVE
pgbouncer_waiting_connections $WAITING
PROM
Query Circuit Breakers: Kill Runaway Queries Before They Kill You
A single unoptimized query doing a sequential scan on a 200GB table will consume enough I/O to starve every other query on the server. We implement circuit breakers at three layers.
Layer 1: PostgreSQL Statement Timeout
-- Global safety net: no query runs longer than 30 seconds
ALTER SYSTEM SET statement_timeout = '30s';
SELECT pg_reload_conf();  -- ALTER SYSTEM only takes effect after a config reload
-- For reporting connections, allow longer
ALTER ROLE reporting SET statement_timeout = '300s';
-- Per-transaction override for known long operations
SET LOCAL statement_timeout = '120s';
Layer 2: Application-Level Circuit Breaker
import time
import threading
from contextlib import contextmanager
class CircuitOpenError(Exception):
    """Raised when the breaker is open and queries are being rejected."""
class QueryCircuitBreaker:
"""Circuit breaker that trips after repeated slow queries."""
CLOSED = "closed" # Normal operation
OPEN = "open" # Failing, reject queries
HALF_OPEN = "half_open" # Testing recovery
def __init__(self, failure_threshold=5, recovery_timeout=30, slow_query_ms=2000):
self.failure_threshold = failure_threshold
self.recovery_timeout = recovery_timeout
self.slow_query_ms = slow_query_ms
self.failure_count = 0
self.last_failure_time = 0
self.state = self.CLOSED
self._lock = threading.Lock()
@contextmanager
def protect(self, cursor, query, params=None):
if self.state == self.OPEN:
if time.time() - self.last_failure_time > self.recovery_timeout:
self.state = self.HALF_OPEN
else:
raise CircuitOpenError(
f"Circuit breaker open. Recovery in "
f"{self.recovery_timeout - (time.time() - self.last_failure_time):.0f}s"
)
start = time.monotonic()
try:
cursor.execute(query, params)
elapsed_ms = (time.monotonic() - start) * 1000
if elapsed_ms > self.slow_query_ms:
self._record_failure()
else:
self._record_success()
yield cursor
except Exception:
self._record_failure()
raise
def _record_failure(self):
with self._lock:
self.failure_count += 1
self.last_failure_time = time.time()
if self.failure_count >= self.failure_threshold:
self.state = self.OPEN
def _record_success(self):
with self._lock:
self.failure_count = 0
            self.state = self.CLOSED
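For context, here's what the breaker looks like at a call site. This is a sketch, assuming a psycopg2 connection named conn; the query and the fallback behavior are illustrative:
breaker = QueryCircuitBreaker(failure_threshold=5, recovery_timeout=30, slow_query_ms=2000)
def recent_orders(conn, user_id):
    with conn.cursor() as cur:
        try:
            with breaker.protect(cur, "SELECT id, total FROM orders WHERE user_id = %s", (user_id,)):
                return cur.fetchall()
        except CircuitOpenError:
            return []  # Fail fast and degrade gracefully instead of piling onto a struggling database
When the breaker is open, requests fail in milliseconds at the application layer instead of queueing behind a database that's already drowning.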
Layer 3: pg_stat_statements Watchdog
-- Requires the pg_stat_statements extension (shared_preload_libraries + CREATE EXTENSION)
-- Find queries consuming the most total time in the last interval
SELECT
substring(query, 1, 80) AS short_query,
calls,
round(total_exec_time::numeric, 2) AS total_ms,
round(mean_exec_time::numeric, 2) AS mean_ms,
rows
FROM pg_stat_statements
WHERE mean_exec_time > 1000 -- queries averaging over 1 second
ORDER BY total_exec_time DESC
LIMIT 10;
We run this query every 60 seconds and alert when any single query's mean_exec_time doubles compared to the previous hour's baseline.
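The watchdog itself is a short script. Here's a sketch of the comparison loop, assuming psycopg2 and the same ntfy topic used for the PgBouncer alerts; to keep it brief, the baseline is just the previous reading held in memory rather than a true hourly rolling baseline:
import time
import psycopg2
import requests
DSN = "host=127.0.0.1 port=5432 dbname=myapp_production user=monitor"  # illustrative
ALERT_URL = "https://ntfy.example.com/db-alerts"
WATCH_QUERY = """
    SELECT queryid, substring(query, 1, 80) AS short_query, mean_exec_time
    FROM pg_stat_statements
    WHERE mean_exec_time > 1000
"""
baselines = {}  # queryid -> last observed mean_exec_time
def check_once(conn):
    with conn.cursor() as cur:
        cur.execute(WATCH_QUERY)
        for queryid, short_query, mean_ms in cur.fetchall():
            previous = baselines.get(queryid)
            if previous and mean_ms > 2 * previous:  # mean latency doubled vs. baseline
                requests.post(ALERT_URL, data=(
                    f"Query regression: {short_query} now at {mean_ms:.0f}ms "
                    f"(baseline {previous:.0f}ms)"))
            baselines[queryid] = mean_ms
if __name__ == "__main__":
    conn = psycopg2.connect(DSN)
    conn.autocommit = True
    while True:
        check_once(conn)
        time.sleep(60)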
Replica Lag Monitoring: The Silent Data Killer
Replica lag is insidious because reads "work" — they just return stale data. We've seen cases where a user updates their profile, refreshes the page (which hits a replica), and sees old data. Support tickets ensue.
Prometheus Exporter Query
# prometheus-postgres-exporter custom query
pg_replication:
query: |
SELECT
client_addr,
state,
EXTRACT(EPOCH FROM replay_lag) AS replay_lag_seconds,
pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
metrics:
- client_addr:
usage: "LABEL"
- state:
usage: "LABEL"
- replay_lag_seconds:
usage: "GAUGE"
description: "Replication replay lag in seconds"
- replay_lag_bytes:
usage: "GAUGE"
description: "Replication replay lag in bytes"Alerting Rules
# Prometheus alerting rules
groups:
- name: database_replication
rules:
- alert: ReplicaLagHigh
expr: pg_replication_replay_lag_seconds > 5
for: 2m
labels:
severity: warning
annotations:
summary: "Replica {{ $labels.client_addr }} lag is {{ $value }}s"
- alert: ReplicaLagCritical
expr: pg_replication_replay_lag_seconds > 30
for: 1m
labels:
severity: critical
annotations:
summary: "Replica {{ $labels.client_addr }} lag CRITICAL at {{ $value }}s"
- alert: ReplicaDisconnected
expr: pg_stat_replication_connected == 0
for: 30s
labels:
          severity: critical
Application-Level Lag-Aware Routing
import time
class LagAwareRouter:
"""Route reads to replicas only when lag is acceptable."""
def __init__(self, primary_dsn, replica_dsns, max_acceptable_lag_s=2.0):
self.primary_dsn = primary_dsn
self.replica_dsns = replica_dsns
self.max_lag = max_acceptable_lag_s
self._replica_lag = {} # Updated by monitoring thread
def get_connection(self, read_only=False, require_fresh=False):
if not read_only or require_fresh:
return self._connect(self.primary_dsn)
healthy_replicas = [
dsn for dsn in self.replica_dsns
if self._replica_lag.get(dsn, float('inf')) < self.max_lag
]
if not healthy_replicas:
return self._connect(self.primary_dsn) # Fallback
        # Spread reads across healthy replicas (a cheap pseudo-random pick, not strict round-robin)
        return self._connect(healthy_replicas[hash(time.time()) % len(healthy_replicas)])
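Two pieces are deliberately left out of the class: _connect is a thin wrapper around your driver's connect call, and _replica_lag has to be kept fresh by a background thread. A sketch of that monitoring loop, assuming psycopg2 and measuring lag with pg_last_xact_replay_timestamp() (one reasonable choice among several):
import threading
import time
import psycopg2
LAG_SQL = "SELECT COALESCE(EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())), 0)"
def start_lag_monitor(router, poll_interval_s=1.0):
    """Background thread: measure replay lag on each replica and update the router's lag map."""
    def poll_loop():
        while True:
            for dsn in router.replica_dsns:
                try:
                    conn = psycopg2.connect(dsn, connect_timeout=2)
                    try:
                        with conn.cursor() as cur:
                            cur.execute(LAG_SQL)
                            router._replica_lag[dsn] = float(cur.fetchone()[0])
                    finally:
                        conn.close()
                except Exception:
                    # An unreachable replica counts as infinitely lagged, so the router skips it
                    router._replica_lag[dsn] = float("inf")
            time.sleep(poll_interval_s)
    threading.Thread(target=poll_loop, daemon=True).start()
The require_fresh flag is how we solve the profile-update problem from earlier: any read that must see the caller's own write goes to the primary regardless of replica health.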
Automated Failover: Patroni in Production
Manual failover at 3 AM means 15-45 minutes of downtime while someone wakes up, VPNs in, and runs the promotion. Patroni automates this to under 30 seconds.
Minimal Patroni Configuration
# /etc/patroni/patroni.yml
scope: myapp-cluster
name: pg-node-1
restapi:
listen: 0.0.0.0:8008
connect_address: 10.0.1.10:8008
etcd3:
hosts: 10.0.1.20:2379,10.0.1.21:2379,10.0.1.22:2379
bootstrap:
dcs:
ttl: 30
loop_wait: 10
retry_timeout: 10
maximum_lag_on_failover: 1048576 # 1MB — don't promote lagging replicas
postgresql:
use_pg_rewind: true
parameters:
max_connections: 100
shared_buffers: 4GB
wal_level: replica
max_wal_senders: 5
max_replication_slots: 5
postgresql:
listen: 0.0.0.0:5432
connect_address: 10.0.1.10:5432
data_dir: /var/lib/postgresql/16/main
authentication:
superuser:
username: postgres
replication:
      username: replicator
Testing Failover (Do This Monthly)
# Trigger a controlled switchover
patronictl -c /etc/patroni/patroni.yml switchover myapp-cluster
# Verify the new primary
patronictl -c /etc/patroni/patroni.yml list
# Check application connectivity
curl -s http://localhost:8008/health | jq .
We run automated failover drills monthly. The first time we ran one, we discovered our connection strings were hardcoded to the primary's IP instead of going through PgBouncer. That single drill saved us from a real outage two weeks later.
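The broader lesson from that drill: nothing should remember the primary's address. Patroni's REST API is the source of truth, and its /primary endpoint returns HTTP 200 only on the current leader, so discovery takes a few lines of code. A sketch using requests (the node list is illustrative):
import requests
PATRONI_NODES = ["10.0.1.10", "10.0.1.11", "10.0.1.12"]  # Patroni REST API hosts, port 8008
def find_primary(nodes=PATRONI_NODES, timeout_s=2.0):
    """Return the first node whose Patroni REST API reports it as the current primary."""
    for host in nodes:
        try:
            resp = requests.get(f"http://{host}:8008/primary", timeout=timeout_s)
        except requests.RequestException:
            continue  # node unreachable; keep looking
        if resp.status_code == 200:
            return host
    raise RuntimeError("No Patroni node currently reports itself as primary")
The same endpoint is what an HAProxy health check (or your service discovery layer) can point at when routing write traffic.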
Putting It All Together: The Reliability Stack
| Layer | Tooling | Failure Mode It Catches | What Happens When It Trips |
|-------|---------|--------------------------|----------------------------|
| Connection pooling | PgBouncer (transaction pooling) | Connection exhaustion during traffic spikes | Clients queue briefly; waiting-connection alert fires |
| Query circuit breakers | statement_timeout, app-level breaker, pg_stat_statements watchdog | Runaway queries starving I/O | Slow queries are killed or rejected fast |
| Replica lag monitoring | postgres_exporter, Prometheus alerts, lag-aware routing | Stale reads from lagging replicas | Reads fall back to the primary |
| Automated failover | Patroni + etcd | Primary failure | A healthy replica is promoted in under 30 seconds |
None of these patterns are revolutionary on their own. The reliability comes from layering them together so that when one layer fails, the next catches it.
What's Next
If your database setup currently consists of "a single PostgreSQL instance with nightly pg_dump," you're one bad query away from a very long night. Start with PgBouncer — it takes 20 minutes to deploy and immediately reduces your connection exhaustion risk by 10x.
Need help designing a database reliability stack for your infrastructure? Our team has deployed these patterns across dozens of production PostgreSQL clusters. [Talk to us at techsaas.cloud/contact](https://techsaas.cloud/contact) — we'll audit your current setup and build a reliability roadmap that actually prevents those 3 AM pages.