Yash Pritwani · 6 min read

# Database Reliability Engineering: The Patterns That Actually Prevent 3 AM Pages

Here's a number that should terrify you: 87% of production outages involving databases aren't caused by hardware failures or data corruption. They're caused by connection exhaustion, unmonitored replica lag, and runaway queries that nobody thought to circuit-break. We learned this the hard way after our PostgreSQL primary hit max_connections during a traffic spike that was barely 2x normal load.

The fix wasn't more hardware. It was better patterns. Here's every database reliability pattern we run in production today.

Connection Pooling with PgBouncer: Stop Wasting Connections

PostgreSQL forks a process per connection. At 500 concurrent connections, you're burning 500 processes' worth of memory just on connection overhead, and most of those connections are idle 95% of the time.

PgBouncer sits between your application and PostgreSQL, multiplexing hundreds of application connections into a handful of actual database connections.

Production PgBouncer Configuration

; /etc/pgbouncer/pgbouncer.ini
[databases]
myapp = host=127.0.0.1 port=5432 dbname=myapp_production
myapp_readonly = host=replica01 port=5432 dbname=myapp_production

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt

; Transaction pooling — the sweet spot for most apps
pool_mode = transaction

; Pool sizing — this is where most people get it wrong
default_pool_size = 25
min_pool_size = 5
reserve_pool_size = 5
reserve_pool_timeout = 3

; Connection limits
max_client_conn = 1000
max_db_connections = 50

; Timeouts that actually matter
server_idle_timeout = 300
server_lifetime = 3600
client_idle_timeout = 600
query_timeout = 30
query_wait_timeout = 120

; Logging for observability
log_connections = 1
log_disconnections = 1
log_pooler_errors = 1
stats_period = 30

Key Sizing Decision

The default_pool_size should be roughly max_connections / number_of_pools. If PostgreSQL allows 100 connections and you have 3 application pools plus 1 reserve, set default_pool_size = 25. We've seen teams set this to 100+ and wonder why they're still hitting connection limits.
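
Before you trust that arithmetic, it helps to see how many of your connections are actually doing work at peak. Below is a minimal sketch that snapshots pg_stat_activity; psycopg2 and the DSN are assumptions for illustration, so adapt both to your driver and environment.

```python
# Sketch: snapshot connection states so pool sizing reflects real demand.
# psycopg2 and the DSN below are illustrative assumptions.
import psycopg2

def connection_usage(dsn):
    """Return a {state: count} snapshot from pg_stat_activity."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT coalesce(state, 'background'), count(*) "
                "FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC"
            )
            return dict(cur.fetchall())

if __name__ == "__main__":
    usage = connection_usage("host=127.0.0.1 dbname=myapp_production")
    # Typically 'idle' dwarfs 'active': size default_pool_size for the active
    # number, not for the total client count.
    print(usage)
```

If the active count at peak sits far below max_connections, the pool can be smaller than you think.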

Monitoring PgBouncer Health

#!/bin/bash
# pgbouncer-health.sh — run every 30s via a systemd timer (cron can't go below one minute)

# SHOW POOLS output with -A is pipe-separated: database | user | cl_active | cl_waiting | ...
STATS=$(psql -h 127.0.0.1 -p 6432 -U pgbouncer -d pgbouncer -At -c "SHOW POOLS;")

# Sum active and waiting client connections across all pools
ACTIVE=$(echo "$STATS" | awk -F'|' '{sum += $3} END {print sum+0}')
WAITING=$(echo "$STATS" | awk -F'|' '{sum += $4} END {print sum+0}')

# Alert if waiting connections exceed threshold
if [ "$WAITING" -gt 10 ]; then
  curl -s -X POST "https://ntfy.example.com/db-alerts" \
    -H "Priority: 4" \
    -d "PgBouncer: $WAITING clients waiting for connections (active: $ACTIVE)"
fi

# Push to Prometheus via pushgateway
cat <<PROM | curl -s --data-binary @- http://pushgateway:9091/metrics/job/pgbouncer
pgbouncer_active_connections $ACTIVE
pgbouncer_waiting_connections $WAITING
PROM

Query Circuit Breakers: Kill Runaway Queries Before They Kill You

A single unoptimized query doing a sequential scan on a 200GB table will consume enough I/O to starve every other query on the server. We implement circuit breakers at three layers.

Layer 1: PostgreSQL Statement Timeout

-- Global safety net: no query runs longer than 30 seconds
ALTER SYSTEM SET statement_timeout = '30s';
SELECT pg_reload_conf();  -- ALTER SYSTEM changes only take effect after a reload

-- For reporting connections, allow longer
ALTER ROLE reporting SET statement_timeout = '300s';

-- Per-transaction override for known long operations
BEGIN;
SET LOCAL statement_timeout = '120s';
-- ... the long-running statement goes here ...
COMMIT;

Layer 2: Application-Level Circuit Breaker

import time
import threading
from contextlib import contextmanager


class CircuitOpenError(Exception):
    """Raised when the breaker is open and queries are being rejected."""


class QueryCircuitBreaker:
    """Circuit breaker that trips after repeated slow queries."""
    
    CLOSED = "closed"      # Normal operation
    OPEN = "open"          # Failing, reject queries
    HALF_OPEN = "half_open"  # Testing recovery
    
    def __init__(self, failure_threshold=5, recovery_timeout=30, slow_query_ms=2000):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.slow_query_ms = slow_query_ms
        self.failure_count = 0
        self.last_failure_time = 0
        self.state = self.CLOSED
        self._lock = threading.Lock()
    
    @contextmanager
    def protect(self, cursor, query, params=None):
        if self.state == self.OPEN:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = self.HALF_OPEN
            else:
                raise CircuitOpenError(
                    f"Circuit breaker open. Recovery in "
                    f"{self.recovery_timeout - (time.time() - self.last_failure_time):.0f}s"
                )
        
        start = time.monotonic()
        try:
            cursor.execute(query, params)
            elapsed_ms = (time.monotonic() - start) * 1000
            
            if elapsed_ms > self.slow_query_ms:
                self._record_failure()
            else:
                self._record_success()
            
            yield cursor
        except Exception:
            self._record_failure()
            raise
    
    def _record_failure(self):
        with self._lock:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = self.OPEN
    
    def _record_success(self):
        with self._lock:
            self.failure_count = 0
            self.state = self.CLOSED
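
Here is roughly how the breaker plugs into a query path. The table, the module-level breaker instance, and the psycopg2-style connection are assumptions made for the sketch, not part of the class above.

```python
# Hypothetical wiring: one breaker per logical query family (here, product search),
# shared across requests so repeated slow queries trip it for every caller.
search_breaker = QueryCircuitBreaker(failure_threshold=5, recovery_timeout=30)

def search_products(conn, term):
    with conn.cursor() as cur:
        try:
            with search_breaker.protect(
                cur,
                "SELECT id, name FROM products WHERE name ILIKE %s LIMIT 50",
                (f"%{term}%",),
            ):
                return cur.fetchall()
        except CircuitOpenError:
            # Degrade gracefully instead of piling more load onto a struggling database
            return []
```

The key design choice is scope: one breaker per query family keeps a single bad report from blocking unrelated traffic, which a global breaker would do.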

Layer 3: pg_stat_statements Watchdog

-- Find queries consuming the most total time in the last interval
SELECT
    substring(query, 1, 80) AS short_query,
    calls,
    round(total_exec_time::numeric, 2) AS total_ms,
    round(mean_exec_time::numeric, 2) AS mean_ms,
    rows
FROM pg_stat_statements
WHERE mean_exec_time > 1000  -- queries averaging over 1 second
ORDER BY total_exec_time DESC
LIMIT 10;

We run this query every 60 seconds and alert when any single query's mean_exec_time doubles compared to the previous hour's baseline.
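
A compressed sketch of that watchdog is below. It assumes psycopg2; the notify() callable and the in-memory baseline dict are placeholders for whatever alerting and storage you already run.

```python
# Sketch: alert when a query's mean_exec_time doubles against a baseline
# captured roughly an hour earlier. notify() is a placeholder (ntfy, Slack, ...).
import time
import psycopg2

_baseline = {}           # queryid -> mean_exec_time (ms) at the last refresh
_baseline_taken_at = 0.0

def check_slow_queries(dsn, notify):
    """Call every 60 seconds from a scheduler of your choice."""
    global _baseline_taken_at
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT queryid, substring(query, 1, 80), mean_exec_time "
            "FROM pg_stat_statements WHERE calls > 0"
        )
        rows = cur.fetchall()

    for queryid, short_query, mean_ms in rows:
        previous = _baseline.get(queryid)
        if previous and mean_ms > 2 * previous:
            notify(f"mean_exec_time doubled for {short_query!r}: "
                   f"{previous:.0f}ms -> {mean_ms:.0f}ms")

    if time.time() - _baseline_taken_at > 3600:  # refresh the baseline hourly
        _baseline.update({qid: ms for qid, _, ms in rows})
        _baseline_taken_at = time.time()
```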

Replica Lag Monitoring: The Silent Data Killer

Replica lag is insidious because reads "work" — they just return stale data. We've seen cases where a user updates their profile, refreshes the page (which hits a replica), and sees old data. Support tickets ensue.

Prometheus Exporter Query

# prometheus-postgres-exporter custom query
pg_replication:
  query: |
    SELECT
      client_addr,
      state,
      EXTRACT(EPOCH FROM write_lag) AS write_lag_seconds,
      EXTRACT(EPOCH FROM replay_lag) AS replay_lag_seconds,
      pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes
    FROM pg_stat_replication;
  metrics:
    - client_addr:
        usage: "LABEL"
    - state:
        usage: "LABEL"
    - write_lag_seconds:
        usage: "GAUGE"
        description: "Replication write lag in seconds"
    - replay_lag_seconds:
        usage: "GAUGE"
        description: "Replication replay lag in seconds"
    - replay_lag_bytes:
        usage: "GAUGE"
        description: "Replication replay lag in bytes"

Alerting Rules

# Prometheus alerting rules
groups:
  - name: database_replication
    rules:
      - alert: ReplicaLagHigh
        expr: pg_replication_replay_lag_seconds > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Replica {{ $labels.client_addr }} lag is {{ $value }}s"

      - alert: ReplicaLagCritical
        expr: pg_replication_replay_lag_seconds > 30
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Replica {{ $labels.client_addr }} lag CRITICAL at {{ $value }}s"

      - alert: ReplicaDisconnected
        expr: pg_stat_replication_connected == 0
        for: 30s
        labels:
          severity: critical

Application-Level Lag-Aware Routing

import itertools


class LagAwareRouter:
    """Route reads to replicas only when lag is acceptable."""

    def __init__(self, primary_dsn, replica_dsns, max_acceptable_lag_s=2.0):
        self.primary_dsn = primary_dsn
        self.replica_dsns = replica_dsns
        self.max_lag = max_acceptable_lag_s
        self._replica_lag = {}        # Updated by the lag-monitoring thread (sketch below)
        self._rr = itertools.count()  # Monotonic counter for round-robin selection
    
    def get_connection(self, read_only=False, require_fresh=False):
        if not read_only or require_fresh:
            return self._connect(self.primary_dsn)
        
        healthy_replicas = [
            dsn for dsn in self.replica_dsns
            if self._replica_lag.get(dsn, float('inf')) < self.max_lag
        ]
        
        if not healthy_replicas:
            return self._connect(self.primary_dsn)  # Fallback
        
        # True round-robin among healthy replicas
        # (self._connect is assumed to wrap your driver's connect call, e.g. psycopg2.connect)
        return self._connect(healthy_replicas[next(self._rr) % len(healthy_replicas)])
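
The `_replica_lag` map is kept fresh by a background poller. Here is a sketch of that thread, again assuming psycopg2; it measures lag on each replica with pg_last_xact_replay_timestamp(), the replica-side view of the same number the Prometheus exporter reports.

```python
# Sketch: background thread that feeds LagAwareRouter._replica_lag.
# Assumes psycopg2; note that pg_last_xact_replay_timestamp() drifts upward
# on an idle primary, which is acceptable for routing but worth knowing.
import threading
import time
import psycopg2

def start_lag_monitor(router, poll_interval_s=1.0):
    def poll():
        while True:
            for dsn in router.replica_dsns:
                try:
                    with psycopg2.connect(dsn, connect_timeout=2) as conn, conn.cursor() as cur:
                        # Seconds since the last transaction replayed on this replica
                        cur.execute(
                            "SELECT COALESCE(EXTRACT(EPOCH FROM "
                            "now() - pg_last_xact_replay_timestamp()), 0)"
                        )
                        router._replica_lag[dsn] = float(cur.fetchone()[0])
                except Exception:
                    # Unreachable replica: treat as infinitely lagged so it gets skipped
                    router._replica_lag[dsn] = float("inf")
            time.sleep(poll_interval_s)

    threading.Thread(target=poll, daemon=True, name="replica-lag-monitor").start()
```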

Automated Failover: Patroni in Production

Manual failover at 3 AM means 15-45 minutes of downtime while someone wakes up, VPNs in, and runs the promotion. Patroni automates this to under 30 seconds.

Minimal Patroni Configuration

# /etc/patroni/patroni.yml
scope: myapp-cluster
name: pg-node-1

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.1.10:8008

etcd3:
  hosts: 10.0.1.20:2379,10.0.1.21:2379,10.0.1.22:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576  # 1MB — don't promote lagging replicas
    postgresql:
      use_pg_rewind: true
      parameters:
        max_connections: 100
        shared_buffers: 4GB
        wal_level: replica
        max_wal_senders: 5
        max_replication_slots: 5

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.1.10:5432
  data_dir: /var/lib/postgresql/16/main
  authentication:
    superuser:
      username: postgres
    replication:
      username: replicator

Testing Failover (Do This Monthly)

# Trigger a controlled switchover
patronictl -c /etc/patroni/patroni.yml switchover myapp-cluster

# Verify the new primary
patronictl -c /etc/patroni/patroni.yml list

# Check application connectivity
curl -s http://localhost:8008/health | jq .
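
During drills it also helps to confirm, from the application side, which node actually holds the leader lock rather than trusting patronictl alone. A small sketch against the Patroni REST API; the node addresses are illustrative.

```python
# Sketch: Patroni answers 200 on /leader only from the node holding the leader lock.
# The host list is illustrative; use your nodes' connect_address values.
import requests

NODES = ["10.0.1.10", "10.0.1.11", "10.0.1.12"]

def current_leader(nodes=NODES, port=8008):
    for host in nodes:
        try:
            r = requests.get(f"http://{host}:{port}/leader", timeout=2)
        except requests.RequestException:
            continue  # node unreachable
        if r.status_code == 200:
            return host
    return None

print("leader:", current_leader())
```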

We run automated failover drills monthly. The first time we ran one, we discovered our connection strings were hardcoded to the primary's IP instead of going through PgBouncer. That single drill saved us from a real outage two weeks later.

Putting It All Together: The Reliability Stack

| Layer | Tool | Purpose | Alert Threshold |
|-------|------|---------|-----------------|
| Connection pooling | PgBouncer | Multiplex connections | >10 waiting clients |
| Query protection | Circuit breaker + statement_timeout | Kill runaway queries | >5 timeouts in 60s |
| Replication health | pg_stat_replication + Prometheus | Monitor replica lag | >5s warning, >30s critical |
| Automated failover | Patroni + etcd | Zero-touch promotion | Any primary unavailability |
| Backup verification | pgBackRest + restore tests | Ensure recoverability | Failed backup or restore |

None of these patterns are revolutionary on their own. The reliability comes from layering them together so that when one layer fails, the next catches it.

What's Next

If your database setup currently consists of "a single PostgreSQL instance with nightly pg_dump," you're one bad query away from a very long night. Start with PgBouncer — it takes 20 minutes to deploy and immediately cuts the number of real connections your database has to hold open by an order of magnitude, which is where most connection-exhaustion incidents start.

Need help designing a database reliability stack for your infrastructure? Our team has deployed these patterns across dozens of production PostgreSQL clusters. [Talk to us at techsaas.cloud/contact](https://techsaas.cloud/contact) — we'll audit your current setup and build a reliability roadmap that actually prevents those 3 AM pages.
