
Zero-Downtime Database Migrations: The Complete Playbook

Complete playbook for zero-downtime database migrations in PostgreSQL. Covers expand-contract pattern, blue-green databases, shadow writes, and real incident examples.

Yash Pritwani

<p><h3>Why This Matters More Than You Think</h3></p><p>At 2:47 PM on a Tuesday, someone ran <code>ALTER TABLE orders ADD COLUMN tracking_id VARCHAR(255)</code> on a 200-million-row table in PostgreSQL.</p><p>The command acquired an ACCESS EXCLUSIVE lock. Every query on the <code>orders</code> table queued behind it. The application pool exhausted its connections in 30 seconds. Health checks failed. The load balancer pulled all backends. Customers saw 503 errors.</p><p>The <code>ALTER TABLE</code> took 47 minutes. The incident took 3 hours to fully resolve. The business impact: roughly $40,000 in lost transactions.</p><p>This is the most common database disaster in production systems, and it is entirely preventable.</p><p><h3>The Core Problem: Locks</h3></p><p>PostgreSQL (and most relational databases) uses locks to maintain data consistency during schema changes. The dangerous ones:</p>

<p>| Operation | Lock Type | Blocks Reads? | Blocks Writes? |
|-----------|-----------|---------------|----------------|
| <code>SELECT</code> | AccessShareLock | No | No |
| <code>INSERT/UPDATE/DELETE</code> | RowExclusiveLock | No | No |
| <code>CREATE INDEX</code> | ShareLock | No | <strong>Yes</strong> |
| <code>ALTER TABLE</code> (most forms) | AccessExclusiveLock | <strong>Yes</strong> | <strong>Yes</strong> |</p>

<p><code>AccessExclusiveLock</code> is the nuclear option. It blocks everything: reads, writes, even other <code>ALTER TABLE</code> commands. On a large table, the lock is held for the entire duration of the operation, and every query that arrives in the meantime queues behind it.</p><p><h3>Pattern 1: The Expand-Contract Pattern</h3></p><p>This is the safest approach for most migrations. Instead of modifying a column in place, you:</p>

<p>1. <strong>Expand</strong>: Add the new column/table alongside the old one
2. <strong>Migrate</strong>: Copy data from old to new (in batches)
3. <strong>Transition</strong>: Update the application to read from the new, write to both
4. <strong>Contract</strong>: Drop the old column/table after verification</p>

<p><strong>Example: Renaming a column</strong></p><p>Wrong way (a single in-place change that needs ACCESS EXCLUSIVE and strands every query behind it):</p>

<p><pre><code class="sql">ALTER TABLE orders RENAME COLUMN tracking TO tracking_id;
</code></pre></p>

<p>Right way (zero downtime):</p>

<p><pre><code class="sql">-- Step 1: Add the new column (instant; brief lock, no table rewrite)
ALTER TABLE orders ADD COLUMN tracking_id VARCHAR(255);

-- Step 2: Backfill in batches (no long-held lock)
UPDATE orders
SET tracking_id = tracking
WHERE id BETWEEN 1 AND 100000
  AND tracking_id IS NULL;
-- Repeat for all batches...

-- Step 3: Application reads from tracking_id, writes to both
-- (deploy application change)

-- Step 4: Verify all rows migrated
SELECT COUNT(*) FROM orders
WHERE tracking_id IS NULL AND tracking IS NOT NULL;
-- Should return 0

-- Step 5: Drop the old column (brief lock)
ALTER TABLE orders DROP COLUMN tracking;
</code></pre></p>

<p><h3>Pattern 2: Online Index Creation</h3></p><p>A standard <code>CREATE INDEX</code> blocks writes for its entire duration. On a 200M-row table, that is minutes to hours.</p>

<p><pre><code class="sql">-- WRONG: blocks all writes
CREATE INDEX idx_orders_tracking ON orders(tracking_id);

-- RIGHT: allows concurrent reads AND writes
CREATE INDEX CONCURRENTLY idx_orders_tracking ON orders(tracking_id);
</code></pre></p>

<p><code>CONCURRENTLY</code> builds the index in two passes without holding a write lock. It takes roughly twice as long, but it doesn't block your application.</p>
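A long <code>CONCURRENTLY</code> build can be watched from another session via the <code>pg_stat_progress_create_index</code> view (PostgreSQL 12+). A minimal polling sketch, assuming a psycopg2 connection supplied by the caller; <code>format_progress</code> and <code>watch</code> are our illustrative helpers, not library APIs:

```python
import time

# Real system view in PostgreSQL 12+; phase/blocks columns exist there.
PROGRESS_SQL = """
SELECT phase, blocks_done, blocks_total
FROM pg_stat_progress_create_index
"""

def format_progress(phase: str, done: int, total: int) -> str:
    """Render one progress row as 'phase: NN.N%' (total may be 0 early on)."""
    pct = 100.0 * done / total if total else 0.0
    return f"{phase}: {pct:.1f}%"

def watch(conn, interval: float = 5.0) -> None:
    """Poll the progress view until the index build finishes (psycopg2 conn assumed)."""
    with conn.cursor() as cur:
        while True:
            cur.execute(PROGRESS_SQL)
            rows = cur.fetchall()
            if not rows:  # the view is empty once no build is running
                break
            for phase, done, total in rows:
                print(format_progress(phase, done, total))
            time.sleep(interval)

if __name__ == "__main__":
    # Formatting only; a real run needs a database connection.
    print(format_progress("building index: scanning table", 512, 2048))
    # → building index: scanning table: 25.0%
```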
<p><strong>Gotcha:</strong> if <code>CREATE INDEX CONCURRENTLY</code> fails partway through, it leaves behind an invalid index that must be dropped before retrying:</p>

<p><pre><code class="sql">-- Check for invalid indexes
SELECT indexrelid::regclass, indisvalid
FROM pg_index
WHERE NOT indisvalid;

-- Clean up the invalid index before retrying
DROP INDEX CONCURRENTLY idx_orders_tracking;
</code></pre></p>

<p><h3>Pattern 3: Blue-Green Database Strategy</h3></p><p>For major schema changes that can't be done incrementally:</p>

<p>1. <strong>Blue</strong>: the current production database
2. <strong>Green</strong>: a new database with the updated schema
3. <strong>Replication</strong>: stream changes from Blue to Green using logical replication
4. <strong>Cutover</strong>: switch the application to Green, verify, then decommission Blue</p>

<p><pre><code class="sql">-- On Blue (source): set up logical replication
CREATE PUBLICATION orders_pub FOR TABLE orders;

-- On Green (target): subscribe
CREATE SUBSCRIPTION orders_sub
  CONNECTION 'host=blue-db port=5432 dbname=app'
  PUBLICATION orders_pub;
</code></pre></p>

<p>This approach is complex, but it handles cases where the schema change is too large for expand-contract.</p><p><h3>Pattern 4: Shadow Writes</h3></p><p>When migrating between different data stores (e.g., PostgreSQL to a new PostgreSQL with a different schema, or PostgreSQL to DynamoDB):</p>

<p><pre><code class="">Application
├── Write to OLD database (primary)
├── Write to NEW database (shadow, async)
├── Read from OLD database (primary)
└── Gradually shift reads to NEW database
</code></pre></p>

<p>Implementation:
1. Deploy the application writing to both databases
2. Backfill historical data to the new database
3. Run a consistency checker comparing both
4. Gradually shift read traffic (10% → 50% → 100%)
5. Once 100% of reads are on the new database, stop writes to the old
6. Decommission the old database</p>
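The gradual read shift in step 4 is commonly done with a deterministic hash of the entity key, so a given order is always routed to the same store at a given rollout percentage. A minimal sketch; <code>reads_from_new</code>, <code>read_old</code>, and <code>read_new</code> are illustrative names, not from any library:

```python
import zlib

def reads_from_new(key: str, rollout_pct: int) -> bool:
    """Deterministically route a key: True means read the NEW store.

    crc32 maps the same key to the same bucket on every host, so the
    rollout is stable; rollout_pct is the share of keys on the new store.
    """
    bucket = zlib.crc32(key.encode("utf-8")) % 100
    return bucket < rollout_pct

def read_order(order_id: str, rollout_pct: int, read_old, read_new):
    """read_old / read_new are caller-supplied fetch functions (illustrative)."""
    if reads_from_new(order_id, rollout_pct):
        return read_new(order_id)
    return read_old(order_id)

if __name__ == "__main__":
    # At 0% everything reads the old store; at 100% everything reads the new.
    assert not reads_from_new("order-42", 0)
    assert reads_from_new("order-42", 100)
    # A fixed key at a fixed percentage always routes the same way.
    assert reads_from_new("order-42", 50) == reads_from_new("order-42", 50)
```

Ramping is then just raising `rollout_pct` in config (10 → 50 → 100) with no code change.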
<p><h3>Tool: pg_repack for Table Rewrites</h3></p><p>When you need to change a column type or rebuild a table without locking it:</p>

<p><pre><code class="bash"># Install
apt-get install postgresql-16-repack

# Repack a table (rewrites it without long-held locks)
pg_repack --table orders --no-superuser-check -d mydb

# For a column type change: first add the new column and backfill,
# then pg_repack to reclaim space
pg_repack --table orders -d mydb
</code></pre></p>

<p><code>pg_repack</code> rebuilds the table in the background, using triggers to capture concurrent changes, then swaps the tables atomically.</p><p><h3>Migration Testing: The Non-Negotiable Step</h3></p><p>Every migration must be tested against a production-sized dataset before it runs in production:</p>

<p><pre><code class="bash"># 1. Create a production clone
pg_dump prod_db | psql test_db

# 2. Run the migration with timing (inside psql)
#    \timing on
#    BEGIN;
#    -- your migration SQL here
#    ROLLBACK;  -- don't actually apply, just measure timing

# 3. Check lock behavior (inside psql)
#    SELECT pid, relation::regclass, mode, granted
#    FROM pg_locks
#    WHERE relation = 'orders'::regclass;

# 4. Load test during the migration
#    Run k6/locust against the test environment while the migration runs
</code></pre></p>

<p><strong>Rule:</strong> if a migration takes longer than 5 seconds on a production-sized dataset, it needs the expand-contract pattern.</p><p><h3>Rollback Strategy</h3></p><p>Every migration needs a documented rollback:</p>

<p><pre><code class="sql">-- Migration: add tracking_id column
ALTER TABLE orders ADD COLUMN tracking_id VARCHAR(255);

-- Rollback: remove tracking_id column
ALTER TABLE orders DROP COLUMN tracking_id;
</code></pre></p>

<p>For expand-contract migrations, rollback is built in: you simply stop writing to the new column and drop it.</p><p>For destructive migrations (dropping columns, changing types), you need:
1. A backup of the affected data
2. A tested restore script
3. An estimated rollback time
4. A communication plan (who to notify if a rollback is needed)</p>
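Step 3 of the shadow-write rollout, and the "verify" step before any destructive contract, comes down to comparing snapshots of the two stores. A minimal in-memory sketch; <code>diff_snapshots</code> is our illustrative helper, and fetching the two snapshots from a consistent cut is deliberately left out:

```python
from typing import Dict, List, Tuple

Row = Tuple  # one row snapshot, e.g. (tracking, tracking_id, status)

def diff_snapshots(old: Dict[int, Row], new: Dict[int, Row]) -> Dict[str, List[int]]:
    """Compare {primary_key: row} snapshots taken from the old and new stores.

    Returns ids missing from the new store, ids unexpectedly extra in it,
    and ids whose values disagree. Taking both snapshots at a consistent
    point (e.g. the same replication position) is the hard part in practice.
    """
    missing = [pk for pk in old if pk not in new]
    extra = [pk for pk in new if pk not in old]
    mismatched = [pk for pk in old if pk in new and old[pk] != new[pk]]
    return {"missing": missing, "extra": extra, "mismatched": mismatched}

if __name__ == "__main__":
    old = {1: ("a",), 2: ("b",), 3: ("c",)}
    new = {1: ("a",), 3: ("X",), 4: ("d",)}
    print(diff_snapshots(old, new))
    # → {'missing': [2], 'extra': [4], 'mismatched': [3]}
```

A non-empty report blocks the cutover (or the <code>DROP COLUMN</code>) until the drift is explained.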
<p><h3>The Complete Migration Checklist</h3></p><p>Before running any migration in production:</p>

<ul>
<li>[ ] Migration tested on a production-sized dataset</li>
<li>[ ] Lock duration measured and acceptable (under 5 seconds)</li>
<li>[ ] Rollback script written and tested</li>
<li>[ ] Application code handles both old and new schema</li>
<li>[ ] Monitoring in place (connection pool usage, lock wait time, query latency)</li>
<li>[ ] Maintenance window scheduled (even for "zero-downtime" migrations, things can go wrong)</li>
<li>[ ] On-call engineer aware and available</li>
<li>[ ] Backup verified and restore tested</li>
<li>[ ] Migration batched if operating on more than 1M rows</li>
<li>[ ] Post-migration verification queries prepared</li>
</ul>

<p><h3>Tools We Use</h3></p>

<p>| Tool | Purpose |
|------|---------|
| <code>pgmigrate</code> / <code>goose</code> / <code>flyway</code> | Schema version management |
| <code>pg_repack</code> | Online table rewrites |
| <code>pgbouncer</code> | Connection pooling during migrations |
| <code>pg_stat_activity</code> | Monitoring running queries and locks |
| <code>k6</code> | Load testing during the migration |</p>

<p><h3>The $40,000 Lesson, Summarized</h3></p><p>1. Never run <code>ALTER TABLE</code> on large tables without checking its lock behavior

2. Use <code>CREATE INDEX CONCURRENTLY</code>, always
3. The expand-contract pattern handles 90% of migrations safely
4. Test on production-sized data, always
5. Have a rollback plan, always</p><p>---</p><p><em>We help teams build migration pipelines that don't wake anyone up at 3 AM. <a href="https://www.techsaas.cloud/contact">Book a free database architecture review</a>.</em></p>

#database#postgresql#migrations#zero-downtime#backend#sre

Need help with backend engineering?

TechSaaS provides expert consulting and managed services for cloud infrastructure, DevOps, and AI/ML operations.