
Zero-Downtime Database Migrations: The Complete Playbook

Complete playbook for zero-downtime database migrations in PostgreSQL. Covers expand-contract pattern, blue-green databases, shadow writes, and real incident examples.

Yash Pritwani

<p><h3>Why This Matters More Than You Think</h3></p><p>At 2:47 PM on a Tuesday, someone ran <code>ALTER TABLE orders ADD COLUMN tracking_id VARCHAR(255)</code> on a 200-million-row table in PostgreSQL.</p><p>The command acquired an ACCESS EXCLUSIVE lock. Every query on the <code>orders</code> table queued behind it. The application pool exhausted its connections in 30 seconds. Health checks failed. The load balancer pulled all backends. Customers saw 503 errors.</p><p>The <code>ALTER TABLE</code> took 47 minutes. The incident took 3 hours to fully resolve. The business impact: roughly $40,000 in lost transactions.</p><p>This is the most common database disaster in production systems, and it is entirely preventable.</p><p><h3>The Core Problem: Locks</h3></p><p>PostgreSQL (and most relational databases) uses locks to maintain data consistency during schema changes. The dangerous ones:</p>

<p>| Operation | Lock Type | Blocks Reads? | Blocks Writes? |
|-----------|-----------|---------------|----------------|
| <code>SELECT</code> | AccessShareLock | No | No |
| <code>INSERT/UPDATE/DELETE</code> | RowExclusiveLock | No | No |
| <code>CREATE INDEX</code> | ShareLock | No | <strong>Yes</strong> |
| <code>ALTER TABLE</code> (most forms) | AccessExclusiveLock | <strong>Yes</strong> | <strong>Yes</strong> |</p>

<p><code>AccessExclusiveLock</code> is the nuclear option. It blocks everything: reads, writes, even other <code>ALTER TABLE</code> commands. On a large table, the lock is held for the entire duration of the operation, and every query that arrives in the meantime queues behind it.</p><p><h3>Pattern 1: The Expand-Contract Pattern</h3></p><p>This is the safest approach for most migrations. Instead of modifying a column in place, you:</p>

<p>1. <strong>Expand</strong>: Add the new column/table alongside the old one
2. <strong>Migrate</strong>: Copy data from old to new (in batches)
3. <strong>Transition</strong>: Update the application to read from the new, write to both
4. <strong>Contract</strong>: Drop the old column/table after verification</p>

<p><strong>Example: Renaming a column</strong></p><p>Wrong way (a single in-place change that needs ACCESS EXCLUSIVE and strands every query behind it):</p>

<p><pre><code class="sql">ALTER TABLE orders RENAME COLUMN tracking TO tracking_id;
</code></pre></p>

<p>Right way (zero downtime):</p>

<p><pre><code class="sql">-- Step 1: Add the new column (instant; brief lock, no table rewrite)
ALTER TABLE orders ADD COLUMN tracking_id VARCHAR(255);

-- Step 2: Backfill in batches (no long-held lock)
UPDATE orders
SET tracking_id = tracking
WHERE id BETWEEN 1 AND 100000
  AND tracking_id IS NULL;
-- Repeat for all batches...

-- Step 3: Application reads from tracking_id, writes to both
-- (deploy application change)

-- Step 4: Verify all rows migrated
SELECT COUNT(*) FROM orders
WHERE tracking_id IS NULL AND tracking IS NOT NULL;
-- Should return 0

-- Step 5: Drop the old column (brief lock)
ALTER TABLE orders DROP COLUMN tracking;
</code></pre></p>

<p><h3>Pattern 2: Online Index Creation</h3></p><p>A standard <code>CREATE INDEX</code> blocks writes for its entire duration. On a 200M-row table, that is minutes to hours.</p>

<p><pre><code class="sql">-- WRONG: blocks all writes
CREATE INDEX idx_orders_tracking ON orders(tracking_id);

-- RIGHT: allows concurrent reads AND writes
CREATE INDEX CONCURRENTLY idx_orders_tracking ON orders(tracking_id);
</code></pre></p>

<p><code>CONCURRENTLY</code> builds the index in two passes without holding a write lock. It takes roughly twice as long, but it doesn't block your application.</p>
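A long <code>CONCURRENTLY</code> build can be watched from another session via the <code>pg_stat_progress_create_index</code> view (PostgreSQL 12+). A minimal polling sketch, assuming a psycopg2 connection supplied by the caller; <code>format_progress</code> and <code>watch</code> are our illustrative helpers, not library APIs:

```python
import time

# Real system view in PostgreSQL 12+; phase/blocks columns exist there.
PROGRESS_SQL = """
SELECT phase, blocks_done, blocks_total
FROM pg_stat_progress_create_index
"""

def format_progress(phase: str, done: int, total: int) -> str:
    """Render one progress row as 'phase: NN.N%' (total may be 0 early on)."""
    pct = 100.0 * done / total if total else 0.0
    return f"{phase}: {pct:.1f}%"

def watch(conn, interval: float = 5.0) -> None:
    """Poll the progress view until the index build finishes (psycopg2 conn assumed)."""
    with conn.cursor() as cur:
        while True:
            cur.execute(PROGRESS_SQL)
            rows = cur.fetchall()
            if not rows:  # the view is empty once no build is running
                break
            for phase, done, total in rows:
                print(format_progress(phase, done, total))
            time.sleep(interval)

if __name__ == "__main__":
    # Formatting only; a real run needs a database connection.
    print(format_progress("building index: scanning table", 512, 2048))
    # → building index: scanning table: 25.0%
```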
<p><strong>Gotcha:</strong> if <code>CREATE INDEX CONCURRENTLY</code> fails partway through, it leaves behind an invalid index that must be dropped before retrying:</p>

<p><pre><code class="sql">-- Check for invalid indexes
SELECT indexrelid::regclass, indisvalid
FROM pg_index
WHERE NOT indisvalid;

-- Clean up the invalid index before retrying
DROP INDEX CONCURRENTLY idx_orders_tracking;
</code></pre></p>

<p><h3>Pattern 3: Blue-Green Database Strategy</h3></p><p>For major schema changes that can't be done incrementally:</p>

<p>1. <strong>Blue</strong>: the current production database
2. <strong>Green</strong>: a new database with the updated schema
3. <strong>Replication</strong>: stream changes from Blue to Green using logical replication
4. <strong>Cutover</strong>: switch the application to Green, verify, then decommission Blue</p>

<p><pre><code class="sql">-- On Blue (source): set up logical replication
CREATE PUBLICATION orders_pub FOR TABLE orders;

-- On Green (target): subscribe
CREATE SUBSCRIPTION orders_sub
  CONNECTION 'host=blue-db port=5432 dbname=app'
  PUBLICATION orders_pub;
</code></pre></p>

<p>This approach is complex, but it handles cases where the schema change is too large for expand-contract.</p><p><h3>Pattern 4: Shadow Writes</h3></p><p>When migrating between different data stores (e.g., PostgreSQL to a new PostgreSQL with a different schema, or PostgreSQL to DynamoDB):</p>

<p><pre><code class="">Application
├── Write to OLD database (primary)
├── Write to NEW database (shadow, async)
├── Read from OLD database (primary)
└── Gradually shift reads to NEW database
</code></pre></p>

<p>Implementation:
1. Deploy the application writing to both databases
2. Backfill historical data to the new database
3. Run a consistency checker comparing both
4. Gradually shift read traffic (10% → 50% → 100%)
5. Once 100% of reads are on the new database, stop writes to the old
6. Decommission the old database</p>
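The gradual read shift in step 4 is commonly done with a deterministic hash of the entity key, so a given order is always routed to the same store at a given rollout percentage. A minimal sketch; <code>reads_from_new</code>, <code>read_old</code>, and <code>read_new</code> are illustrative names, not from any library:

```python
import zlib

def reads_from_new(key: str, rollout_pct: int) -> bool:
    """Deterministically route a key: True means read the NEW store.

    crc32 maps the same key to the same bucket on every host, so the
    rollout is stable; rollout_pct is the share of keys on the new store.
    """
    bucket = zlib.crc32(key.encode("utf-8")) % 100
    return bucket < rollout_pct

def read_order(order_id: str, rollout_pct: int, read_old, read_new):
    """read_old / read_new are caller-supplied fetch functions (illustrative)."""
    if reads_from_new(order_id, rollout_pct):
        return read_new(order_id)
    return read_old(order_id)

if __name__ == "__main__":
    # At 0% everything reads the old store; at 100% everything reads the new.
    assert not reads_from_new("order-42", 0)
    assert reads_from_new("order-42", 100)
    # A fixed key at a fixed percentage always routes the same way.
    assert reads_from_new("order-42", 50) == reads_from_new("order-42", 50)
```

Ramping is then just raising `rollout_pct` in config (10 → 50 → 100) with no code change.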
<p><h3>Tool: pg_repack for Table Rewrites</h3></p><p>When you need to change a column type or rebuild a table without locking it:</p>

<p><pre><code class="bash"># Install
apt-get install postgresql-16-repack

# Repack a table (rewrites it without long-held locks)
pg_repack --table orders --no-superuser-check -d mydb

# For a column type change: first add the new column and backfill,
# then pg_repack to reclaim space
pg_repack --table orders -d mydb
</code></pre></p>

<p><code>pg_repack</code> rebuilds the table in the background, using triggers to capture concurrent changes, then swaps the tables atomically.</p><p><h3>Migration Testing: The Non-Negotiable Step</h3></p><p>Every migration must be tested against a production-sized dataset before it runs in production:</p>

<p><pre><code class="bash"># 1. Create a production clone
pg_dump prod_db | psql test_db

# 2. Run the migration with timing (inside psql)
#    \timing on
#    BEGIN;
#    -- your migration SQL here
#    ROLLBACK;  -- don't actually apply, just measure timing

# 3. Check lock behavior (inside psql)
#    SELECT pid, relation::regclass, mode, granted
#    FROM pg_locks
#    WHERE relation = 'orders'::regclass;

# 4. Load test during the migration
#    Run k6/locust against the test environment while the migration runs
</code></pre></p>

<p><strong>Rule:</strong> if a migration takes longer than 5 seconds on a production-sized dataset, it needs the expand-contract pattern.</p><p><h3>Rollback Strategy</h3></p><p>Every migration needs a documented rollback:</p>

<p><pre><code class="sql">-- Migration: add tracking_id column
ALTER TABLE orders ADD COLUMN tracking_id VARCHAR(255);

-- Rollback: remove tracking_id column
ALTER TABLE orders DROP COLUMN tracking_id;
</code></pre></p>

<p>For expand-contract migrations, rollback is built in: you simply stop writing to the new column and drop it.</p><p>For destructive migrations (dropping columns, changing types), you need:
1. A backup of the affected data
2. A tested restore script
3. An estimated rollback time
4. A communication plan (who to notify if a rollback is needed)</p>
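Step 3 of the shadow-write rollout, and the "verify" step before any destructive contract, comes down to comparing snapshots of the two stores. A minimal in-memory sketch; <code>diff_snapshots</code> is our illustrative helper, and fetching the two snapshots from a consistent cut is deliberately left out:

```python
from typing import Dict, List, Tuple

Row = Tuple  # one row snapshot, e.g. (tracking, tracking_id, status)

def diff_snapshots(old: Dict[int, Row], new: Dict[int, Row]) -> Dict[str, List[int]]:
    """Compare {primary_key: row} snapshots taken from the old and new stores.

    Returns ids missing from the new store, ids unexpectedly extra in it,
    and ids whose values disagree. Taking both snapshots at a consistent
    point (e.g. the same replication position) is the hard part in practice.
    """
    missing = [pk for pk in old if pk not in new]
    extra = [pk for pk in new if pk not in old]
    mismatched = [pk for pk in old if pk in new and old[pk] != new[pk]]
    return {"missing": missing, "extra": extra, "mismatched": mismatched}

if __name__ == "__main__":
    old = {1: ("a",), 2: ("b",), 3: ("c",)}
    new = {1: ("a",), 3: ("X",), 4: ("d",)}
    print(diff_snapshots(old, new))
    # → {'missing': [2], 'extra': [4], 'mismatched': [3]}
```

A non-empty report blocks the cutover (or the <code>DROP COLUMN</code>) until the drift is explained.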
<p><h3>The Complete Migration Checklist</h3></p><p>Before running any migration in production:</p>

<ul>
<li>[ ] Migration tested on a production-sized dataset</li>
<li>[ ] Lock duration measured and acceptable (under 5 seconds)</li>
<li>[ ] Rollback script written and tested</li>
<li>[ ] Application code handles both old and new schema</li>
<li>[ ] Monitoring in place (connection pool usage, lock wait time, query latency)</li>
<li>[ ] Maintenance window scheduled (even for "zero-downtime" migrations, things can go wrong)</li>
<li>[ ] On-call engineer aware and available</li>
<li>[ ] Backup verified and restore tested</li>
<li>[ ] Migration batched if operating on more than 1M rows</li>
<li>[ ] Post-migration verification queries prepared</li>
</ul>

<p><h3>Tools We Use</h3></p>

<p>| Tool | Purpose |
|------|---------|
| <code>pgmigrate</code> / <code>goose</code> / <code>flyway</code> | Schema version management |
| <code>pg_repack</code> | Online table rewrites |
| <code>pgbouncer</code> | Connection pooling during migrations |
| <code>pg_stat_activity</code> | Monitoring running queries and locks |
| <code>k6</code> | Load testing during the migration |</p>

<p><h3>The $40,000 Lesson, Summarized</h3></p><p>1. Never run <code>ALTER TABLE</code> on large tables without checking its lock behavior

2. Use <code>CREATE INDEX CONCURRENTLY</code>, always
3. The expand-contract pattern handles 90% of migrations safely
4. Test on production-sized data, always
5. Have a rollback plan, always</p><p>---</p><p><em>We help teams build migration pipelines that don't wake anyone up at 3 AM. <a href="https://www.techsaas.cloud/contact">Book a free database architecture review</a>.</em></p>

#database#postgresql#migrations#zero-downtime#backend#sre

Need help with backend engineering?

TechSaaS provides expert consulting and managed services for cloud infrastructure, DevOps, and AI/ML operations.