Zero-Downtime Kubernetes Deployments: Beyond Basic Rolling Updates
Rolling updates are just the beginning. Here is how to achieve true zero-downtime deployments with progressive delivery, canary releases, blue-green strategies, and proper readiness gates in Kubernetes.
Rolling Updates Are Not Zero-Downtime
If you think kubectl rollout restart gives you zero-downtime deployments, you are wrong. Rolling updates are the default strategy in Kubernetes, and they are better than nothing, but they do not guarantee zero downtime. Here is why:
- Pod readiness is not application readiness: Your pod might pass its readiness probe while the application is still warming caches, establishing database connections, or loading models
- Connection draining is often misconfigured: Existing connections get terminated when old pods are removed
- DNS propagation delays: Service endpoints take time to update across the cluster
- Database migrations: Schema changes cannot be rolled forward and backward simultaneously with a simple rolling update
True zero-downtime deployment requires a deliberate strategy that goes far beyond the defaults.
The Deployment Spectrum
From simplest to most sophisticated:
Rolling Update (basic) → Blue-Green (safe) → Canary (smart) → Progressive Delivery (automated)
Each step adds complexity but reduces risk. The right choice depends on your traffic volume, risk tolerance, and team maturity.
Strategy 1: Proper Rolling Updates
Before you move to advanced strategies, make sure your rolling updates are actually configured correctly.
Readiness Probes That Actually Work
Your readiness probe should verify that the application is genuinely ready to serve traffic, not just that the process is running:
readinessProbe:
  httpGet:
    path: /healthz/ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
  successThreshold: 2  # Require 2 consecutive successes
The /healthz/ready endpoint should check:
- Database connections are established
- Cache is warmed (or warming is acceptable)
- Dependent services are reachable
- Application-specific initialization is complete
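A readiness endpoint that aggregates these checks can be sketched in a few lines of standard-library Python. The check functions and the ReadyHandler class below are illustrative stand-ins, not a prescribed implementation — wire them to your real dependencies:

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical check functions -- replace with real dependency probes.
def db_connected() -> bool:
    return True  # e.g. run "SELECT 1" against the connection pool

def cache_warmed() -> bool:
    return True  # e.g. test a "warmup complete" flag

CHECKS = {"database": db_connected, "cache": cache_warmed}

def readiness_status(checks=CHECKS):
    """Return (http_status, failed_check_names): 200 only if every check passes."""
    failures = [name for name, check in checks.items() if not check()]
    return (200 if not failures else 503), failures

class ReadyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz/ready":
            status, failures = readiness_status()
            self.send_response(status)
            self.end_headers()
            body = b"ok" if status == 200 else ", ".join(failures).encode()
            self.wfile.write(body)
```

Returning 503 (not an exception or a timeout) makes the kubelet's probe fail fast and keeps the pod out of Service endpoints until every dependency is ready.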
Graceful Shutdown with PreStop Hooks
When Kubernetes terminates a pod, the sequence is:
1. The pod is marked Terminating and its removal from Service endpoints begins
2. The preStop hook runs (if configured)
3. SIGTERM is sent to the container
4. After terminationGracePeriodSeconds, SIGKILL is sent
The problem: endpoint removal (step 1) happens asynchronously, in parallel with steps 2 and 3. The pod can still receive traffic after shutdown has begun. Fix this with a preStop hook:
lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 10"]
terminationGracePeriodSeconds: 60
The 10-second sleep gives the kube-proxy time to update iptables rules and stop routing traffic to the terminating pod.
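The application side matters too: on SIGTERM it should stop accepting new work while letting in-flight requests finish. A minimal sketch of that pattern (the serve_request function is a hypothetical stand-in for your request path):

```python
import signal

draining = False

def handle_sigterm(signum, frame):
    # Flip a flag instead of exiting; in-flight requests keep running
    # until they finish or terminationGracePeriodSeconds expires.
    global draining
    draining = True

signal.signal(signal.SIGTERM, handle_sigterm)

def serve_request():
    """Refuse new work once draining has begun (illustrative)."""
    if draining:
        return 503  # by now the endpoint removal should be routing traffic away
    return 200
```

Combined with the preStop sleep, this gives kube-proxy time to drop the pod before the application starts refusing requests.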
Pod Disruption Budgets
Prevent Kubernetes from removing too many pods at once:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: "75%"
  selector:
    matchLabels:
      app: api
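To reason about what a percentage-based budget actually permits, note that Kubernetes rounds a minAvailable percentage up to a whole pod count. A small sketch of the arithmetic (the helper is my own, assuming that round-up behavior):

```python
import math

def allowed_disruptions(replicas: int, min_available_pct: float) -> int:
    """Pods that may be voluntarily disrupted under a minAvailable percentage.

    With 8 replicas and minAvailable "75%", 6 pods must stay up,
    so at most 2 can be evicted at a time.
    """
    must_stay_up = math.ceil(replicas * min_available_pct / 100)
    return max(replicas - must_stay_up, 0)
```

Note the small-replica edge case: at 3 replicas, 75% rounds up to 3 pods, so the budget allows zero voluntary disruptions and node drains will block.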
Strategy 2: Blue-Green Deployments
Blue-green deployments run two identical environments. Traffic switches from blue (current) to green (new) atomically.
Implementation with Kubernetes
# Blue deployment (current)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-blue
  labels:
    app: api
    version: blue
---
# Green deployment (new)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-green
  labels:
    app: api
    version: green
---
# Service points to the active color
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api
    version: blue  # Switch to "green" to cut over
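The cutover itself is a single selector change on the Service, which makes it easy to script and to reverse. A sketch of building that strategic-merge patch (the helper name is my own; apply the JSON with kubectl patch or your Kubernetes client of choice):

```python
import json

def cutover_patch(active_color: str) -> dict:
    """Strategic-merge patch that repoints the api Service at one color."""
    if active_color not in ("blue", "green"):
        raise ValueError("color must be 'blue' or 'green'")
    return {"spec": {"selector": {"app": "api", "version": active_color}}}

# e.g. kubectl patch service api -p '<the JSON printed below>'
print(json.dumps(cutover_patch("green")))
```

Rollback is the same operation with the colors swapped, which is the whole appeal of blue-green: the undo path is identical to the deploy path.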
When to Use Blue-Green
- Database migrations that require coordinated schema + code changes
- Major version upgrades where rollback must be instant
- Compliance requirements that mandate zero downtime
- When you need to run pre-deployment smoke tests against production infrastructure
Tradeoffs
- Double the resources: You need capacity for both environments simultaneously
- Database state: Both environments share the database — schema migrations need careful planning
- Session handling: User sessions must be externalized (Redis, DB) to survive the cutover
Strategy 3: Canary Deployments
Canary releases route a small percentage of traffic to the new version, gradually increasing as confidence builds.
Implementation with Istio
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api
spec:
  hosts:
    - api.internal
  http:
    - route:
        - destination:
            host: api
            subset: stable
          weight: 95
        - destination:
            host: api
            subset: canary
          weight: 5
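Because the weights on a route must sum to 100, bumping the canary percentage is easy to automate. A sketch of generating the route list for a given weight (host and subset names match the YAML above; the helper itself is illustrative):

```python
def canary_routes(canary_weight: int) -> list:
    """Build the VirtualService http route destinations for a canary weight."""
    if not 0 <= canary_weight <= 100:
        raise ValueError("canary weight must be between 0 and 100")
    return [
        {"destination": {"host": "api", "subset": "stable"},
         "weight": 100 - canary_weight},
        {"destination": {"host": "api", "subset": "canary"},
         "weight": canary_weight},
    ]
```

A deployment pipeline can render this into the VirtualService at each promotion step instead of hand-editing weights.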
Canary Analysis
The power of canary deployments is automated analysis. Compare canary metrics against the stable version:
from dataclasses import dataclass

@dataclass
class Metrics:
    error_rate: float
    p99_latency: float
    success_rate: float

def analyze_canary(stable_metrics: Metrics, canary_metrics: Metrics) -> str:
    # Error rate more than 10% above stable = bad
    if canary_metrics.error_rate > stable_metrics.error_rate * 1.1:
        return "ROLLBACK"
    # p99 latency more than 20% above stable = bad
    if canary_metrics.p99_latency > stable_metrics.p99_latency * 1.2:
        return "ROLLBACK"
    # Success rate more than 1% below stable = bad
    if canary_metrics.success_rate < stable_metrics.success_rate * 0.99:
        return "ROLLBACK"
    return "CONTINUE"  # All clear, increase traffic
Strategy 4: Progressive Delivery with Argo Rollouts
Argo Rollouts extends Kubernetes with progressive delivery strategies that automate the entire canary process.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 25
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 50
        - pause: {duration: 10m}
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 100
      canaryService: api-canary
      stableService: api-stable
This automatically:
- Sends 5% of traffic to the new version
- Waits 5 minutes and checks metrics
- If metrics are good, increases to 25%
- Repeats analysis at each step
- Rolls back automatically if any analysis fails
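The control loop behind those steps can be sketched in a few lines. This is a simulation of the promote-or-rollback logic, not the controller's real code; analysis_passes stands in for an AnalysisTemplate query:

```python
def progressive_rollout(analysis_passes, weights=(5, 25, 50, 100)):
    """Walk the canary weights, running analysis between steps.

    analysis_passes(weight) represents the metric check at that step;
    a False at any intermediate weight aborts and rolls back to 0.
    """
    current = 0
    for weight in weights:
        current = weight
        if weight < 100 and not analysis_passes(weight):
            return "ROLLBACK", 0
    return "PROMOTED", current
```

The value of the real controller is that this loop runs unattended, with the pause durations and metric queries declared in the Rollout spec rather than in pipeline scripts.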
Analysis Templates
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{status=~"2.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
      successCondition: result[0] > 0.99
      failureLimit: 3
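The PromQL ratio is worth internalizing: 2xx request rate divided by total request rate. The same arithmetic in plain Python (the helper is mine, for illustration only):

```python
def success_rate(requests_by_status: dict) -> float:
    """2xx requests over all requests, mirroring the PromQL ratio."""
    total = sum(requests_by_status.values())
    ok = sum(count for status, count in requests_by_status.items()
             if 200 <= status < 300)
    return ok / total if total else 1.0
```

With successCondition set to result[0] > 0.99, a window of 1,000 requests tolerates at most 9 failures before the analysis step counts as failed.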
Database Migrations: The Hard Part
None of the above strategies solve the database migration problem. Schema changes require special handling:
The Expand-Contract Pattern
1. Expand: Add new columns/tables without removing old ones
2. Deploy: Release code that works with both old and new schema
3. Migrate: Backfill data into new columns
4. Contract: Remove old columns after all pods use the new schema
-- Step 1: Expand (add new column, keep old)
ALTER TABLE users ADD COLUMN email_verified boolean DEFAULT false;
-- Step 2: Deploy code that writes to BOTH columns
-- Step 3: Backfill
UPDATE users SET email_verified = (verification_date IS NOT NULL);
-- Step 4: Contract (after all pods updated)
ALTER TABLE users DROP COLUMN verification_date;
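Step 2, the dual-write phase, is the part teams most often skip. A minimal sketch of the application-side write during that phase (column names taken from the SQL above; the helper name is my own):

```python
def user_row_update(verification_date):
    """During expand, write BOTH the old and the new column.

    Old pods still read verification_date; new pods read
    email_verified. Either version sees consistent data, so the
    rolling update can proceed in any order.
    """
    return {
        "verification_date": verification_date,           # old column
        "email_verified": verification_date is not None,  # new column
    }
```

Only after every pod runs dual-write code is it safe to backfill (step 3) and finally drop the old column (step 4).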
Never run destructive migrations during a deployment. Separate schema changes from code changes.
Our Production Setup
We use ArgoCD with Argo Rollouts for all production deployments:
- GitOps: All deployment manifests are in Git, ArgoCD syncs automatically
- Progressive delivery: Every production deployment uses canary analysis
- Automated rollback: If error rate increases >5% or latency increases >20%, automatic rollback
- Deployment windows: Non-emergency deployments only during business hours
- Post-deployment verification: Synthetic tests run after every deployment
The result: zero deployment-related incidents in the past 3 months.
Conclusion
Zero-downtime deployment is not a single technique — it is a combination of proper readiness probes, graceful shutdown, canary analysis, and progressive delivery. Start with fixing your rolling updates (readiness probes, preStop hooks, PDBs), then graduate to canary deployments as your monitoring matures.
The goal is not just zero downtime — it is confidence. When you can deploy to production at 2 PM on a Tuesday without anxiety, you have won.