Zero-Downtime Kubernetes Deployments: Beyond Basic Rolling Updates
Rolling updates are just the beginning. Here is how to achieve true zero-downtime deployments with progressive delivery, canary releases, blue-green strategies, and proper readiness gates in Kubernetes.
Rolling Updates Are Not Zero-Downtime
If you think kubectl rollout restart gives you zero-downtime deployments, you are wrong. Rolling updates are the default strategy in Kubernetes, and they are better than nothing, but they do not guarantee zero downtime. Here is why:
1. **Pod readiness is not application readiness.** Your pod might pass its readiness probe while the application is still warming caches, establishing database connections, or loading models.
2. **Connection draining is often misconfigured.** Existing connections get terminated when old pods are removed.
3. **DNS propagation delays.** Service endpoints take time to update across the cluster.
4. **Database migrations.** Schema changes cannot be rolled forward and backward simultaneously with a simple rolling update.
True zero-downtime deployment requires a deliberate strategy that goes far beyond the defaults.
The Deployment Spectrum
From simplest to most sophisticated:
```
Rolling Update  →  Blue-Green  →  Canary  →  Progressive Delivery
    (basic)          (safe)       (smart)       (automated)
```

Each step adds complexity but reduces risk. The right choice depends on your traffic volume, risk tolerance, and team maturity.
Strategy 1: Proper Rolling Updates
Before you move to advanced strategies, make sure your rolling updates are actually configured correctly.
Readiness Probes That Actually Work
Your readiness probe should verify that the application is genuinely ready to serve traffic, not just that the process is running:
```yaml
readinessProbe:
  httpGet:
    path: /healthz/ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
  successThreshold: 2  # Require 2 consecutive successes
```

The /healthz/ready endpoint should check that:

- database connections are established
- caches are warmed
- any models or other startup dependencies are loaded
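To make that concrete, a minimal readiness handler can aggregate individual dependency checks. The check functions below (`db_connected`, `cache_warmed`) are hypothetical stand-ins for your real clients:

```python
def db_connected():
    """Stand-in: in practice, run a cheap query (e.g. SELECT 1) against the pool."""
    return True

def cache_warmed():
    """Stand-in: in practice, verify the warm-up job has completed."""
    return True

def check_ready(checks):
    """Run every named check; return (ready, names_of_failed_checks)."""
    failed = [name for name, check in checks.items() if not check()]
    return len(failed) == 0, failed

ready, failed = check_ready({"database": db_connected, "cache": cache_warmed})
```

Returning the failed check names in the probe response body makes a failing rollout much easier to debug.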
Graceful Shutdown with PreStop Hooks
When Kubernetes terminates a pod, the sequence is:

1. The pod is marked Terminating and removal from Service endpoints begins
2. The preStop hook runs (if configured)
3. SIGTERM is sent to the container
4. After terminationGracePeriodSeconds, SIGKILL is sent

The problem: step 1 is asynchronous with the rest. Endpoint updates take time to reach every node's kube-proxy, so the pod can still receive traffic after shutdown has begun. Fix this with a preStop hook:
```yaml
lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 10"]
terminationGracePeriodSeconds: 60
```

The 10-second sleep gives kube-proxy time to update iptables rules and stop routing traffic to the terminating pod.
Pod Disruption Budgets
Prevent Kubernetes from removing too many pods at once:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: "75%"
  selector:
    matchLabels:
      app: api
```

With a percentage, Kubernetes rounds up: at 8 replicas, at least ceil(0.75 × 8) = 6 pods must stay available, so at most 2 can be voluntarily disrupted at once.

Strategy 2: Blue-Green Deployments
Blue-green deployments run two identical environments. Traffic switches from blue (current) to green (new) atomically.
Implementation with Kubernetes
```yaml
# Blue deployment (current)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-blue
  labels:
    app: api
    version: blue
---
# Green deployment (new)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-green
  labels:
    app: api
    version: green
---
# Service points to the active color
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api
    version: blue  # Switch to "green" to cut over
```

When to Use Blue-Green

Blue-green shines when you need an atomic cutover and instant rollback: switching back is just pointing the selector at the previous color. It assumes you can afford to run two full environments side by side for the duration of the release.
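Because the cutover and a rollback reduce to the same selector change, both are easy to script. A sketch that builds the patch body (matching the Service above); applying it is a single `kubectl patch` or API call:

```python
import json

def selector_patch(color):
    """Strategic-merge patch that points the api Service at the given color."""
    return {"spec": {"selector": {"app": "api", "version": color}}}

# e.g. kubectl patch service api -p '<this JSON>'
patch_json = json.dumps(selector_patch("green"))
```

Rolling back is the same call with the previous color.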
Tradeoffs

- Roughly double the resource cost while both environments run
- The switch is all-or-nothing: every user moves to the new version at once
- No gradual exposure, so a bad release hits 100% of traffic immediately
Strategy 3: Canary Deployments
Canary releases route a small percentage of traffic to the new version, gradually increasing as confidence builds.
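Conceptually, the routing layer makes a weighted random choice per request. A toy sketch (in production the service mesh does this for you):

```python
import random

def pick_subset(weights, rng=random.random):
    """Pick a destination subset from Istio-style (name, weight) pairs summing to 100."""
    point = rng() * 100
    cumulative = 0
    for subset, weight in weights:
        cumulative += weight
        if point < cumulative:
            return subset
    return weights[-1][0]  # guard against floating-point edge cases

# With weights 95/5, roughly 1 in 20 requests hits the canary
pick_subset([("stable", 95), ("canary", 5)])
```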
Implementation with Istio
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api
spec:
  hosts:
  - api.internal
  http:
  - route:
    - destination:
        host: api
        subset: stable
      weight: 95
    - destination:
        host: api
        subset: canary
      weight: 5
```

Canary Analysis
The power of canary deployments is automated analysis. Compare canary metrics against the stable version:
```python
def analyze_canary(stable_metrics, canary_metrics):
    # Compare error rates
    if canary_metrics.error_rate > stable_metrics.error_rate * 1.1:
        return "ROLLBACK"  # 10% higher error rate = bad
    # Compare latency
    if canary_metrics.p99_latency > stable_metrics.p99_latency * 1.2:
        return "ROLLBACK"  # 20% higher latency = bad
    # Compare success rate
    if canary_metrics.success_rate < stable_metrics.success_rate * 0.99:
        return "ROLLBACK"
    return "CONTINUE"  # All clear, increase traffic
```

Strategy 4: Progressive Delivery with Argo Rollouts
Argo Rollouts extends Kubernetes with progressive delivery strategies that automate the entire canary process.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  strategy:
    canary:
      steps:
      - setWeight: 5
      - pause: {duration: 5m}
      - analysis:
          templates:
          - templateName: success-rate
      - setWeight: 25
      - pause: {duration: 5m}
      - analysis:
          templates:
          - templateName: success-rate
      - setWeight: 50
      - pause: {duration: 10m}
      - analysis:
          templates:
          - templateName: success-rate
      - setWeight: 100
      canaryService: api-canary
      stableService: api-stable
```

This automatically:

1. Sends 5% of traffic to the new version
2. Waits 5 minutes and checks metrics
3. If metrics are good, increases to 25%
4. Repeats the analysis at each step
5. Rolls back automatically if any analysis fails
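The flow above is easy to hold in your head as a loop over the steps, analyzing between weight increases. A simplified sketch (pauses omitted, with `analyze` standing in for the metric checks):

```python
def run_canary(steps, analyze):
    """Walk canary steps in order; abort and report the last weight if analysis fails."""
    weight = 0
    for step in steps:
        if "setWeight" in step:
            weight = step["setWeight"]
        elif "analysis" in step:
            if analyze() != "CONTINUE":
                return ("ROLLBACK", weight)
    return ("PROMOTED", weight)

steps = [
    {"setWeight": 5}, {"analysis": True},
    {"setWeight": 25}, {"analysis": True},
    {"setWeight": 50}, {"analysis": True},
    {"setWeight": 100},
]
```

The `analyze` callback plays the role that an AnalysisTemplate plays in a real Rollout.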
Analysis Templates
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
  - name: success-rate
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          sum(rate(http_requests_total{status=~"2.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
    successCondition: result[0] > 0.99
    failureLimit: 3
```

Database Migrations: The Hard Part
None of the above strategies solve the database migration problem. Schema changes require special handling:
The Expand-Contract Pattern
1. **Expand:** Add new columns/tables without removing old ones
2. **Deploy:** Release code that works with both old and new schema
3. **Migrate:** Backfill data into the new columns
4. **Contract:** Remove the old columns after all pods use the new schema
```sql
-- Step 1: Expand (add new column, keep old)
ALTER TABLE users ADD COLUMN email_verified boolean DEFAULT false;

-- Step 2: Deploy code that writes to BOTH columns

-- Step 3: Backfill
UPDATE users SET email_verified = (verification_date IS NOT NULL);

-- Step 4: Contract (after all pods updated)
ALTER TABLE users DROP COLUMN verification_date;
```

Never run destructive migrations during a deployment. Separate schema changes from code changes.
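Step 2 is where application code carries the pattern: every write touches both the old and the new column until the contract step. A sketch with a hypothetical save function (`FakeDB` here just records SQL for illustration):

```python
class FakeDB:
    """Stands in for a real connection; records executed statements."""
    def __init__(self):
        self.calls = []

    def execute(self, sql, params):
        self.calls.append((sql, params))

def save_verification(db, user_id, verification_date):
    # Expand-contract step 2: keep writing the old column (verification_date)
    # while also writing the new one (email_verified) derived from it.
    db.execute(
        "UPDATE users SET verification_date = %s, email_verified = %s WHERE id = %s",
        (verification_date, verification_date is not None, user_id),
    )
```

Once every pod runs this dual-write code and the backfill has completed, the contract step is safe.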
Our Production Setup
We use ArgoCD with Argo Rollouts for all production deployments.
The result: zero deployment-related incidents in the past 3 months.
Conclusion
Zero-downtime deployment is not a single technique — it is a combination of proper readiness probes, graceful shutdown, canary analysis, and progressive delivery. Start with fixing your rolling updates (readiness probes, preStop hooks, PDBs), then graduate to canary deployments as your monitoring matures.
The goal is not just zero downtime — it is confidence. When you can deploy to production at 2 PM on a Tuesday without anxiety, you have won.
Need help with platform engineering?
TechSaaS provides expert consulting and managed services for cloud infrastructure, DevOps, and AI/ML operations.