
Zero-Downtime Kubernetes Deployments: Beyond Basic Rolling Updates

Rolling updates are just the beginning. Here is how to achieve true zero-downtime deployments with progressive delivery, canary releases, blue-green strategies, and proper readiness gates in Kubernetes.

Yash Pritwani
10 min read

Rolling Updates Are Not Zero-Downtime

If you think kubectl rollout restart gives you zero-downtime deployments, you are wrong. Rolling updates are the default strategy in Kubernetes, and they are better than nothing, but they do not guarantee zero downtime. Here is why:

  1. Pod readiness is not application readiness: Your pod might pass its readiness probe while the application is still warming caches, establishing database connections, or loading models
  2. Connection draining is often misconfigured: Existing connections get terminated when old pods are removed
  3. DNS propagation delays: Service endpoints take time to update across the cluster
  4. Database migrations: Schema changes cannot be rolled forward and backward simultaneously with a simple rolling update

True zero-downtime deployment requires a deliberate strategy that goes far beyond the defaults.

The Deployment Spectrum

From simplest to most sophisticated:

Rolling Update → Blue-Green → Canary → Progressive Delivery
   (basic)         (safe)    (smart)      (automated)

Each step adds complexity but reduces risk. The right choice depends on your traffic volume, risk tolerance, and team maturity.

Strategy 1: Proper Rolling Updates

Before you move to advanced strategies, make sure your rolling updates are actually configured correctly.

Readiness Probes That Actually Work

Your readiness probe should verify that the application is genuinely ready to serve traffic, not just that the process is running:

readinessProbe:
  httpGet:
    path: /healthz/ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
  successThreshold: 2  # Require 2 consecutive successes

The /healthz/ready endpoint should check:

  • Database connections are established
  • Cache is warmed (or warming is acceptable)
  • Dependent services are reachable
  • Application-specific initialization is complete
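As a sketch of what sits behind that endpoint, the handler below aggregates individual checks into a single 200/503 answer. The check functions and the `ReadyHandler` wiring are illustrative, not from a specific framework:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Illustrative checks: real ones would ping the database, probe the
# cache, and call downstream health endpoints.
def database_ready(): return True
def cache_warmed(): return True
def dependencies_reachable(): return True

CHECKS = [database_ready, cache_warmed, dependencies_reachable]

def readiness_status(checks):
    # Ready (200) only when every initialization check passes
    return 200 if all(check() for check in checks) else 503

class ReadyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        code = readiness_status(CHECKS) if self.path == "/healthz/ready" else 404
        self.send_response(code)
        self.end_headers()

# To serve on the probed port:
# HTTPServer(("", 8080), ReadyHandler).serve_forever()
```

The key property: a single failing check makes the whole endpoint return 503, so Kubernetes keeps the pod out of the Service until initialization genuinely finishes.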


Graceful Shutdown with PreStop Hooks

When Kubernetes terminates a pod, the sequence is:

  1. Pod is marked Terminating and its removal from Service endpoints begins
  2. PreStop hook runs (if configured)
  3. SIGTERM is sent to the container
  4. After terminationGracePeriodSeconds, SIGKILL is sent

The problem: endpoint removal happens asynchronously, in parallel with the pod's shutdown, so the pod can still receive traffic after SIGTERM. Fix this with a preStop hook:

lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 10"]
terminationGracePeriodSeconds: 60

The 10-second sleep gives the kube-proxy time to update iptables rules and stop routing traffic to the terminating pod.
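The application side of this is a SIGTERM handler that stops accepting new work while letting in-flight requests finish. A minimal sketch (the `draining` flag and `readiness_status` helper are illustrative):

```python
import signal
import threading

# Set on SIGTERM so the readiness probe fails and endpoints stop
# sending new traffic while in-flight work finishes.
draining = threading.Event()

def handle_sigterm(signum, frame):
    draining.set()  # stop accepting new requests; finish in-flight ones

signal.signal(signal.SIGTERM, handle_sigterm)

def readiness_status():
    # /healthz/ready returns 503 while draining, so the pod is dropped
    # from Service endpoints before the process exits
    return 503 if draining.is_set() else 200
```

Together with the preStop sleep, this gives the pod a clean window: traffic stops arriving, then existing requests complete within the grace period.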

Pod Disruption Budgets

Prevent Kubernetes from removing too many pods at once:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: "75%"
  selector:
    matchLabels:
      app: api
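The arithmetic behind a percentage minAvailable is worth seeing once. To my understanding, Kubernetes rounds the percentage up to whole pods when computing the minimum; a sketch under that assumption:

```python
import math

def allowed_disruptions(total_replicas, min_available_pct):
    # A percentage minAvailable is rounded UP to whole pods, so
    # "75%" of 8 replicas means at least 6 pods must stay up.
    min_available = math.ceil(total_replicas * min_available_pct / 100)
    return total_replicas - min_available

print(allowed_disruptions(8, 75))  # 2: at most 2 of 8 pods evicted at once
```

Note that with 3 replicas and minAvailable 75%, the rounding means zero voluntary disruptions are allowed, which can block node drains entirely.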

Strategy 2: Blue-Green Deployments

Blue-green deployments run two identical environments. Traffic switches from blue (current) to green (new) atomically.

Implementation with Kubernetes

# Blue deployment (current)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-blue
  labels:
    app: api
    version: blue
---
# Green deployment (new)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-green
  labels:
    app: api
    version: green
---
# Service points to the active color
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api
    version: blue  # Switch to "green" to cut over
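The cutover itself is a one-field patch of the Service selector. A sketch of building that patch body (the service name `api` matches the manifest above; how you apply it, via kubectl or a client library, is up to you):

```python
import json

def cutover_patch(new_color):
    # Strategic-merge patch: only the "version" key changes; other
    # selector keys (like "app") are merged in, not replaced.
    return {"spec": {"selector": {"version": new_color}}}

print(json.dumps(cutover_patch("green")))
```

Applied with something like `kubectl patch service api -p '{"spec":{"selector":{"version":"green"}}}'`, the switch is effectively atomic from the Service's point of view, which is what makes rollback equally instant: patch back to "blue".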

When to Use Blue-Green

  • Database migrations that require coordinated schema + code changes
  • Major version upgrades where rollback must be instant
  • Compliance requirements that mandate zero downtime
  • When you need to run pre-deployment smoke tests against production infrastructure

Tradeoffs

  • Double the resources: You need capacity for both environments simultaneously
  • Database state: Both environments share the database — schema migrations need careful planning
  • Session handling: User sessions must be externalized (Redis, DB) to survive the cutover

Strategy 3: Canary Deployments

Canary releases route a small percentage of traffic to the new version, gradually increasing as confidence builds.

Implementation with Istio

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api
spec:
  hosts:
    - api.internal
  http:
    - route:
        - destination:
            host: api
            subset: stable
          weight: 95
        - destination:
            host: api
            subset: canary
          weight: 5
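Conceptually, the mesh makes a weighted choice per request. A toy simulation of the 95/5 split above (purely illustrative, not how Envoy implements it):

```python
import random

# Routes mirroring the VirtualService's weighted destinations
routes = [{"subset": "stable", "weight": 95},
          {"subset": "canary", "weight": 5}]

def pick_subset(routes, rng=random):
    # Weighted choice: each request lands on a subset with probability
    # proportional to its weight
    subsets = [r["subset"] for r in routes]
    weights = [r["weight"] for r in routes]
    return rng.choices(subsets, weights=weights, k=1)[0]
```

Over many requests, roughly 5% hit the canary, which is enough volume to compare its metrics against stable without exposing most users to the new version.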

Canary Analysis

The power of canary deployments is automated analysis. Compare canary metrics against the stable version:

from dataclasses import dataclass

@dataclass
class Metrics:
    error_rate: float    # fraction of requests that errored
    p99_latency: float   # 99th-percentile latency
    success_rate: float  # fraction of requests that succeeded

def analyze_canary(stable_metrics: Metrics, canary_metrics: Metrics) -> str:
    # Compare error rates: 10% higher than stable = bad
    if canary_metrics.error_rate > stable_metrics.error_rate * 1.1:
        return "ROLLBACK"

    # Compare latency: 20% higher than stable = bad
    if canary_metrics.p99_latency > stable_metrics.p99_latency * 1.2:
        return "ROLLBACK"

    # Compare success rate: more than 1% below stable = bad
    if canary_metrics.success_rate < stable_metrics.success_rate * 0.99:
        return "ROLLBACK"

    return "CONTINUE"  # All clear, increase traffic

Strategy 4: Progressive Delivery with Argo Rollouts

Argo Rollouts extends Kubernetes with progressive delivery strategies that automate the entire canary process.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 25
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 50
        - pause: {duration: 10m}
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 100
      canaryService: api-canary
      stableService: api-stable

This automatically:

  1. Sends 5% of traffic to the new version
  2. Waits 5 minutes and checks metrics
  3. If metrics are good, increases to 25%
  4. Repeats analysis at each step
  5. Rolls back automatically if any analysis fails
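The control loop behind those steps can be sketched as follows. The `analysis` callback stands in for the AnalysisTemplate, and `set_canary_weight` is a stand-in for the traffic shift Argo Rollouts performs:

```python
def set_canary_weight(pct):
    # Stand-in for the per-step traffic shift
    print(f"canary weight -> {pct}%")

def run_rollout(steps, analysis):
    # steps: canary weights in order; analysis() returns True if healthy.
    # (The pause durations between steps are omitted here.)
    for weight in steps:
        set_canary_weight(weight)
        if weight < 100 and not analysis():
            set_canary_weight(0)  # abort: shift all traffic back to stable
            return "ROLLED_BACK"
    return "PROMOTED"

print(run_rollout([5, 25, 50, 100], analysis=lambda: True))
```

The important structural point: rollback is just another weight change, back to 0% canary, so an aborted rollout converges as quickly as a successful one.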

Analysis Templates

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{status=~"2.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
      successCondition: result[0] > 0.99
      failureLimit: 3
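The Prometheus query above reduces to a ratio of counter rates, and the success condition is then a plain comparison. A sketch with made-up request counts:

```python
def success_rate(ok, total):
    # Mirrors the PromQL ratio: rate of 2xx requests over all requests
    return ok / total

# successCondition from the template: result[0] > 0.99
result = success_rate(ok=9985, total=10000)
print(result > 0.99)  # True: this measurement passes
```

With failureLimit: 3, a few noisy measurements below the threshold are tolerated before the analysis run, and with it the rollout, is marked failed.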

Database Migrations: The Hard Part

None of the above strategies solve the database migration problem. Schema changes require special handling:

The Expand-Contract Pattern

  1. Expand: Add new columns/tables without removing old ones
  2. Deploy: Release code that works with both old and new schema
  3. Migrate: Backfill data into new columns
  4. Contract: Remove old columns after all pods use the new schema

-- Step 1: Expand (add new column, keep old)
ALTER TABLE users ADD COLUMN email_verified boolean DEFAULT false;

-- Step 2: Deploy code that writes to BOTH columns

-- Step 3: Backfill
UPDATE users SET email_verified = (verification_date IS NOT NULL);

-- Step 4: Contract (after all pods updated)
ALTER TABLE users DROP COLUMN verification_date;
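Step 2, the dual-write phase, looks roughly like this in application code. A sketch: the `db.execute` interface and function name are illustrative, only the column names come from the migration above:

```python
from datetime import datetime, timezone

def mark_email_verified(db, user_id):
    # Step 2 dual-write: set the NEW column and keep the OLD one in
    # sync, so pods on either schema version read consistent data.
    db.execute(
        "UPDATE users SET email_verified = true, verification_date = %s "
        "WHERE id = %s",
        (datetime.now(timezone.utc), user_id),
    )
```

Once every pod runs dual-write code and the backfill has completed, the old column is dead weight and the contract step can drop it safely.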

Never run destructive migrations during a deployment. Separate schema changes from code changes.

Our Production Setup

We use ArgoCD with Argo Rollouts for all production deployments:

  • GitOps: All deployment manifests are in Git, ArgoCD syncs automatically
  • Progressive delivery: Every production deployment uses canary analysis
  • Automated rollback: If error rate increases >5% or latency increases >20%, automatic rollback
  • Deployment windows: Non-emergency deployments only during business hours
  • Post-deployment verification: Synthetic tests run after every deployment

The result: zero deployment-related incidents in the past 3 months.

Conclusion

Zero-downtime deployment is not a single technique — it is a combination of proper readiness probes, graceful shutdown, canary analysis, and progressive delivery. Start with fixing your rolling updates (readiness probes, preStop hooks, PDBs), then graduate to canary deployments as your monitoring matures.

The goal is not just zero downtime — it is confidence. When you can deploy to production at 2 PM on a Tuesday without anxiety, you have won.

#Kubernetes #Zero-Downtime #Deployments #Canary #Blue-Green #ArgoCD #GitOps #Progressive Delivery
