# Practical Chaos Engineering: 3 Safe Experiments to Start With in Production

Everyone agrees chaos engineering is important. Almost nobody does it. The reason isn't technical -- it's fear. "What if we break production?" is the question that kills every chaos engineering initiative before it starts.

Here's the thing: production is already chaotic. AWS had 17 publicly disclosed incidents in 2025. Your upstream API providers have outages. Disks fill up. Pods get OOM-killed. The question isn't whether chaos will hit your system -- it's whether you'll discover your failure modes on your terms or your customers' terms.

I'm going to give you three specific chaos experiments you can run safely in production this week. Not theory. Not "first, build a chaos engineering culture." Three concrete experiments with exact configurations, blast radius controls, and rollback procedures.

## Prerequisites: The Safety Net

Before running any experiment, you need three things:

1. Observability: You must be able to see what's happening. Metrics (Prometheus), logs (Loki/ELK), and traces (Jaeger/Tempo) for the services you're testing. If you can't measure the impact, don't inject the fault.

2. Abort conditions: Define exactly when to stop the experiment. Example: "Abort if error rate exceeds 5% or p99 latency exceeds 2 seconds." (A sketch of encoding this as Prometheus queries follows this list.)

3. Blast radius controls: Every experiment targets a specific, limited scope. Never inject faults cluster-wide on your first attempt.
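
For example, the abort condition above maps directly onto queries you can watch (or alert on) while the experiment runs. A minimal sketch, assuming your services expose the common `http_requests_total` counter and an `http_request_duration_seconds` histogram with a `service` label:

```promql
# Abort if the error rate exceeds 5%...
sum(rate(http_requests_total{service="payment-service", code=~"5.."}[1m]))
  / sum(rate(http_requests_total{service="payment-service"}[1m])) > 0.05

# ...or if p99 latency exceeds 2 seconds
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket{service="payment-service"}[1m]))) > 2
```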

We use LitmusChaos for Kubernetes environments. It's open source, has a solid operator model, and provides built-in abort mechanisms. Install it with Helm:

```bash
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm install chaos litmuschaos/litmus \
  --namespace litmus --create-namespace \
  --set portal.frontend.service.type=ClusterIP
```
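
Before pointing it at anything, confirm the chaos infrastructure itself is healthy:

```bash
# All Litmus components should reach Running before you create a ChaosEngine
kubectl get pods -n litmus
```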

## Experiment 1: Pod Kill -- "Can Your Service Survive a Restart?"

This is the easiest, safest, and most revealing chaos experiment. Kill one pod of a multi-replica deployment and verify the service continues functioning.

Why this matters: You'd be surprised how many "highly available" services have hidden single points of failure. Sticky sessions that aren't properly drained. In-memory caches that cause cold-start latency spikes. Health checks that pass before the application is actually ready. We've seen services with 5 replicas go down from a single pod kill because of a badly configured readiness probe.

The hypothesis: "If we kill 1 of 3 pods of the payment-service, the service will continue processing requests with no errors and latency will remain under 500ms p99."

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-pod-kill
  namespace: production
spec:
  engineState: active
  appinfo:
    appns: production
    applabel: app=payment-service
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"
            - name: CHAOS_INTERVAL
              value: "30"
            - name: PODS_AFFECTED_PERC
              value: "33"
            - name: FORCE
              value: "false"
        probe:
          - name: payment-health
            type: httpProbe
            httpProbe/inputs:
              url: http://payment-service.production:8080/health
              method:
                get:
                  criteria: ==
                  responseCode: "200"
            mode: Continuous
            runProperties:
              probeTimeout: 5s
              interval: 2s
              retry: 3
              probePollingInterval: 1s
```
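
To run it, apply the manifest. To abort mid-experiment, set `engineState` to `stop` -- Litmus's built-in kill switch (the filename here is whatever you saved the manifest as):

```bash
kubectl apply -f payment-pod-kill.yaml

# Abort: Litmus stops injecting chaos and cleans up
kubectl patch chaosengine payment-pod-kill -n production \
  --type merge -p '{"spec":{"engineState":"stop"}}'
```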

What to watch (commands for tracking these live follow the list):

- Does Kubernetes schedule a replacement pod quickly? (Should be <10 seconds)
- Do in-flight requests fail or get retried? (Check your load balancer's retry config)
- Does the new pod become ready before traffic hits it? (readinessProbe correctness)
- Is there a latency spike during the transition? (Connection draining behavior)
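
A quick way to track the first and third of these from a terminal, using the labels from the ChaosEngine above:

```bash
# Watch the killed pod get replaced in real time (should take seconds)
kubectl get pods -n production -l app=payment-service -w

# Confirm the replacement joined the Service endpoints only after passing readiness
kubectl get endpoints payment-service -n production -o wide
```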

What we typically find: About 40% of services we test have at least one issue. The most common: readiness probes that return 200 too early (before connection pools, caches, or config are initialized), causing a burst of 503s when traffic routes to the not-yet-ready pod.

The fix is usually a few lines of probe config:

```yaml
readinessProbe:
  httpGet:
    path: /ready  # NOT /health -- separate endpoint that checks dependencies
    port: 8080
  initialDelaySeconds: 10  # Give the app time to warm up
  periodSeconds: 5
  failureThreshold: 3
```
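
The probe config only helps if `/ready` actually checks dependencies. A minimal sketch of the two endpoints in Flask -- `check_dependencies` is a placeholder for your real checks:

```python
from flask import Flask

app = Flask(__name__)

def check_dependencies() -> bool:
    # Placeholder: replace with a SELECT 1 against the connection pool,
    # a cache PING, and a check that config finished loading.
    return True

@app.route("/health")
def health():
    # Liveness: the process is up. Deliberately no dependency checks,
    # so a slow dependency can't trigger a restart loop.
    return "ok", 200

@app.route("/ready")
def ready():
    # Readiness: return 200 only once the pod can actually serve traffic.
    if check_dependencies():
        return "ready", 200
    return "not ready", 503
```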

## Experiment 2: Network Latency Injection -- "What Happens When Your Database Is Slow?"

This experiment adds artificial latency to network traffic between your application and its database (or any downstream dependency). It's more realistic than a pod kill because slow dependencies are far more common than dead ones -- and far more insidious.

Why this matters: A dead dependency triggers circuit breakers and fails fast. A slow dependency is worse: it holds connections open, backs up thread pools, and cascades failures upstream. This is how one slow database query takes down your entire platform.

The hypothesis: "If the database response time increases by 200ms, the API service will degrade gracefully -- response times will increase proportionally, but no requests will timeout and the error rate will stay below 1%."

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: db-latency-injection
  namespace: production
spec:
  engineState: active
  appinfo:
    appns: production
    applabel: app=api-service
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "120"
            - name: NETWORK_LATENCY
              value: "200"
            - name: DESTINATION_IPS
              value: "10.96.45.12"  # Database ClusterIP
            - name: DESTINATION_PORTS
              value: "5432"
            - name: PODS_AFFECTED_PERC
              value: "50"  # Only affect half the pods
            - name: NETWORK_INTERFACE
              value: "eth0"
        probe:
          - name: api-error-rate
            type: promProbe
            promProbe/inputs:
              endpoint: http://prometheus:9090
              query: >
                rate(http_requests_total{service="api-service",
                code=~"5.."}[1m]) /
                rate(http_requests_total{service="api-service"}[1m])
              comparator:
                type: float
                criteria: "<="
                value: "0.01"
            mode: Continuous
            runProperties:
              probeTimeout: 5s
              interval: 10s
```

What to watch:

- Does the application's connection pool handle the increased latency? (Watch for pool exhaustion)
- Do HTTP client timeouts fire correctly? (Many teams set timeouts to 30s, which is way too long -- a client-side sketch follows the pool fix below)
- Does the circuit breaker trip? (It should, if latency exceeds your SLO)
- Do retries amplify the problem? (Retry storms on a slow dependency make everything worse)

What we typically find: Connection pool exhaustion is the number one failure. Consider a service with a pool of 20 connections and a 30-second acquisition timeout: once queries slow down and stack up, each request holds a connection for up to the full 30 seconds, so the pool turns over at most 20 requests per 30 seconds -- about 0.67 requests/second to the database. That's often 10x less than normal throughput. The fix (illustrative pseudocode -- map the parameters onto your client library):

```python
# Before: too generous, fails silently
pool = create_pool(max_connections=20, timeout=30)

# After: fail fast, degrade gracefully
pool = create_pool(
    max_connections=50,        # More headroom
    timeout=3,                 # Fail fast
    max_lifetime=300,          # Recycle connections
    health_check_interval=10   # Detect dead connections
)
```
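
The same fail-fast thinking applies to the HTTP client timeouts and retries flagged in the watch list above. A sketch using `requests` with a bounded urllib3 retry policy (the endpoint URL is illustrative):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# At most 2 retries, only on idempotent GETs, with backoff -- no retry storms
retry = Retry(total=2, backoff_factor=0.2,
              status_forcelist=[502, 503, 504],
              allowed_methods=["GET"])
session.mount("http://", HTTPAdapter(max_retries=retry))

# (connect timeout, read timeout) in seconds: fail fast instead of queueing
resp = session.get("http://api-service.production:8080/items", timeout=(1, 3))
```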

## Experiment 3: DNS Failure -- "The Outage Nobody Practices For"

DNS failures are one of the most common causes of cascading outages, yet almost nobody tests for them. When DNS stops resolving, every service that depends on hostname resolution fails simultaneously. Unlike a single-service failure, DNS affects everything.

Why this matters: CoreDNS in Kubernetes handles all in-cluster DNS. If it hiccups (and it does -- we've seen it happen during node scaling events, after cluster upgrades, and during etcd compaction), every service-to-service call fails. The question is: do your services cache DNS responses? Do they retry? Or do they just crash?

The hypothesis: "If DNS resolution fails for 30 seconds, services with cached connections will continue functioning, and services that need new connections will retry and recover within 10 seconds after DNS returns."

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: dns-failure-test
  namespace: production
spec:
  engineState: active
  appinfo:
    appns: production
    applabel: app=api-service
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-dns-error
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"
            - name: TARGET_HOSTNAMES
              value: "payment-service.production.svc.cluster.local"
            - name: PODS_AFFECTED_PERC
              value: "50"
        probe:
          - name: existing-connections
            type: httpProbe
            httpProbe/inputs:
              url: http://api-service.production:8080/health
              method:
                get:
                  criteria: ==
                  responseCode: "200"
            mode: Continuous
            runProperties:
              probeTimeout: 5s
              interval: 5s
```

What to watch:

- Do existing connections (already resolved) continue working? (They should -- TCP connections don't need DNS after establishment)
- How quickly do services recover after DNS returns? (Watch for negative DNS caching -- some resolvers cache NXDOMAIN for 30 seconds)
- Do services log meaningful errors? (Or just "connection refused" with no DNS context)
- Does the application distinguish between "DNS failed" and "service is down"? (A way to observe resolution during the run is sketched below.)
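
One way to watch resolution behavior from inside an affected pod during the run (assuming the container image ships `nslookup`):

```bash
# Should fail during the 30-second chaos window, then recover
kubectl exec -n production deploy/api-service -- \
  nslookup payment-service.production.svc.cluster.local
```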

What we typically find: Most applications don't handle DNS failures gracefully at all. They throw an unhandled exception, the health check fails, Kubernetes restarts the pod, and the restarted pod also can't resolve DNS, creating a restart loop. The fix is twofold: tune the pod's DNS config so lookups fail fast but retry, and ensure the application's own DNS resolution retries -- with health checks that don't depend on external DNS:

```yaml
# ndots reduction -- speeds up DNS and reduces failure blast radius
apiVersion: v1
kind: Pod
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"     # Default is 5, which causes 4 extra lookups per query
      - name: timeout
        value: "2"     # Fail faster
      - name: attempts
        value: "3"     # But retry
  dnsPolicy: ClusterFirst
```
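
On the application side, a minimal sketch of retrying resolution with backoff instead of letting one transient `gaierror` crash the process:

```python
import socket
import time

def resolve_with_retry(host: str, port: int, attempts: int = 3, delay: float = 0.5):
    """Retry DNS resolution with exponential backoff instead of
    crashing on a transient failure."""
    for i in range(attempts):
        try:
            return socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
        except socket.gaierror:
            if i == attempts - 1:
                raise  # out of retries -- surface a DNS-specific error
            time.sleep(delay * (2 ** i))
```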

## Running Your First Experiment: A Step-by-Step Playbook

1. Pick Experiment 1 (pod kill). It's the safest and most educational.
2. Choose a non-critical service with 3+ replicas. Not your payment system on day one.
3. Set up a Grafana dashboard showing that service's error rate, latency, and pod count (example queries below).
4. Run the experiment during business hours with the team watching. Chaos engineering is a team sport.
5. Document what happened: Did the hypothesis hold? What surprised you?
6. Fix what you found: Typically readiness probes, timeout configs, or retry logic.
7. Run the same experiment again to verify the fix.
8. Graduate to Experiment 2, then 3, then start combining them.
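
For the dashboard in step 3, error rate and p99 latency can reuse the abort-condition queries from the prerequisites (minus the thresholds); ready pod count comes from kube-state-metrics, assuming its standard metric names and the usual pod-name prefix:

```promql
# Ready pods for the target service
sum(kube_pod_status_ready{namespace="production",
  pod=~"payment-service-.*", condition="true"})
```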

## The Maturity Ladder

Most teams think chaos engineering means "run Chaos Monkey in production and hope for the best." That's level 5. Start at level 1:

- Level 1: Pod kill on non-critical services (you are here)
- Level 2: Network latency and DNS chaos on specific services
- Level 3: Node failures, zone outages, dependency failures
- Level 4: Automated chaos experiments in CI/CD (every deploy gets chaos-tested)
- Level 5: Continuous chaos in production with automated rollback

Getting from Level 1 to Level 3 typically takes 3-6 months. That's fine. Each level teaches you something about your system that no amount of architecture diagrams or design docs can reveal.

## The ROI Argument

When your VP asks "why are we breaking production on purpose?", here's the answer: the average cost of a major production incident is $5,600 per minute (a widely cited Gartner figure). A 30-minute outage costs $168,000. Chaos engineering lets you find and fix the failure mode that would have caused that outage -- during business hours, with the team ready, with blast radius controls, instead of at 3 AM on a Saturday.

Every chaos experiment that reveals a weakness is an incident that never happens.

---

*We run chaos engineering programs for teams that want to build genuinely resilient systems, not just systems that look resilient on architecture diagrams. From first experiments to automated chaos pipelines, we can help you get started: https://techsaas.cloud/services.*

Need help with SRE?

TechSaaS provides expert consulting and managed services for cloud infrastructure, DevOps, and AI/ML operations.