# Practical Chaos Engineering: 3 Safe Experiments to Start With in Production
Everyone agrees chaos engineering is important. Almost nobody does it. The reason isn't technical -- it's fear. "What if we break production?" is the question that kills every chaos engineering initiative before it starts.
Here's the thing: production is already chaotic. AWS had 17 publicly disclosed incidents in 2025. Your upstream API providers have outages. Disks fill up. Pods get OOM-killed. The question isn't whether chaos will hit your system -- it's whether you'll discover your failure modes on your terms or your customers' terms.
I'm going to give you three specific chaos experiments you can run safely in production this week. Not theory. Not "first, build a chaos engineering culture." Three concrete experiments with exact configurations, blast radius controls, and rollback procedures.
## Prerequisites: The Safety Net
Before running any experiment, you need three things:
1. Observability: You must be able to see what's happening. Metrics (Prometheus), logs (Loki/ELK), and traces (Jaeger/Tempo) for the services you're testing. If you can't measure the impact, don't inject the fault.
2. Abort conditions: Define exactly when to stop the experiment. Example: "Abort if error rate exceeds 5% or p99 latency exceeds 2 seconds." A sketch for automating this check follows this list.
3. Blast radius controls: Every experiment targets a specific, limited scope. Never inject faults cluster-wide on your first attempt.
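Abort conditions are only useful if something actually enforces them. Here's a minimal watchdog sketch, assuming a Prometheus at `http://prometheus:9090` and a ChaosEngine named `payment-pod-kill` (both names are illustrative). Stopping a running Litmus experiment works by patching the ChaosEngine's `engineState` to `stop`:

```python
# Hypothetical abort watchdog: poll Prometheus, stop the ChaosEngine if the
# error rate crosses the abort threshold. All names are illustrative.
import subprocess
import time

import requests

PROM_URL = "http://prometheus:9090/api/v1/query"
ERROR_RATE_QUERY = (
    'rate(http_requests_total{service="payment-service",code=~"5.."}[1m]) '
    '/ rate(http_requests_total{service="payment-service"}[1m])'
)
ABORT_THRESHOLD = 0.05  # abort if error rate exceeds 5%

def error_rate() -> float:
    resp = requests.get(PROM_URL, params={"query": ERROR_RATE_QUERY}, timeout=5)
    results = resp.json()["data"]["result"]
    return float(results[0]["value"][1]) if results else 0.0

def abort_experiment() -> None:
    # Litmus stops a running experiment when engineState is set to "stop"
    subprocess.run([
        "kubectl", "patch", "chaosengine", "payment-pod-kill",
        "-n", "production", "--type", "merge",
        "-p", '{"spec":{"engineState":"stop"}}',
    ], check=True)

while True:
    if error_rate() > ABORT_THRESHOLD:
        abort_experiment()
        break
    time.sleep(5)
```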
We use Litmus Chaos for Kubernetes environments. It's open-source, has a solid operator model, and provides built-in abort mechanisms. Install it:
```bash
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm install chaos litmuschaos/litmus \
  --namespace litmus --create-namespace \
  --set portal.frontend.service.type=ClusterIP
```

## Experiment 1: Pod Kill -- "Can Your Service Survive a Restart?"
This is the easiest, safest, and most revealing chaos experiment. Kill one pod of a multi-replica deployment and verify the service continues functioning.
Why this matters: You'd be surprised how many "highly available" services have hidden single points of failure. Sticky sessions that aren't properly drained. In-memory caches that cause cold-start latency spikes. Health checks that pass before the application is actually ready. We've seen services with 5 replicas go down from a single pod kill because of a badly configured readiness probe.
The hypothesis: "If we kill 1 of 3 pods of the payment-service, the service will continue processing requests with no errors and latency will remain under 500ms p99."
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-pod-kill
  namespace: production
spec:
  engineState: active
  appinfo:
    appns: production
    applabel: app=payment-service
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"
            - name: CHAOS_INTERVAL
              value: "30"
            - name: PODS_AFFECTED_PERC
              value: "33"  # 33% of 3 replicas = 1 pod
            - name: FORCE
              value: "false"  # Graceful deletion, not SIGKILL
        probe:
          - name: payment-health
            type: httpProbe
            httpProbe/inputs:
              url: http://payment-service.production:8080/health
              method:
                get:
                  criteria: ==
                  responseCode: "200"
            mode: Continuous
            runProperties:
              probeTimeout: 5s
              interval: 2s
              retry: 3
              probePollingInterval: 1s
```

What to watch: the error rate and p99 latency for payment-service (the hypothesis allows no errors and under 500ms), the pod count as the killed pod is rescheduled, and whether the continuous health probe stays green for the full chaos window.
What we typically find: About 40% of services we test have at least one issue. The most common: readiness probes that return 200 too early (before connection pools, caches, or config are initialized), causing a burst of 503s when traffic routes to the not-yet-ready pod.
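The application-side cure is a readiness endpoint that refuses traffic until dependencies are warm. A minimal sketch, assuming Flask and a hypothetical `pool` object (none of these names come from the configs above):

```python
# Split liveness from readiness: /health says "the process is up",
# /ready says "the process can actually serve traffic".
from flask import Flask

app = Flask(__name__)
pool = None  # assume this is populated once startup/warm-up completes

@app.route("/health")
def health():
    # Liveness only: never check dependencies here, or a transient
    # database/DNS blip will get the pod restarted.
    return "ok", 200

@app.route("/ready")
def ready():
    if pool is None:
        return "warming up", 503  # not ready: keep traffic away
    try:
        pool.ping()  # illustrative: verify a dependency is reachable
    except Exception:
        return "dependency unavailable", 503
    return "ready", 200
```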
On the Kubernetes side, the fix is usually one line:
```yaml
readinessProbe:
  httpGet:
    path: /ready  # NOT /health -- separate endpoint that checks dependencies
    port: 8080
  initialDelaySeconds: 10  # Give the app time to warm up
  periodSeconds: 5
  failureThreshold: 3
```

## Experiment 2: Network Latency Injection -- "What Happens When Your Database Is Slow?"
This experiment adds artificial latency to network traffic between your application and its database (or any downstream dependency). It's more realistic than a pod kill because slow dependencies are far more common than dead ones -- and far more insidious.
Why this matters: A dead dependency triggers circuit breakers and fails fast. A slow dependency is worse: it holds connections open, backs up thread pools, and cascades failures upstream. This is how one slow database query takes down your entire platform.
The hypothesis: "If the database response time increases by 200ms, the API service will degrade gracefully -- response times will increase proportionally, but no requests will timeout and the error rate will stay below 1%."
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: db-latency-injection
  namespace: production
spec:
  engineState: active
  appinfo:
    appns: production
    applabel: app=api-service
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "120"
            - name: NETWORK_LATENCY
              value: "200"  # Added latency in milliseconds
            - name: DESTINATION_IPS
              value: "10.96.45.12"  # Database ClusterIP
            - name: DESTINATION_PORTS
              value: "5432"
            - name: PODS_AFFECTED_PERC
              value: "50"  # Only affect half the pods
            - name: NETWORK_INTERFACE
              value: "eth0"
        probe:
          - name: api-error-rate
            type: promProbe
            promProbe/inputs:
              endpoint: http://prometheus:9090
              query: >
                rate(http_requests_total{service="api-service",code=~"5.."}[1m])
                /
                rate(http_requests_total{service="api-service"}[1m])
              comparator:
                type: float
                criteria: "<="
                value: "0.01"
            mode: Continuous
            runProperties:
              probeTimeout: 5s
              interval: 10s
```

What to watch: the API error rate against the 1% threshold (the promProbe above checks it continuously), p99 latency on the affected pods versus the untouched half, and connection pool saturation -- active connections, acquire wait time, timeouts.
What we typically find: Connection pool exhaustion is the number one failure. With a pool of 20 connections and a 30-second client timeout, queries that back up under the added latency hold their connections until the timeout -- so throughput collapses toward 20 connections / 30 seconds ≈ 0.67 requests/second to the database. That's often 10x or more below normal throughput. The fix:
```python
# Before: too generous, fails silently
pool = create_pool(max_connections=20, timeout=30)

# After: fail fast, degrade gracefully
# (create_pool is generic -- parameter names vary by driver)
pool = create_pool(
    max_connections=50,        # More headroom
    timeout=3,                 # Fail fast
    max_lifetime=300,          # Recycle connections
    health_check_interval=10,  # Detect dead connections
)
```
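The 0.67 figure is just Little's law: throughput is bounded by pool size divided by how long each request holds a connection. A quick back-of-envelope check using the numbers above, assuming (worst case) that every query hangs until the client timeout:

```python
# Throughput ceiling = pool size / connection hold time (Little's law).
# Worst case under latency: every query holds its connection until the
# client-side timeout, so the timeout becomes the hold time.
old_floor = 20 / 30   # 20 connections, 30 s timeout -> ~0.67 req/s
new_floor = 50 / 3    # 50 connections,  3 s timeout -> ~16.7 req/s

print(f"before the fix: {old_floor:.2f} req/s")
print(f"after the fix:  {new_floor:.1f} req/s")
```

Failing fast doesn't make the database any faster, but it keeps connections turning over so healthy requests still get through.

## Experiment 3: DNS Failure -- "The Outage Nobody Practices For"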
DNS failures are one of the most common causes of cascading outages, yet almost nobody tests for them. When DNS stops resolving, every service that depends on hostname resolution fails simultaneously. Unlike a single-service failure, DNS affects everything.
Why this matters: CoreDNS in Kubernetes handles all in-cluster DNS. If it hiccups (and it does -- we've seen it happen during node scaling events, after cluster upgrades, and during etcd compaction), every service-to-service call fails. The question is: do your services cache DNS responses? Do they retry? Or do they just crash?
The hypothesis: "If DNS resolution fails for 30 seconds, services with cached connections will continue functioning, and services that need new connections will retry and recover within 10 seconds after DNS returns."
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: dns-failure-test
  namespace: production
spec:
  engineState: active
  appinfo:
    appns: production
    applabel: app=api-service
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-dns-error
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"
            # Litmus expects a JSON-style list of hostnames/keywords
            - name: TARGET_HOSTNAMES
              value: '["payment-service.production.svc.cluster.local"]'
            - name: PODS_AFFECTED_PERC
              value: "50"
        probe:
          - name: existing-connections
            type: httpProbe
            httpProbe/inputs:
              url: http://api-service.production:8080/health
              method:
                get:
                  criteria: ==
                  responseCode: "200"
            mode: Continuous
            runProperties:
              probeTimeout: 5s
              interval: 5s
```

What to watch: whether requests riding existing connections keep succeeding, how quickly services recover once DNS returns (the hypothesis says within 10 seconds), and whether any pods fall into a restart loop.
What we typically find: Most applications don't handle DNS failures gracefully at all. They throw an unhandled exception, the health check fails, Kubernetes restarts the pod, and the restarted pod also can't resolve DNS, creating a restart loop. The fix is twofold: give your application's DNS resolution retries, and make sure your health checks don't depend on external DNS.
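At the application level, a minimal sketch of retry-with-fallback resolution (the wrapper and its cache are illustrative, not any specific library's API):

```python
# Hypothetical wrapper: retry DNS resolution with backoff, and fall back to
# the last known-good address instead of crashing on a transient failure.
import socket
import time

_last_good = {}  # hostname -> last successfully resolved IP

def resolve_with_retry(hostname, attempts=3, backoff_s=0.5):
    for attempt in range(attempts):
        try:
            ip = socket.gethostbyname(hostname)
            _last_good[hostname] = ip
            return ip
        except socket.gaierror:
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    if hostname in _last_good:
        return _last_good[hostname]  # degrade gracefully: use cached address
    raise RuntimeError(f"DNS resolution failed for {hostname}")
```

At the cluster level, tuning the pod's DNS options shrinks the blast radius further: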
```yaml
# ndots reduction -- speeds up DNS and reduces failure blast radius
apiVersion: v1
kind: Pod
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"  # Default is 5, which causes 4 extra lookups per query
      - name: timeout
        value: "2"  # Fail faster
      - name: attempts
        value: "3"  # But retry
  dnsPolicy: ClusterFirst
```

## Running Your First Experiment: A Step-by-Step Playbook
1. Pick Experiment 1 (pod kill). It's the safest and most educational.
2. Choose a non-critical service with 3+ replicas. Not your payment system on day one.
3. Set up a Grafana dashboard showing that service's error rate, latency, and pod count.
4. Run the experiment during business hours with the team watching. Chaos engineering is a team sport.
5. Document what happened: Did the hypothesis hold? What surprised you?
6. Fix what you found: typically readiness probes, timeout configs, or retry logic.
7. Run the same experiment again to verify the fix (see the verdict-check sketch below).
8. Graduate to Experiment 2, then 3, then start combining them.
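For steps 5 and 7, Litmus records each run's outcome in a ChaosResult resource named `<engine>-<experiment>`. A small sketch for reading the verdict from a script (field path per current Litmus docs; it may vary across versions):

```python
# Check whether the last run Passed or Failed by reading the ChaosResult.
import subprocess

def chaos_verdict(engine, experiment, namespace):
    name = f"{engine}-{experiment}"  # e.g. payment-pod-kill-pod-delete
    out = subprocess.run(
        ["kubectl", "get", "chaosresult", name, "-n", namespace,
         "-o", "jsonpath={.status.experimentStatus.verdict}"],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.strip()  # "Pass", "Fail", "Stopped", or "Awaited"

print(chaos_verdict("payment-pod-kill", "pod-delete", "production"))
```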
## The Maturity Ladder
Most teams think chaos engineering means "run Chaos Monkey in production and hope for the best." That's level 5. Start at level 1:
- Level 1: Pod kill on non-critical services (you are here)
- Level 2: Network latency and DNS chaos on specific services
- Level 3: Node failures, zone outages, dependency failures
- Level 4: Automated chaos experiments in CI/CD (every deploy gets chaos-tested)
- Level 5: Continuous chaos in production with automated rollback
Getting from Level 1 to Level 3 typically takes 3-6 months. That's fine. Each level teaches you something about your system that no amount of architecture diagrams or design docs can reveal.
## The ROI Argument
When your VP asks "why are we breaking production on purpose?" here's the answer: by Gartner's widely cited estimate, the average cost of a major production incident is $5,600 per minute, so a 30-minute outage costs $168,000. Chaos engineering lets you find and fix the failure mode that would have caused that outage -- during business hours, with the team ready, with blast radius controls, instead of at 3 AM on a Saturday.
Every chaos experiment that reveals a weakness is an incident that never happens.
---
*We run chaos engineering programs for teams that want to build genuinely resilient systems, not just systems that look resilient on architecture diagrams. From first experiments to automated chaos pipelines, we can help you get started: https://techsaas.cloud/services.*