Chaos Engineering for Small Teams: You Do Not Need Netflix to Break Things

Chaos engineering is not just for FAANG. Here is how to run meaningful failure experiments on a 3-person team with Docker Compose, without Chaos Monkey or a dedicated SRE.

Yash Pritwani
11 min read

Chaos engineering sounds like something only Netflix, Google, and Amazon can afford. Chaos Monkey. Gremlin. LitmusChaos. Tools built for dedicated platform teams, with setup and operational costs to match.

But the core idea is simple: break things on purpose, in a controlled way, before they break on their own at 3 AM.

You do not need a dedicated SRE team. You do not need Chaos Monkey. You need a Docker Compose setup and 30 minutes a month.

Why Small Teams Need Chaos Engineering More Than Big Teams

Big teams have redundancy. If one service goes down, there are five engineers who know how to fix it. There are runbooks. There are on-call rotations.

Small teams have single points of failure everywhere:

  • One person who knows how the database is configured
  • One deployment path that has never been tested under failure
  • One load balancer that has never actually failed over
  • Backup scripts that run every night but have never been restored

Chaos engineering for small teams is not about sophisticated failure injection. It is about answering one question: what happens when this thing breaks?

The Simplest Chaos Experiments

Experiment 1: Kill a Container

# Your most basic chaos experiment
docker kill api-server

# Questions to answer:
# 1. Does the load balancer route to healthy instances?
# 2. Does the container restart automatically?
# 3. How long until the service is back?
# 4. Did any requests fail? How many?
# 5. Did anyone get alerted?

This takes 10 seconds to run and answers five critical questions about your resilience. Most teams have never done it.
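If the answer to question 2 is no, the usual first fix is a restart policy plus a health check in your Compose file. A minimal sketch, assuming a service named `api-server` with a `/health` endpoint on port 8080. One caveat: Docker skips restart policies when you stop or kill a container manually, so to exercise the policy itself, also test a real crash of the process inside the container.

```yaml
services:
  api-server:
    restart: unless-stopped   # restart after crashes, but not after an explicit manual stop
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]  # assumes port 8080
      interval: 10s
      timeout: 3s
      retries: 3
```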

Experiment 2: Fill the Disk

# Create a large file to simulate disk pressure (tune count to the disk size)
docker exec postgres dd if=/dev/zero of=/tmp/fill bs=1M count=500

# Questions to answer:
# 1. Does PostgreSQL handle disk pressure gracefully?
# 2. Do your logs rotate, or do they fill the disk?
# 3. Does your monitoring alert on disk usage?
# 4. Can the service recover after disk is freed?

# Clean up
docker exec postgres rm /tmp/fill

Experiment 3: Network Partition

# Block network between two containers
docker network disconnect app-network api-server

# Questions to answer:
# 1. Does the web frontend show a useful error?
# 2. Do retries work correctly?
# 3. Are connections pooled, or does every retry open a new one?
# 4. How long until the circuit breaker trips?

# Reconnect
docker network connect app-network api-server
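Question 2 deserves its own sketch. If your client retries immediately and forever, a partition turns into a retry storm; exponential backoff is the standard fix. A minimal shell version, with a hypothetical endpoint in the example; real clients should also add jitter and cap the delay:

```shell
#!/bin/bash
# Retry a command with exponential backoff -- the behavior question 2 is probing.
retry_with_backoff() {
    local max_attempts=$1; shift
    local attempt=1 delay=1
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        sleep "$delay"
        delay=$((delay * 2))       # doubling the wait avoids hammering a partitioned service
        attempt=$((attempt + 1))
    done
}

# Example (hypothetical endpoint), with a hard per-attempt timeout:
#   retry_with_backoff 5 curl -sf --max-time 2 http://localhost/api/test
```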

Experiment 4: Slow Database

# Add latency to database responses using tc (traffic control)
# (requires the iproute2 package in the image and the NET_ADMIN capability)
docker exec postgres tc qdisc add dev eth0 root netem delay 500ms

# Questions to answer:
# 1. Do API requests timeout gracefully?
# 2. Does the connection pool exhaust?
# 3. Do retries amplify the problem (retry storm)?
# 4. What does the user see?

# Remove latency
docker exec postgres tc qdisc del dev eth0 root

Experiment 5: DNS Failure

# Break DNS resolution inside a container
docker exec api-server sh -c "echo 'nameserver 192.0.2.1' > /etc/resolv.conf"

# Questions to answer:
# 1. Can the service still reach other containers by IP?
# 2. Do cached DNS entries keep working?
# 3. How does the service log DNS failures?
# 4. How long until someone notices?

# Fix
docker exec api-server sh -c "echo 'nameserver 127.0.0.11' > /etc/resolv.conf"

Building a Monthly Chaos Routine

Week 1: Container Failures

Pick a different service each month and kill it:

#!/bin/bash
# chaos-container.sh
SERVICE=$1
echo "=== Chaos: Killing $SERVICE ==="
echo "Time: $(date -u)"

# Record baseline
curl -s -o /dev/null -w "Pre-kill health: %{http_code}\n" http://localhost/health

# Kill the service
docker kill "$SERVICE"

# Monitor recovery (up to 60 seconds) and report when the health check passes
START=$(date +%s)
for i in $(seq 1 30); do
    CODE=$(curl -s -o /dev/null -w "%{http_code}" http://localhost/health)
    echo "$(date -u) - Health: $CODE"
    if [ "$CODE" = "200" ]; then
        echo "=== Recovery time: $(( $(date +%s) - START ))s ==="
        exit 0
    fi
    sleep 2
done

echo "=== Service did not recover within 60 seconds ==="

Week 2: Dependency Failures

Break an external dependency (Redis, PostgreSQL, S3):

#!/bin/bash
# chaos-dependency.sh
DEPENDENCY=$1
echo "=== Chaos: Stopping $DEPENDENCY ==="

# Stop the dependency
docker stop $DEPENDENCY

# Test the application for 2 minutes
for i in $(seq 1 60); do
    RESPONSE=$(curl -s -w "\n%{http_code}" http://localhost/api/test)
    CODE=$(echo "$RESPONSE" | tail -1)
    BODY=$(echo "$RESPONSE" | head -1)
    echo "$(date -u) - $CODE - $BODY"
    sleep 2
done

# Restart the dependency
docker start $DEPENDENCY
echo "=== Dependency restored ==="
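A common mitigation this experiment exposes the need for: serve the last good response when a dependency is down, rather than erroring. A sketch of the pattern in shell, with a hypothetical URL and cache path; real services do this in application code or a caching proxy:

```shell
#!/bin/bash
# Fetch a URL, falling back to the last successful response if the call fails.
fetch_with_fallback() {
    local url=$1 cache=$2
    if curl -sf --max-time 2 "$url" -o "$cache.tmp"; then
        mv "$cache.tmp" "$cache"   # refresh the cache only on success
    fi
    cat "$cache" 2>/dev/null       # serve the last good copy, fresh or stale
}

# Example (hypothetical): fetch_with_fallback http://localhost/api/test /tmp/api-test.cache
```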

Week 3: Resource Exhaustion

Starve a container of CPU or memory:

# Limit API container to 10% CPU
docker update --cpus="0.1" api-server

# Run load test (hey is a simple HTTP load generator; any load tool works)
hey -z 60s -c 50 http://localhost/api/endpoint

# Check: Did the service degrade gracefully?
# Check: Did the load balancer shift traffic?
# Check: Did monitoring alert?

# Restore
docker update --cpus="2.0" api-server

Week 4: Backup and Recovery

The most important experiment most teams skip:

# Test: Can you actually restore from backup?
# 1. Take a fresh backup
docker exec postgres pg_dump -U app -Fc production > /tmp/backup.dump

# 2. Create a test database
docker exec postgres createdb -U app production_restore_test

# 3. Restore the backup
docker exec -i postgres pg_restore -U app -d production_restore_test < /tmp/backup.dump

# 4. Verify data integrity
docker exec postgres psql -U app -d production_restore_test \
  -c "SELECT COUNT(*) FROM users; SELECT COUNT(*) FROM orders;"

# 5. Clean up
docker exec postgres dropdb -U app production_restore_test

If this fails, your backups are not backups — they are false confidence.
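Restores prove correctness; a freshness check catches the cron job that silently stopped running. A small sketch you could run alongside the restore test, assuming GNU coreutils (`stat -c`) and a hypothetical backup path:

```shell
#!/bin/bash
# Fail if the backup file is missing, empty, or older than the expected cadence.
check_backup() {
    local file=$1 max_age_hours=${2:-26}   # 26h default: daily backups plus slack
    if [ ! -s "$file" ]; then
        echo "FAIL: $file is missing or empty"
        return 1
    fi
    local age_hours=$(( ($(date +%s) - $(stat -c %Y "$file")) / 3600 ))
    if [ "$age_hours" -gt "$max_age_hours" ]; then
        echo "FAIL: $file is ${age_hours}h old"
        return 1
    fi
    echo "OK: $file is ${age_hours}h old"
}

# Example (hypothetical path): check_backup /backups/production-latest.dump
```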

The Chaos Engineering Checklist

After each experiment, document:

## Chaos Experiment: [Name]
**Date:** YYYY-MM-DD
**Target:** [Service/Component]
**Hypothesis:** We expected [X] to happen when [Y] failed

### What Actually Happened
- [Observation 1]
- [Observation 2]

### Surprises
- [Unexpected behavior]

### Action Items
| Fix | Owner | Deadline |
|-----|-------|----------|
| Add health check to X | @dev | Next sprint |
| Fix retry logic in Y | @dev | This week |

### Follow-up Experiment
[What to test next based on findings]

Automation: Scheduled Chaos

Once you have confidence in your manual experiments, automate them:

# docker-compose.chaos.yml
services:
  chaos-runner:
    image: alpine:latest
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - ./chaos-scripts:/scripts
    entrypoint: /bin/sh
    command: >
      -c "apk add --no-cache docker-cli curl &&
          /scripts/weekly-chaos.sh"
    profiles:
      - chaos

Run weekly with a cron job:

# Every Monday at 10 AM (when the team is awake and can respond)
0 10 * * 1 cd /app && docker compose --profile chaos run --rm chaos-runner

Never run chaos experiments on Friday afternoons or outside business hours. The goal is to learn, not to create real incidents.
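The compose file mounts `./chaos-scripts`, but `weekly-chaos.sh` itself is up to you. A minimal sketch of what it might contain; the service name, health URL, and wait time are assumptions, and a real script would end with `run_weekly_chaos "$@"`:

```shell
#!/bin/sh
# weekly-chaos.sh -- kill one service, wait, and check that it came back.
# RECOVERY_WAIT is tunable so a dry run does not block for 30 seconds.
run_weekly_chaos() {
    SERVICE=${1:-api-server}
    echo "=== Weekly chaos: killing $SERVICE at $(date -u) ==="
    docker kill "$SERVICE" || return 1
    sleep "${RECOVERY_WAIT:-30}"
    CODE=$(curl -s -o /dev/null -w "%{http_code}" http://localhost/health)
    echo "Post-chaos health: $CODE"
    if [ "$CODE" != "200" ]; then
        echo "ALERT: $SERVICE did not recover" >&2
        return 1
    fi
}

# Invoked by the chaos-runner container:
#   run_weekly_chaos api-server
```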

What Good Looks Like

After 3-6 months of monthly chaos experiments, you should see:

  1. Faster recovery times — you have practiced restoring services so many times it becomes routine
  2. Better monitoring — every experiment reveals gaps in alerting
  3. Confidence in deployments — you know what happens when things fail because you have seen it
  4. Documented runbooks — each experiment produces documentation about how services behave under failure
  5. Fewer 3 AM surprises — the failures that used to wake you up are now handled automatically

The Bottom Line

Chaos engineering is not a tool or a platform. It is a practice: break things on purpose, observe what happens, fix the gaps, repeat.

For small teams, the experiments are simple: kill a container, break a dependency, fill a disk, restore a backup. Each one takes 30 minutes and reveals gaps that would otherwise become 3 AM incidents.

Start this month. Pick one service. Kill it. Watch what happens. Fix what breaks. You will learn more in 30 minutes than in 30 hours of reading runbooks.

#chaos-engineering #resilience #docker #sre #fault-injection #reliability #devops
