Chaos Engineering for Small Teams: You Do Not Need Netflix to Break Things

Chaos engineering is not just for FAANG. Here is how to run meaningful failure experiments on a 3-person team with Docker Compose, without Chaos Monkey or a dedicated SRE.

Yash Pritwani
11 min read

Chaos engineering sounds like something only Netflix, Google, and Amazon can afford. Chaos Monkey. Gremlin. LitmusChaos. Tools built for dedicated platform teams, with setup and operational costs to match.

But the core idea is simple: break things on purpose, in a controlled way, before they break on their own at 3 AM.

You do not need a dedicated SRE team. You do not need Chaos Monkey. You need a Docker Compose setup and 30 minutes a month.

Why Small Teams Need Chaos Engineering More Than Big Teams

Big teams have redundancy. If one service goes down, there are five engineers who know how to fix it. There are runbooks. There are on-call rotations.

Small teams have single points of failure everywhere:

  • One person who knows how the database is configured
  • One deployment path that has never been tested under failure
  • One load balancer that has never actually failed over
  • Backup scripts that run every night but have never been restored

Chaos engineering for small teams is not about sophisticated failure injection. It is about answering one question: what happens when this thing breaks?

The Simplest Chaos Experiments

Experiment 1: Kill a Container

# Your most basic chaos experiment
docker kill api-server

# Questions to answer:
# 1. Does the load balancer route to healthy instances?
# 2. Does the container restart automatically?
# 3. How long until the service is back?
# 4. Did any requests fail? How many?
# 5. Did anyone get alerted?

This takes 10 seconds to run and answers five critical questions about your resilience. Most teams have never done it.
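If the answer to question 2 is no, the usual first fix is a restart policy plus a health check in your Compose file. A minimal sketch, assuming a service named `api-server` with a `/health` endpoint on port 8080. One caveat: Docker skips restart policies when you stop or kill a container manually, so to exercise the policy itself, also test a real crash of the process inside the container.

```yaml
services:
  api-server:
    restart: unless-stopped   # restart after crashes, but not after an explicit manual stop
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]  # assumes port 8080
      interval: 10s
      timeout: 3s
      retries: 3
```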

Experiment 2: Fill the Disk

# Create a large file to simulate disk pressure (tune count to the disk size)
docker exec postgres dd if=/dev/zero of=/tmp/fill bs=1M count=500

# Questions to answer:
# 1. Does PostgreSQL handle disk pressure gracefully?
# 2. Do your logs rotate, or do they fill the disk?
# 3. Does your monitoring alert on disk usage?
# 4. Can the service recover after disk is freed?

# Clean up
docker exec postgres rm /tmp/fill

Experiment 3: Network Partition

# Block network between two containers
docker network disconnect app-network api-server

# Questions to answer:
# 1. Does the web frontend show a useful error?
# 2. Do retries work correctly?
# 3. Are connections pooled, or does every retry open a new one?
# 4. How long until the circuit breaker trips?

# Reconnect
docker network connect app-network api-server
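Question 2 deserves its own sketch. If your client retries immediately and forever, a partition turns into a retry storm; exponential backoff is the standard fix. A minimal shell version, with a hypothetical endpoint in the example; real clients should also add jitter and cap the delay:

```shell
#!/bin/bash
# Retry a command with exponential backoff -- the behavior question 2 is probing.
retry_with_backoff() {
    local max_attempts=$1; shift
    local attempt=1 delay=1
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        sleep "$delay"
        delay=$((delay * 2))       # doubling the wait avoids hammering a partitioned service
        attempt=$((attempt + 1))
    done
}

# Example (hypothetical endpoint), with a hard per-attempt timeout:
#   retry_with_backoff 5 curl -sf --max-time 2 http://localhost/api/test
```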

Experiment 4: Slow Database

# Add latency to database responses using tc (traffic control)
# (requires the iproute2 package in the image and the NET_ADMIN capability)
docker exec postgres tc qdisc add dev eth0 root netem delay 500ms

# Questions to answer:
# 1. Do API requests timeout gracefully?
# 2. Does the connection pool exhaust?
# 3. Do retries amplify the problem (retry storm)?
# 4. What does the user see?

# Remove latency
docker exec postgres tc qdisc del dev eth0 root

Experiment 5: DNS Failure

# Break DNS resolution inside a container
docker exec api-server sh -c "echo 'nameserver 192.0.2.1' > /etc/resolv.conf"

# Questions to answer:
# 1. Can the service still reach other containers by IP?
# 2. Do cached DNS entries keep working?
# 3. How does the service log DNS failures?
# 4. How long until someone notices?

# Fix
docker exec api-server sh -c "echo 'nameserver 127.0.0.11' > /etc/resolv.conf"

Building a Monthly Chaos Routine

Week 1: Container Failures

Pick a different service each month and kill it:

#!/bin/bash
# chaos-container.sh
SERVICE=$1
echo "=== Chaos: Killing $SERVICE ==="
echo "Time: $(date -u)"

# Record baseline
curl -s -o /dev/null -w "Pre-kill health: %{http_code}\n" http://localhost/health

# Kill the service
docker kill "$SERVICE"

# Monitor recovery (up to 60 seconds) and report when the health check passes
START=$(date +%s)
for i in $(seq 1 30); do
    CODE=$(curl -s -o /dev/null -w "%{http_code}" http://localhost/health)
    echo "$(date -u) - Health: $CODE"
    if [ "$CODE" = "200" ]; then
        echo "=== Recovery time: $(( $(date +%s) - START ))s ==="
        exit 0
    fi
    sleep 2
done

echo "=== Service did not recover within 60 seconds ==="

Week 2: Dependency Failures

Break an external dependency (Redis, PostgreSQL, S3):

#!/bin/bash
# chaos-dependency.sh
DEPENDENCY=$1
echo "=== Chaos: Stopping $DEPENDENCY ==="

# Stop the dependency
docker stop $DEPENDENCY

# Test the application for 2 minutes
for i in $(seq 1 60); do
    RESPONSE=$(curl -s -w "\n%{http_code}" http://localhost/api/test)
    CODE=$(echo "$RESPONSE" | tail -1)
    BODY=$(echo "$RESPONSE" | head -1)
    echo "$(date -u) - $CODE - $BODY"
    sleep 2
done

# Restart the dependency
docker start $DEPENDENCY
echo "=== Dependency restored ==="
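A common mitigation this experiment exposes the need for: serve the last good response when a dependency is down, rather than erroring. A sketch of the pattern in shell, with a hypothetical URL and cache path; real services do this in application code or a caching proxy:

```shell
#!/bin/bash
# Fetch a URL, falling back to the last successful response if the call fails.
fetch_with_fallback() {
    local url=$1 cache=$2
    if curl -sf --max-time 2 "$url" -o "$cache.tmp"; then
        mv "$cache.tmp" "$cache"   # refresh the cache only on success
    fi
    cat "$cache" 2>/dev/null       # serve the last good copy, fresh or stale
}

# Example (hypothetical): fetch_with_fallback http://localhost/api/test /tmp/api-test.cache
```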

Week 3: Resource Exhaustion

Starve a container of CPU or memory:

# Limit API container to 10% CPU
docker update --cpus="0.1" api-server

# Run load test (hey is a simple HTTP load generator; any load tool works)
hey -z 60s -c 50 http://localhost/api/endpoint

# Check: Did the service degrade gracefully?
# Check: Did the load balancer shift traffic?
# Check: Did monitoring alert?

# Restore
docker update --cpus="2.0" api-server

Week 4: Backup and Recovery

The most important experiment most teams skip:

# Test: Can you actually restore from backup?
# 1. Take a fresh backup
docker exec postgres pg_dump -U app -Fc production > /tmp/backup.dump

# 2. Create a test database
docker exec postgres createdb -U app production_restore_test

# 3. Restore the backup
docker exec -i postgres pg_restore -U app -d production_restore_test < /tmp/backup.dump

# 4. Verify data integrity
docker exec postgres psql -U app -d production_restore_test \
  -c "SELECT COUNT(*) FROM users; SELECT COUNT(*) FROM orders;"

# 5. Clean up
docker exec postgres dropdb -U app production_restore_test

If this fails, your backups are not backups — they are false confidence.
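Restores prove correctness; a freshness check catches the cron job that silently stopped running. A small sketch you could run alongside the restore test, assuming GNU coreutils (`stat -c`) and a hypothetical backup path:

```shell
#!/bin/bash
# Fail if the backup file is missing, empty, or older than the expected cadence.
check_backup() {
    local file=$1 max_age_hours=${2:-26}   # 26h default: daily backups plus slack
    if [ ! -s "$file" ]; then
        echo "FAIL: $file is missing or empty"
        return 1
    fi
    local age_hours=$(( ($(date +%s) - $(stat -c %Y "$file")) / 3600 ))
    if [ "$age_hours" -gt "$max_age_hours" ]; then
        echo "FAIL: $file is ${age_hours}h old"
        return 1
    fi
    echo "OK: $file is ${age_hours}h old"
}

# Example (hypothetical path): check_backup /backups/production-latest.dump
```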

The Chaos Engineering Checklist

After each experiment, document:

## Chaos Experiment: [Name]
**Date:** YYYY-MM-DD
**Target:** [Service/Component]
**Hypothesis:** We expected [X] to happen when [Y] failed

### What Actually Happened
- [Observation 1]
- [Observation 2]

### Surprises
- [Unexpected behavior]

### Action Items
| Fix | Owner | Deadline |
|-----|-------|----------|
| Add health check to X | @dev | Next sprint |
| Fix retry logic in Y | @dev | This week |

### Follow-up Experiment
[What to test next based on findings]

Automation: Scheduled Chaos

Once you have confidence in your manual experiments, automate them:

# docker-compose.chaos.yml
services:
  chaos-runner:
    image: alpine:latest
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - ./chaos-scripts:/scripts
    entrypoint: /bin/sh
    command: >
      -c "apk add --no-cache docker-cli curl &&
          /scripts/weekly-chaos.sh"
    profiles:
      - chaos

Run weekly with a cron job:

# Every Monday at 10 AM (when the team is awake and can respond)
0 10 * * 1 cd /app && docker compose --profile chaos run --rm chaos-runner

Never run chaos experiments on Friday afternoons or outside business hours. The goal is to learn, not to create real incidents.
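The compose file mounts `./chaos-scripts`, but `weekly-chaos.sh` itself is up to you. A minimal sketch of what it might contain; the service name, health URL, and wait time are assumptions, and a real script would end with `run_weekly_chaos "$@"`:

```shell
#!/bin/sh
# weekly-chaos.sh -- kill one service, wait, and check that it came back.
# RECOVERY_WAIT is tunable so a dry run does not block for 30 seconds.
run_weekly_chaos() {
    SERVICE=${1:-api-server}
    echo "=== Weekly chaos: killing $SERVICE at $(date -u) ==="
    docker kill "$SERVICE" || return 1
    sleep "${RECOVERY_WAIT:-30}"
    CODE=$(curl -s -o /dev/null -w "%{http_code}" http://localhost/health)
    echo "Post-chaos health: $CODE"
    if [ "$CODE" != "200" ]; then
        echo "ALERT: $SERVICE did not recover" >&2
        return 1
    fi
}

# Invoked by the chaos-runner container:
#   run_weekly_chaos api-server
```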

What Good Looks Like

After 3-6 months of monthly chaos experiments, you should see:

  1. Faster recovery times — you have practiced restoring services so many times it becomes routine
  2. Better monitoring — every experiment reveals gaps in alerting
  3. Confidence in deployments — you know what happens when things fail because you have seen it
  4. Documented runbooks — each experiment produces documentation about how services behave under failure
  5. Fewer 3 AM surprises — the failures that used to wake you up are now handled automatically

The Bottom Line

Chaos engineering is not a tool or a platform. It is a practice: break things on purpose, observe what happens, fix the gaps, repeat.

For small teams, the experiments are simple: kill a container, break a dependency, fill a disk, restore a backup. Each one takes 30 minutes and reveals gaps that would otherwise become 3 AM incidents.

Start this month. Pick one service. Kill it. Watch what happens. Fix what breaks. You will learn more in 30 minutes than in 30 hours of reading runbooks.

#chaos-engineering #resilience #docker #sre #fault-injection #reliability #devops
