
How We Built Self-Healing Infrastructure With 90+ Docker Containers

How we run 90+ Docker containers on a single host with self-healing: docker-autoheal, tuned healthchecks, real resource limits, and Prometheus monitoring.

TechSaaS Team
10 min read

At TechSaaS, we run 90+ Docker containers on a single host. No Kubernetes. No Swarm. Just Docker Compose, a healthcheck on every service, and an autoheal container that fixes problems before we wake up.

This isn't a theoretical architecture guide. Every config in this article is pulled directly from our production docker-compose.yml. Every metric is from our live monitoring stack. We'll show you what works, what broke, and the gotchas we discovered the hard way.

Why Docker Compose at This Scale

The default advice for running dozens of containers is "use Kubernetes." For our workload profile, that's the wrong answer.

Our constraints: single host, 14 GB RAM, mostly long-running services (databases, web apps, monitoring tools, dev infrastructure). We don't need multi-node scheduling, rolling deployments across clusters, or service mesh complexity. We need containers that start, stay healthy, and restart when they break.

Docker Compose gives us a single docker-compose.yml that defines all 90 services, their networks, volumes, resource limits, and healthchecks. One file, one docker compose up -d, and the entire stack is running.

Kubernetes solves real problems — multi-node orchestration, horizontal scaling, complex networking. We don't have those problems. If you're running everything on one host with predictable workloads, Compose is the right tool. We cover the full setup in our Docker Compose for production guide.

The honest trade-off: we rely heavily on swap. Our host has 14 GB RAM and 88 GB of swap space. At any given moment, about 19 GB of swap is in use. This means some containers are partially swapped out — fine for rarely-accessed services like documentation or analytics, but it causes latency spikes when those services suddenly need to respond. Swap lets us pack 90 containers into 14 GB, but it's not free.

Health Checks That Actually Detect Failures

A healthcheck is a command Docker runs periodically inside the container. If it fails repeatedly, the container is marked unhealthy. Without healthchecks, Docker only knows if the process is running — not if it's actually working.
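As a reference, the anatomy of a generic Compose healthcheck looks like this (the service name and endpoint are placeholders):

```yaml
myservice:
  healthcheck:
    test: ["CMD", "curl", "-f", "http://127.0.0.1:8080/health"]  # non-zero exit = check failed
    interval: 30s      # how often Docker runs the check
    timeout: 10s       # kill the check if it hangs longer than this
    retries: 3         # consecutive failures before the container is marked unhealthy
    start_period: 30s  # boot grace window; failures here don't count toward retries
```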

We use different healthcheck patterns depending on the service type.

Databases: Native CLI Tools, Fast Intervals

postgres:
  healthcheck:
    test: ["CMD-SHELL", "pg_isready -U padc_admin -d padc"]
    interval: 10s
    timeout: 5s
    retries: 5

redis:
  healthcheck:
    test: ["CMD", "redis-cli", "ping"]
    interval: 10s
    timeout: 5s
    retries: 5

mongo:
  healthcheck:
    test: ["CMD", "mongosh", "--eval", "db.adminCommand('ping')"]
    interval: 10s
    timeout: 5s
    retries: 5

rabbitmq:
  healthcheck:
    test: ["CMD", "rabbitmq-diagnostics", "-q", "ping"]
    interval: 30s
    timeout: 10s
    retries: 5

Databases boot fast (2-3 seconds), so they don't need a start_period. We use the native CLI tools already inside the container — pg_isready, redis-cli, mongosh. No external dependencies.

HTTP Services: Curl or Wget, Longer Start Period

metabase:
  healthcheck:
    test: ["CMD", "curl", "-f", "http://127.0.0.1:3000/api/health"]
    interval: 30s
    timeout: 10s
    retries: 5
    start_period: 60s

crowdsec:
  healthcheck:
    test: ["CMD", "wget", "--spider", "--quiet", "http://127.0.0.1:8080/health"]
    interval: 30s
    timeout: 10s
    retries: 3
    start_period: 30s

JVM-based apps like Metabase need a generous start_period (60s) because they're slow to initialize. During start_period, healthcheck failures don't count toward retries — the container stays in starting state.

Gotcha: a container using depends_on: condition: service_healthy won't start until its dependency reports healthy, which in the worst case means the dependency's full start_period plus several retries. If Metabase has start_period: 60s, anything depending on it can wait 60+ seconds on boot. Tune start_period per service: databases get 0s, heavy apps get 30-60s.
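The dependency pattern looks like this (service names follow the examples above):

```yaml
app:
  depends_on:
    metabase:
      condition: service_healthy  # blocks app startup until metabase's healthcheck passes
    postgres:
      condition: service_healthy  # databases report healthy within seconds
```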

Infrastructure: Native Diagnostic Commands

cloudflared:
  healthcheck:
    test: ["CMD", "cloudflared", "tunnel", "ready"]
    interval: 30s
    timeout: 10s
    retries: 3
    start_period: 15s

Infrastructure services like Cloudflare Tunnel have their own health commands. Use them instead of generic HTTP checks.

When Healthchecks Lie

Here's a real incident: our Postgres container passed its healthcheck (pg_isready returned 0) but was stuck in recovery mode after an OOM kill. It was accepting connections but rejecting all queries. The container showed healthy because pg_isready only verifies the connection socket — it doesn't execute a query.

Lesson: healthchecks verify liveness ("process is running") not readiness ("service is functional"). We now run pg_isready for Docker healthchecks and a separate SELECT 1 query via Prometheus postgres_exporter for true readiness monitoring. Healthchecks are necessary but not sufficient.
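If you want the Docker healthcheck itself to execute a query, a stricter variant is possible. This is a sketch, not our production config, reusing the same user and database as the pg_isready example above:

```yaml
postgres:
  healthcheck:
    # psql exits non-zero if the query can't run, which is closer to
    # "service is functional" than a socket-level pg_isready check
    test: ["CMD-SHELL", "psql -U padc_admin -d padc -c 'SELECT 1' || exit 1"]
    interval: 10s
    timeout: 5s
    retries: 5
```

The trade-off is that a query-based check does real work on every interval, so keep the query trivial.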

Auto-Healing With docker-autoheal

Here's the gap that catches everyone: Docker's restart policy does NOT restart unhealthy containers.

restart: always and restart: unless-stopped only trigger on process exit (non-zero exit code or OOM kill). A container stuck in unhealthy state with the main process still running will sit there indefinitely. This is moby/moby#22719 — open since 2016, still not resolved.

Docker Swarm services DO respect health status for rescheduling. Standalone Docker does not. For standalone Compose, you need docker-autoheal.

Our production config:

autoheal:
  image: willfarrell/autoheal:latest
  container_name: autoheal
  restart: unless-stopped
  environment:
    - AUTOHEAL_CONTAINER_LABEL=all
    - AUTOHEAL_INTERVAL=30
    - AUTOHEAL_START_PERIOD=60
    - AUTOHEAL_DEFAULT_STOP_TIMEOUT=10
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
  mem_limit: 32m
  cpus: 0.1

AUTOHEAL_CONTAINER_LABEL=all monitors every container with a healthcheck — no per-container labels needed. We set AUTOHEAL_INTERVAL=30 (default is 5s) to reduce Docker API chatter across 90 containers. AUTOHEAL_START_PERIOD=60 prevents false positives during host reboot when everything starts at once.
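If monitoring everything is too broad, autoheal also supports an opt-in mode: point AUTOHEAL_CONTAINER_LABEL at a label name and tag only the services you want restarted. A sketch:

```yaml
autoheal:
  environment:
    - AUTOHEAL_CONTAINER_LABEL=autoheal  # only watch containers carrying this label

metabase:
  labels:
    - autoheal=true  # opt this service in
```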

The restart loop problem: if a container is fundamentally broken (bad config, missing env var), autoheal will restart it every 30 seconds forever. There's no built-in circuit breaker. We catch this with a Prometheus alert: changes(container_restart_count[1h]) > 5 — if a container restarts more than 5 times in an hour, something is wrong beyond what a restart can fix.

Security consideration: autoheal mounts /var/run/docker.sock, giving it full Docker API access. A compromise of the autoheal container means full host control. We accept this trade-off because autoheal needs restart capability, but it's a risk to be aware of. For hardening guidance, see our Docker container security best practices.

Resource Limits: The Compose Syntax Everyone Gets Wrong

Critical: the deploy.resources block that many guides recommend is silently ignored by docker compose up. It only works with docker stack deploy (Swarm mode). For standalone Compose, resource limits go directly on the service:

services:
  myapp:
    image: myapp:latest
    mem_limit: 256m      # hard RAM cap
    memswap_limit: 512m  # RAM + swap combined, so up to 256m of swap here
    cpus: 0.5            # fraction of a CPU core

We have memory limits set on 84 of our 90 containers. The remaining 6 are recently-deployed static sites that still need limits added.

Here are real numbers from our stack:

Container        | Type          | Mem Limit | Actual Usage | Utilization
company-website  | nginx static  | 32 MB     | 7.4 MB       | 23%
umami            | analytics     | 256 MB    | 53.5 MB      | 21%
contact-api      | Node.js API   | 64 MB     | 4.6 MB       | 7%
n8n              | automation    | 1 GB      | 34.9 MB      | 3%
autoheal         | monitor       | 32 MB     | 5.1 MB       | 16%
cloudflared      | tunnel        | 128 MB    | 17.8 MB      | 14%
node-exporter    | metrics       | 64 MB     | ~20 MB       | 31%
grafana          | dashboards    | 256 MB    | ~120 MB      | 47%

Our sizing rule: set mem_limit at 2-4x the observed peak usage. Nginx static sites get 16-32 MB. Small Node.js APIs get 64-256 MB. Databases get 512 MB. Heavy applications (Metabase, n8n, Prometheus) get 1 GB.
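The 2-4x rule can be mechanized. This hypothetical helper (suggest_limit is our name, not a Docker command) rounds an observed peak in MB up to the next power-of-two bucket that is at least 2x the peak:

```shell
# suggest_limit: given observed peak memory in MB, print a mem_limit value
# that is a power-of-two bucket >= 2x the peak (minimum 16m).
suggest_limit() {
  local peak=$1
  local limit=16
  while [ "$limit" -lt $(( peak * 2 )) ]; do
    limit=$(( limit * 2 ))
  done
  echo "${limit}m"
}

suggest_limit 7    # nginx static site peaking at ~7 MB -> 16m
suggest_limit 54   # umami peaking at ~54 MB -> 128m
```

We'd still sanity-check the suggestion against a week of cAdvisor data before committing it, since a one-off `docker stats` sample can miss the real peak.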

We don't set pids_limit on any container. For most workloads it's unnecessary — fork bombs are a real threat in multi-tenant environments, but in a single-operator stack the risk is low. We'd add it if we ever ran untrusted workloads.

Monitoring the Healers: Prometheus + Loki + Uptime Kuma

Self-healing is only as good as your ability to observe it. Our monitoring stack:

  • Prometheus (mem_limit: 1 GB) — scrapes metrics from every container via cAdvisor and node-exporter
  • Grafana (mem_limit: 256 MB) — dashboards for container resource usage, restart counts, and health status
  • Loki + Promtail — centralized log aggregation from all 90 containers
  • Uptime Kuma — external HTTP checks against our public-facing services
  • cAdvisor — per-container CPU, memory, network, and disk I/O metrics
  • node-exporter (mem_limit: 64 MB) — host-level metrics (CPU, RAM, disk, swap)
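A minimal Compose sketch for the two metric sources (the images and mounts below are the commonly documented ones, not our exact production config; pin versions rather than using latest):

```yaml
cadvisor:
  image: gcr.io/cadvisor/cadvisor:latest
  volumes:
    - /:/rootfs:ro
    - /var/run:/var/run:ro
    - /sys:/sys:ro
    - /var/lib/docker:/var/lib/docker:ro
  mem_limit: 128m

node-exporter:
  image: prom/node-exporter:latest
  pid: host  # so it can see host-level process stats
  volumes:
    - /proc:/host/proc:ro
    - /sys:/host/sys:ro
  command:
    - --path.procfs=/host/proc
    - --path.sysfs=/host/sys
  mem_limit: 64m
```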

The alerts that matter:

  • Container restart loop: changes(container_restart_count[1h]) > 5
  • Memory approaching limit: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9
  • Swap usage growing: node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes < 0.7
  • Host disk filling: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.15
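Wired into a Prometheus rules file, the first two alerts look roughly like this (the rule names, severities, and for: durations are our illustrative choices):

```yaml
groups:
  - name: container-health
    rules:
      - alert: ContainerRestartLoop
        expr: changes(container_restart_count[1h]) > 5
        labels:
          severity: critical
        annotations:
          summary: "Container restarting beyond what autoheal can fix"
      - alert: ContainerNearMemoryLimit
        expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9
        for: 5m  # ignore brief spikes
        labels:
          severity: warning
        annotations:
          summary: "Container above 90% of its mem_limit"
```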

The combination of autoheal (fixes problems) + Prometheus (detects patterns) + Loki (explains why) gives us confidence that problems are either auto-resolved or immediately visible. For a deep dive into our monitoring setup, see how we monitor 90+ Docker containers with Prometheus, Grafana, and Loki.

restart: unless-stopped vs restart: always

Our stack mostly uses unless-stopped. The difference matters:

  • unless-stopped: restarts on crash, restarts after daemon restart, but does not restart containers you manually stopped with docker stop. This lets us take services down for maintenance without fighting the restart policy.
  • always: restarts in all cases, including after manual stop + daemon restart. We use this only for stateless containers that must never be down (some nginx frontends).

Use unless-stopped as your default. Use always only for services where any downtime is unacceptable and you'll never need to manually stop them.

What We'd Do Differently

  1. Set pids_limit on all containers from day one. We haven't needed it, but it's a cheap safety net we should have.
  2. Better healthcheck queries for databases. pg_isready should be supplemented with SELECT 1 to catch the recovery-mode-but-accepting-connections failure mode.
  3. Monitor swap per-container. We monitor total host swap, but we can't easily see which containers are swapped out. This makes it hard to optimize the right services.
  4. Add alerting earlier. We ran for months with autoheal quietly restarting things before we added Prometheus alerts for restart counts. We were flying blind on failure patterns.

The Stack at a Glance

90+ containers. 14 GB RAM, with roughly 19 GB of the 88 GB swap space in use. Healthchecks on every service. Autoheal monitoring all of them. Resource limits on 84 containers. Prometheus + Grafana + Loki for observability. Uptime Kuma for external verification.

No Kubernetes. No Swarm. No managed services. Just Docker Compose, good defaults, and honest monitoring.

Your containers are going to fail. The question is whether you find out at 3 AM or whether your infrastructure fixes itself and tells you about it in the morning.

