Self-Hosting 90+ Containers on a Single Server: Inside the PADC Infrastructure

How I run 90+ Docker containers on a single server for my Personal Autonomous Data Center — covering Docker Compose management, monitoring with...

TechSaaS Team · 14 min read

Why 90+ Containers on One Server?

My Personal Autonomous Data Center (PADC) runs 90+ Docker containers on a single server. Not a beefy cloud instance — an actual physical machine with 14 GB of RAM.

[Figure] Server infrastructure: production and staging environments (web server, app server, database) connected via VLAN, with backup storage following the 3-2-1 rule.

It hosts everything: Gitea (Git hosting), Directus (CMS), n8n (workflow automation), Postiz (social media scheduler), Grafana + Prometheus + Loki (monitoring), Traefik (reverse proxy), PostgreSQL, Redis, FalkorDB, multiple web applications, AI tools, and dozens of utility services.

People ask why not use Kubernetes, or split across multiple servers, or just use cloud services. The answer: I wanted to understand infrastructure deeply. Running everything on constrained hardware forces you to actually care about resource efficiency. When you have 256 GB of RAM and 64 cores, you can afford to be lazy. When you have 14 GB, every container's memory footprint matters.

This is how I manage it.

The Docker Compose Architecture

One Compose File to Rule Them All

Everything runs from a single docker-compose.yml. At 90+ services, this file is around 3,000 lines. Some people split into multiple compose files — I tried that and found the dependency management nightmare worse than having one large file.

# /mnt/projects/infra/docker-compose.yml (abbreviated)
version: '3.8'

services:
  # === CORE INFRASTRUCTURE ===
  traefik:
    image: traefik:v3.0
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./traefik:/etc/traefik
      - ./acme:/acme
    networks:
      - web
    deploy:
      resources:
        limits:
          memory: 256M
        reservations:
          memory: 64M

  postgres:
    image: postgres:16-alpine
    restart: unless-stopped
    environment:
      POSTGRES_PASSWORD: ${PG_PASSWORD}
    volumes:
      - pg-data:/var/lib/postgresql/data
      - ./init-scripts:/docker-entrypoint-initdb.d
    networks:
      - backend
    deploy:
      resources:
        limits:
          memory: 1G
        reservations:
          memory: 256M
    command: >
      postgres
        -c shared_buffers=256MB
        -c effective_cache_size=512MB
        -c work_mem=4MB
        -c maintenance_work_mem=64MB
        -c max_connections=200
        -c random_page_cost=1.1

  redis:
    image: redis:7-alpine
    restart: unless-stopped
    command: redis-server --maxmemory 128mb --maxmemory-policy allkeys-lru
    networks:
      - backend
    deploy:
      resources:
        limits:
          memory: 192M

  # === APPLICATIONS (60+ services) ===
  directus:
    image: directus/directus:10
    restart: unless-stopped
    depends_on:
      - postgres
      - redis
    environment:
      DB_CLIENT: pg
      DB_HOST: postgres
      DB_DATABASE: directus
      CACHE_ENABLED: 'true'
      CACHE_STORE: redis
      CACHE_REDIS_HOST: redis
    networks:
      - backend
      - web
    labels:
      - traefik.enable=true
      - traefik.http.routers.directus.rule=Host(`cms.techsaas.cloud`)
    deploy:
      resources:
        limits:
          memory: 512M

  # ... 85+ more services
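
Day to day, a 3,000-line file is less painful than it sounds, because Compose can target individual services. A few standard Compose CLI commands I run constantly:

# Validate the whole file before touching anything
docker compose config --quiet && echo "compose file OK"

# Recreate a single service without disturbing the other 89
docker compose up -d --force-recreate directus

# Tail one service's logs
docker compose logs -f --tail=100 traefik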

The Shared Database Pattern

Instead of running a PostgreSQL instance per application (the "microservices" way), most services share a single PostgreSQL instance with separate databases:

# init-scripts/00-create-databases.sql
CREATE DATABASE directus;
CREATE DATABASE gitea;
CREATE DATABASE n8n;
CREATE DATABASE postiz;
CREATE DATABASE keycloak;
CREATE DATABASE bookstack;
-- ... 15 more databases

-- Each app gets its own user with access to only its database
CREATE USER directus_app WITH PASSWORD '...';
GRANT ALL PRIVILEGES ON DATABASE directus TO directus_app;

This saves ~200MB per application that would have run its own PostgreSQL. With 15+ apps using PostgreSQL, that's 3 GB saved.
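
One gotcha worth flagging: since PostgreSQL 15, ordinary users no longer get CREATE on the public schema, so the database-level grant above is not enough by itself. Each app user also needs a schema grant inside its own database, for example:

-- Needed on PostgreSQL 15+: a database-level GRANT does not cover the public schema
\connect directus
GRANT ALL ON SCHEMA public TO directus_app;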

Network Isolation

networks:
  web:        # Public-facing services (Traefik frontend)
  backend:    # Database + cache layer
  monitoring: # Prometheus, Grafana, Loki, Promtail
  ai:         # AI services (LLM, embedding, etc.)

Services join only the networks they need. The CMS joins web (for Traefik routing) and backend (for database access). Prometheus joins monitoring and backend (to scrape database metrics). No service has access to networks it doesn't need.
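
You can push the isolation one step further by marking the database network as internal, which removes its route to the outside world entirely. A minimal sketch (services that need egress, like the CMS, still get it through web):

# docker-compose.yml network definitions (sketch)
networks:
  web:
    driver: bridge
  backend:
    driver: bridge
    internal: true   # no route out; databases never talk to the internet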

The Monitoring Stack

With 90+ containers, monitoring isn't optional. Here's the stack:

┌─────────────────────────────────────────────────┐
│              Grafana (dashboards)                │
│     CPU/Memory │ Logs │ Alerts │ Container       │
├─────────────┬───────────────┬───────────────────┤
│  Prometheus │     Loki      │    Alertmanager   │
│  (metrics)  │    (logs)     │    (notifications)│
├─────────────┼───────────────┤                   │
│  cAdvisor   │   Promtail    │                   │
│  node_exp   │ (log shipper) │                   │
└─────────────┴───────────────┴───────────────────┘

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 30s     # 30s instead of the usual 15s; fewer samples, less memory
  evaluation_interval: 30s
  scrape_timeout: 10s

scrape_configs:
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
    # Only collect container metrics we actually use
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'container_(cpu_usage_seconds_total|memory_usage_bytes|memory_working_set_bytes|network_.*_bytes_total|fs_usage_bytes|spec_memory_limit_bytes|last_seen)'
        action: keep

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'traefik'
    static_configs:
      - targets: ['traefik:8082']

  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']

The metric_relabel_configs block is critical. cAdvisor exposes hundreds of metrics per container; with 90 containers, that's tens of thousands of time series. We keep only the metrics we actually dashboard and alert on: CPU, memory, network, and disk. One caveat: the keep regex must include every metric your alert rules reference (container_last_seen and container_spec_memory_limit_bytes in the rules below), or the alerts silently stop firing. This cuts Prometheus memory usage by roughly 60%.
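
Whether the filtering is working is easy to check from Prometheus itself; two queries tell you where the cardinality lives:

# Total series currently in the TSDB head
prometheus_tsdb_head_series

# Top 10 metric names by series count (candidates for more keep/drop rules)
topk(10, count by (__name__) ({__name__=~".+"}))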

Loki for Logs

# loki-config.yaml
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
  chunk_idle_period: 5m
  chunk_retain_period: 30s

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/cache
  filesystem:
    directory: /loki/chunks

limits_config:
  retention_period: 168h   # 7 days — disk is limited
  max_query_series: 5000
  ingestion_rate_mb: 4
  ingestion_burst_size_mb: 8

compactor:
  working_directory: /loki/compactor
  retention_enabled: true

Seven-day log retention keeps disk usage manageable. For longer retention, I ship critical logs to S3-compatible storage (MinIO, also running in a container).
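
For querying, Grafana's Explore view speaks LogQL. Two queries I reach for constantly (the container label comes from Promtail's Docker relabeling, so adjust to your pipeline):

# Tail one service's errors
{container="directus"} |= "error"

# Error rate per container over 5 minutes
sum by (container) (rate({container=~".+"} |= "error" [5m]))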

Key Alerts

# alerting-rules.yml
groups:
  - name: container-health
    rules:
      - alert: ContainerDown
        expr: absent(container_last_seen{name=~".+"}) or (time() - container_last_seen{name=~".+"} > 300)
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} is down"

      - alert: HighMemoryUsage
        expr: |
          (container_memory_working_set_bytes{name=~".+"}
          / on(name) container_spec_memory_limit_bytes{name=~".+"}) > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.name }} using {{ $value | humanizePercentage }} of memory limit"

      - alert: HostMemoryPressure
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Host has less than 10% memory available"

      - alert: SwapGrowing
        expr: deriv(node_memory_SwapFree_bytes[1h]) < 0 and node_memory_SwapFree_bytes < (node_memory_SwapTotal_bytes * 0.3)
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Swap usage above 70% — system under memory pressure"

Alerts go to ntfy (self-hosted notification service — also running in a container), which pushes to my phone.
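
On the Alertmanager side this is a plain webhook receiver. ntfy does not understand Alertmanager's JSON payload natively, so the URL below assumes a small bridge container between the two (several community bridges exist); treat this as a sketch:

# alertmanager.yml (sketch; the bridge hostname is an assumption)
route:
  receiver: ntfy
receivers:
  - name: ntfy
    webhook_configs:
      - url: http://alertmanager-ntfy:8080/hook
        send_resolved: true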

Resource Optimization Strategies

[Figure] Cloud to self-hosted migration: from roughly $5,000/mo in the cloud to about $200/mo on bare metal with Docker + LXC, a 96% cost reduction while maintaining full control.

Strategy 1: Alpine and Slim Base Images

Every container uses the smallest viable base image:

# Good: Alpine variants save 100-500MB per container
postgres: postgres:16-alpine       # 80MB vs 380MB
redis: redis:7-alpine               # 30MB vs 130MB
node-app: node:20-alpine            # 50MB vs 350MB

# Good: Slim variants for Debian-based
python-app: python:3.12-slim        # 130MB vs 900MB

Across 90 containers, using slim/alpine images instead of full images saves roughly 15-20 GB of disk space and reduces memory overhead from shared library loading.
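
A quick way to keep yourself honest about image bloat:

# Which images are eating the most disk?
docker image ls --format '{{.Size}}\t{{.Repository}}:{{.Tag}}' | sort -rh | head -20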

Strategy 2: Memory Limits on Everything

Every container has explicit memory limits:

services:
  # Stateless web UIs — minimal memory
  excalidraw:
    deploy:
      resources:
        limits:
          memory: 64M

  it-tools:
    deploy:
      resources:
        limits:
          memory: 64M

  # Application servers — moderate memory
  n8n:
    deploy:
      resources:
        limits:
          memory: 512M

  gitea:
    deploy:
      resources:
        limits:
          memory: 384M

  # Databases — controlled allocation
  postgres:
    deploy:
      resources:
        limits:
          memory: 1G

  # Monitoring — Prometheus is the hungriest
  prometheus:
    deploy:
      resources:
        limits:
          memory: 768M

Without limits, a single misbehaving container can OOM-kill everything. With limits, the misbehaving container gets killed while everything else keeps running.
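
To see whether the limits are actually biting, Docker records OOM kills in container state:

# Which containers have been OOM-killed since their last (re)start?
docker ps -aq | xargs docker inspect \
  -f '{{.Name}} OOMKilled={{.State.OOMKilled}}' | grep 'OOMKilled=true'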

Strategy 3: Restart Policies

Different restart policies for different service types:

# Stateless services: always restart (even on exit code 0)
excalidraw:
  restart: always   # Nginx-based, no state, just restart it

# Stateful services: unless-stopped (respect manual stops)
postgres:
  restart: unless-stopped

# Batch jobs: no restart (run once and done)
backup-worker:
  restart: "no"

I learned the hard way that on-failure does NOT restart a container that exits with code 0, and that unless-stopped won't bring back a container that was in the stopped state when the daemon restarted. For stateless nginx containers that occasionally exit cleanly due to config reload edge cases, always is the right policy.
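
Auditing which policy each container actually ended up with is a one-liner:

# List the effective restart policy per running container
docker ps -q | xargs docker inspect \
  -f '{{.Name}}: {{.HostConfig.RestartPolicy.Name}}'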

Strategy 4: Shared PostgreSQL with Tuned Settings

# PostgreSQL tuning for constrained memory
postgres \
  -c shared_buffers=256MB \         # 25% of postgres memory limit
  -c effective_cache_size=512MB \    # Total expected cache
  -c work_mem=4MB \                  # Per-sort operation (low!)
  -c maintenance_work_mem=64MB \     # For VACUUM, INDEX builds
  -c max_connections=200 \           # 15+ apps × ~10 connections, plus headroom
  -c random_page_cost=1.1 \          # SSD storage
  -c wal_buffers=8MB \               # Write-ahead log buffer
  -c checkpoint_completion_target=0.9

work_mem=4MB is the PostgreSQL default, kept deliberately low. With 200 connections, the worst case is 200 × 4 MB = 800 MB just for sort operations. The 64 MB that some tuning guides recommend would risk 200 × 64 MB = 12.8 GB, nearly all available RAM.
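
Before trusting that math, it's worth looking at how the connection slots are actually used. pg_stat_activity shows per-database, per-state counts and makes the case for pooling obvious:

-- Connection usage per database and state
SELECT datname, state, count(*)
FROM pg_stat_activity
GROUP BY datname, state
ORDER BY count(*) DESC;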

Strategy 5: Swap as Safety Net

# 8GB swap file
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Tuning: prefer RAM, use swap only under pressure
vm.swappiness=10
vm.vfs_cache_pressure=50

Swap is not a substitute for RAM. It's a safety net that prevents OOM kills during traffic spikes. With swappiness=10, the kernel strongly prefers RAM and only swaps under real pressure.

On a typical day, swap usage sits at 2-4 GB. During peak loads (all services active, multiple builds running, monitoring ingesting), it can climb toward the full 8 GB. The system remains responsive because only inactive memory pages get swapped.
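
The sysctl values above do not survive a reboot on their own. Persisting them is two commands (the file name is my convention; anything in /etc/sysctl.d works):

# Persist swap tuning across reboots
printf 'vm.swappiness=10\nvm.vfs_cache_pressure=50\n' | sudo tee /etc/sysctl.d/99-padc.conf
sudo sysctl --system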

Traefik: The Single Entry Point

All HTTP traffic enters through Traefik, which handles SSL termination, routing, and load balancing:

# traefik.yml
entryPoints:
  web:
    address: ":80"
    http:
      redirections:
        entryPoint:
          to: websecure
          scheme: https
  websecure:
    address: ":443"
    http:
      tls:
        certResolver: letsencrypt

certificatesResolvers:
  letsencrypt:
    acme:
      email: [email protected]
      storage: /acme/acme.json
      httpChallenge:
        entryPoint: web

providers:
  docker:
    exposedByDefault: false
    network: web

Services register themselves via Docker labels:

# Any service becomes publicly accessible with 3 labels
my-app:
  labels:
    - traefik.enable=true
    - traefik.http.routers.my-app.rule=Host(`app.techsaas.cloud`)
    - traefik.http.routers.my-app.tls.certresolver=letsencrypt

Traefik automatically discovers services, obtains Let's Encrypt certificates, and routes traffic. Adding a new public service takes 30 seconds.
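
The same label mechanism covers hardening. Rate limiting, for example, is a middleware declared and attached with two more labels (the middleware name here is arbitrary):

# Optional middleware, attached via labels (sketch)
my-app:
  labels:
    - traefik.enable=true
    - traefik.http.routers.my-app.rule=Host(`app.techsaas.cloud`)
    - traefik.http.middlewares.my-app-ratelimit.ratelimit.average=50
    - traefik.http.middlewares.my-app-ratelimit.ratelimit.burst=100
    - traefik.http.routers.my-app.middlewares=my-app-ratelimit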

For services that shouldn't be public, I use Cloudflare Tunnels:

cloudflared:
  image: cloudflare/cloudflared:latest
  restart: unless-stopped
  command: tunnel run
  environment:
    TUNNEL_TOKEN: ${CF_TUNNEL_TOKEN}
  networks:
    - web

This exposes internal services through Cloudflare's network without opening any inbound ports. Grafana, Gitea, and admin panels are accessible through the tunnel with Cloudflare Access providing authentication.

Backup Strategy

#!/bin/bash
# backup.sh — runs daily at 3 AM via cron
set -euo pipefail

BACKUP_DIR=/mnt/backups/$(date +%Y-%m-%d)
mkdir -p "$BACKUP_DIR"

# PostgreSQL: dump all databases
docker exec postgres pg_dumpall -U postgres | gzip > "$BACKUP_DIR/postgres.sql.gz"

# Docker volumes: selective backup
for vol in gitea-data directus-uploads n8n-data bookstack-data; do
  docker run --rm -v "${vol}":/data -v "$BACKUP_DIR":/backup \
    alpine tar czf "/backup/${vol}.tar.gz" -C /data .
done

# Configuration files
tar czf "$BACKUP_DIR/config.tar.gz" \
  /mnt/projects/infra/docker-compose.yml \
  /mnt/projects/infra/traefik/ \
  /mnt/projects/infra/.env

# Retention: delete local backup directories older than 30 days
find /mnt/backups -maxdepth 1 -mtime +30 -type d -exec rm -rf {} \;

# Upload to S3-compatible storage
rclone sync "$BACKUP_DIR" remote:padc-backups/$(date +%Y-%m-%d)/
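
A backup that has never been restored is a hope, not a backup. A periodic sanity check, sketched here with a throwaway container (expect some harmless role-already-exists errors, since pg_dumpall includes roles):

#!/bin/bash
# restore-test.sh — load the newest dump into a disposable PostgreSQL
set -euo pipefail
LATEST=$(ls -1d /mnt/backups/*/ | sort | tail -1)
docker run --rm -d --name pg-restore-test -e POSTGRES_PASSWORD=test postgres:16-alpine
sleep 15   # crude wait for postgres to accept connections
gunzip -c "${LATEST}postgres.sql.gz" | docker exec -i pg-restore-test psql -U postgres
docker exec pg-restore-test psql -U postgres -c '\l'   # eyeball the restored databases
docker rm -f pg-restore-test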

The Daily Reality

Here's what a typical day looks like resource-wise:

$ docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.CPUPerc}}"

NAME                MEM USAGE       CPU %
postgres            812MB / 1GB     2.3%
prometheus          624MB / 768MB   1.1%
loki                445MB / 512MB   0.8%
directus            387MB / 512MB   1.5%
postiz              356MB / 512MB   3.2%
n8n                 298MB / 512MB   0.9%
gitea               267MB / 384MB   0.4%
grafana             198MB / 256MB   0.3%
traefik             142MB / 256MB   0.5%
redis               98MB / 192MB    0.1%
... (80+ more containers between 20-150MB each)

Total memory: ~11-12 GB used out of 14 GB, with 2-4 GB in swap. CPU averages 15-25% utilization with spikes to 60-70% during builds or heavy API usage.

What I'd Do Differently

  1. Start with memory limits from day one. I added them retroactively after the first OOM incident. Some containers had been silently consuming 2 GB.

  2. Use Loki from the start, not ELK. I initially ran Elasticsearch for logs. It consumed 2 GB of RAM by itself. Loki does the same job in 400 MB.

  3. Invest in proper secret management earlier. I started with .env files. Moving to proper secrets management after 60+ services was painful.

  4. Don't run databases without connection pooling. PgBouncer should have been there from the start. Without it, idle connections from 15 applications consumed significant PostgreSQL memory. (A minimal PgBouncer sketch follows.)
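
If you put PgBouncer in front of the shared instance, the configuration is small. A minimal pgbouncer.ini sketch (pool sizes tuned to this box, not gospel):

; pgbouncer.ini — transaction pooling in front of the shared PostgreSQL
[databases]
* = host=postgres port=5432

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction        ; connection returns to the pool after each transaction
default_pool_size = 10         ; per user/database pair
max_client_conn = 400          ; clients can far exceed real server connections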

[Figure] Container orchestration distributes workloads across multiple nodes for resilience and scale: an orchestrator schedules replicated containers across Nodes 1-3.

The Bottom Line

Running 90+ containers on 14 GB of RAM is possible, educational, and occasionally stressful. The constraints force good engineering habits: memory limits, efficient base images, shared infrastructure, aggressive monitoring.

Is this the right architecture for a team? Probably not — you'd want Kubernetes for multi-node scaling and proper high availability. But for a personal infrastructure that needs to run dozens of services reliably, Docker Compose on a single well-monitored server is surprisingly effective.

The key insight: constrained resources don't limit what you can build. They limit what you can waste.

#docker #self-hosting #devops #monitoring #infrastructure #homelab #containers
