Self-Hosting 90+ Containers on a Single Server: Inside the PADC Infrastructure
Why 90+ Containers on One Server?
My Personal Autonomous Data Center (PADC) runs 90+ Docker containers on a single server. Not a beefy cloud instance — an actual physical machine with 14 GB of RAM.
Server infrastructure: production and staging environments connected via VLAN with offsite backups.
It hosts everything: Gitea (Git hosting), Directus (CMS), n8n (workflow automation), Postiz (social media scheduler), Grafana + Prometheus + Loki (monitoring), Traefik (reverse proxy), PostgreSQL, Redis, FalkorDB, multiple web applications, AI tools, and dozens of utility services.
People ask why not use Kubernetes, or split across multiple servers, or just use cloud services. The answer: I wanted to understand infrastructure deeply. Running everything on constrained hardware forces you to actually care about resource efficiency. When you have 256 GB of RAM and 64 cores, you can afford to be lazy. When you have 14 GB, every container's memory footprint matters.
This is how I manage it.
The Docker Compose Architecture
One Compose File to Rule Them All
Everything runs from a single docker-compose.yml. At 90+ services, this file is around 3,000 lines. Some people split into multiple compose files — I tried that and found the dependency management nightmare worse than having one large file.
# /mnt/projects/infra/docker-compose.yml (abbreviated)
version: '3.8'

services:
  # === CORE INFRASTRUCTURE ===
  traefik:
    image: traefik:v3.0
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./traefik:/etc/traefik
      - ./acme:/acme
    networks:
      - web
    deploy:
      resources:
        limits:
          memory: 256M
        reservations:
          memory: 64M

  postgres:
    image: postgres:16-alpine
    restart: unless-stopped
    environment:
      POSTGRES_PASSWORD: ${PG_PASSWORD}
    volumes:
      - pg-data:/var/lib/postgresql/data
      - ./init-scripts:/docker-entrypoint-initdb.d
    networks:
      - backend
    deploy:
      resources:
        limits:
          memory: 1G
        reservations:
          memory: 256M
    command: >
      postgres
      -c shared_buffers=256MB
      -c effective_cache_size=512MB
      -c work_mem=4MB
      -c maintenance_work_mem=64MB
      -c max_connections=200
      -c random_page_cost=1.1

  redis:
    image: redis:7-alpine
    restart: unless-stopped
    command: redis-server --maxmemory 128mb --maxmemory-policy allkeys-lru
    networks:
      - backend
    deploy:
      resources:
        limits:
          memory: 192M

  # === APPLICATIONS (60+ services) ===
  directus:
    image: directus/directus:10
    restart: unless-stopped
    depends_on:
      - postgres
      - redis
    environment:
      DB_CLIENT: pg
      DB_HOST: postgres
      DB_DATABASE: directus
      CACHE_ENABLED: 'true'
      CACHE_STORE: redis
      CACHE_REDIS_HOST: redis
    networks:
      - backend
      - web
    labels:
      - traefik.enable=true
      - traefik.http.routers.directus.rule=Host(`cms.techsaas.cloud`)
    deploy:
      resources:
        limits:
          memory: 512M

  # ... 85+ more services
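At this size, grep and awk become the table of contents. A minimal sketch (assuming the two-space indentation used above; the sample file here is a stand-in for the real 3,000-line one) that lists every service name:

```shell
# List top-level service names: keys indented exactly two spaces
# under the "services:" section.
cat > /tmp/sample-compose.yml <<'EOF'
services:
  traefik:
    image: traefik:v3.0
  postgres:
    image: postgres:16-alpine
networks:
  web:
EOF

services=$(awk '
  /^services:/ { in_services=1; next }
  /^[^ ]/      { in_services=0 }            # next top-level key ends the section
  in_services && /^  [A-Za-z0-9_-]+:/ {
    gsub(/[: ]/, ""); print
  }' /tmp/sample-compose.yml)

echo "$services"
```

The same pattern extends to counting services or diffing the list against `docker ps` output.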
The Shared Database Pattern
Instead of running a PostgreSQL instance per application (the "microservices" way), most services share a single PostgreSQL instance with separate databases:
# init-scripts/00-create-databases.sql
CREATE DATABASE directus;
CREATE DATABASE gitea;
CREATE DATABASE n8n;
CREATE DATABASE postiz;
CREATE DATABASE keycloak;
CREATE DATABASE bookstack;
-- ... 15 more databases
-- Each app gets its own user with access to only its database
CREATE USER directus_app WITH PASSWORD '...';
GRANT ALL PRIVILEGES ON DATABASE directus TO directus_app;
This saves ~200MB per application that would have run its own PostgreSQL. With 15+ apps using PostgreSQL, that's 3 GB saved.
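Adding a new app means another database plus a scoped user, and hand-editing the init script invites drift. A sketch of a generator for the boilerplate; the app list and the CHANGE_ME placeholder are illustrative, not real credentials:

```shell
# Emit CREATE DATABASE / CREATE USER / GRANT statements for a list
# of apps. Substitute real secrets at deploy time (e.g. from an env file).
apps="directus gitea n8n"

sql=$(for app in $apps; do
  printf 'CREATE DATABASE %s;\n' "$app"
  printf "CREATE USER %s_app WITH PASSWORD 'CHANGE_ME';\n" "$app"
  printf 'GRANT ALL PRIVILEGES ON DATABASE %s TO %s_app;\n' "$app" "$app"
done)

echo "$sql"
```

One caveat: on PostgreSQL 15 and later, `GRANT ... ON DATABASE` alone no longer lets a user create tables in the `public` schema; you also need to grant the schema or make the user the database owner.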
Network Isolation
networks:
  web:        # Public-facing services (Traefik frontend)
  backend:    # Database + cache layer
  monitoring: # Prometheus, Grafana, Loki, Promtail
  ai:         # AI services (LLM, embedding, etc.)
Services join only the networks they need. The CMS joins web (for Traefik routing) and backend (for database access). Prometheus joins monitoring and backend (to scrape database metrics). No service has access to networks it doesn't need.
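As a concrete example, Prometheus's service definition joins only those two networks (a fragment in the style of the compose file above; image tag and comments are illustrative):

```yaml
  # fragment of docker-compose.yml
  prometheus:
    image: prom/prometheus:latest
    restart: unless-stopped
    networks:
      - monitoring   # queried by Grafana, scrapes the exporters
      - backend      # reaches postgres-exporter next to the database
    # no "web" network: Prometheus is never exposed through Traefik
```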
The Monitoring Stack
With 90+ containers, monitoring isn't optional. Here's the stack:
┌─────────────────────────────────────────────────┐
│ Grafana (dashboards) │
│ CPU/Memory │ Logs │ Alerts │ Container │
├─────────────┬───────────────┬───────────────────┤
│ Prometheus │ Loki │ Alertmanager │
│ (metrics) │ (logs) │ (notifications)│
├─────────────┼───────────────┤ │
│ cAdvisor │ Promtail │ │
│ node_exp │ (log shipper) │ │
└─────────────┴───────────────┴───────────────────┘
Prometheus Configuration
# prometheus.yml
global:
  scrape_interval: 30s      # 30s rather than the common 15s; saves memory
  evaluation_interval: 30s
  scrape_timeout: 10s

scrape_configs:
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
    # Only collect container metrics we actually use
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'container_(cpu_usage_seconds_total|memory_usage_bytes|memory_working_set_bytes|network_.*_bytes_total|fs_usage_bytes)'
        action: keep

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'traefik'
    static_configs:
      - targets: ['traefik:8082']

  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']
The metric_relabel_configs block is critical. cAdvisor exposes hundreds of metrics per container. With 90 containers, that's thousands of time series. We keep only the metrics we actually dashboard and alert on — CPU, memory, network, and disk. This cuts Prometheus memory usage by 60%.
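The savings are easy to sanity-check with back-of-envelope numbers. The constants below are assumptions for illustration (cAdvisor commonly emits on the order of a hundred series per container; a few kB of Prometheus memory per active series is a rough rule of thumb), not measurements from this server:

```shell
# Back-of-envelope Prometheus memory estimate.
# Assumptions: ~120 cAdvisor series per container before filtering,
# ~15 after the keep-list, ~3 kB per active series.
containers=90
bytes_per_series=3000

before=$((containers * 120 * bytes_per_series))   # all cAdvisor series kept
after=$((containers * 15 * bytes_per_series))     # only whitelisted series

echo "before: $((before / 1024 / 1024)) MiB of series memory"
echo "after:  $((after / 1024 / 1024)) MiB of series memory"
```

Real usage is higher (indexes, WAL, the other scrape jobs), but the ratio is what matters: dropping unused series scales memory down roughly linearly with series count.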
Loki for Logs
# loki-config.yaml
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
  chunk_idle_period: 5m
  chunk_retain_period: 30s

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/cache
  filesystem:
    directory: /loki/chunks

limits_config:
  retention_period: 168h   # 7 days — disk is limited
  max_query_series: 5000
  ingestion_rate_mb: 4
  ingestion_burst_size_mb: 8

compactor:
  working_directory: /loki/compactor
  retention_enabled: true
Seven-day log retention keeps disk usage manageable. For longer retention, I ship critical logs to S3-compatible storage (MinIO, also running in a container).
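On the query side, Promtail labels each stream with its container, so most debugging in Grafana starts from a couple of LogQL queries like these (the `container` label name depends on your Promtail relabeling config):

```logql
# Error lines from one service over the selected range
{container="directus"} |= "error"

# Error rate per container, useful as a dashboard panel
sum by (container) (rate({container=~".+"} |= "error" [5m]))
```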
Key Alerts
# alerting-rules.yml
groups:
  - name: container-health
    rules:
      - alert: ContainerDown
        expr: absent(container_last_seen{name=~".+"}) or (time() - container_last_seen{name=~".+"} > 300)
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} is down"

      - alert: HighMemoryUsage
        expr: |
          (container_memory_working_set_bytes{name=~".+"}
            / on(name) container_spec_memory_limit_bytes{name=~".+"}) > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.name }} using {{ $value | humanizePercentage }} of memory limit"

      - alert: HostMemoryPressure
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Host has less than 10% memory available"

      - alert: SwapUsageHigh
        expr: |
          (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes)
            / node_memory_SwapTotal_bytes > 0.7
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Swap usage above 70% — system under memory pressure"
Alerts go to ntfy (self-hosted notification service — also running in a container), which pushes to my phone.
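Wiring Alertmanager to ntfy can be as simple as a webhook receiver. A sketch, not my exact config: the URL assumes an ntfy container reachable by name on the Docker network. Posting raw Alertmanager JSON this way works but renders as unformatted text; community bridge projects exist for nicer messages:

```yaml
# alertmanager.yml (fragment) — an assumed setup
route:
  receiver: ntfy
receivers:
  - name: ntfy
    webhook_configs:
      - url: http://ntfy/alerts-topic   # ntfy topic on the internal network
        send_resolved: true
```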
Resource Optimization Strategies
Strategy 1: Alpine and Slim Base Images
Every container uses the smallest viable base image:
# Good: Alpine variants save 100-500MB per container
postgres: postgres:16-alpine # 80MB vs 380MB
redis: redis:7-alpine # 30MB vs 130MB
node-app: node:20-alpine # 50MB vs 350MB
# Good: Slim variants for Debian-based
python-app: python:3.12-slim # 130MB vs 900MB
Across 90 containers, using slim/alpine images instead of full images saves roughly 15-20 GB of disk space and reduces memory overhead from shared library loading.
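For in-house apps the same principle applies through multi-stage builds: compile with the full toolchain, ship only the runtime layer. A sketch for a Node app (paths, scripts, and the dist/ entrypoint are illustrative):

```dockerfile
# Build stage: full toolchain available
FROM node:20-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Runtime stage: only production deps and build output
FROM node:20-alpine
WORKDIR /app
ENV NODE_ENV=production
COPY --from=build /app/package*.json ./
RUN npm ci --omit=dev
COPY --from=build /app/dist ./dist
USER node
CMD ["node", "dist/index.js"]
```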
Strategy 2: Memory Limits on Everything
Every container has explicit memory limits:
services:
  # Stateless web UIs — minimal memory
  excalidraw:
    deploy:
      resources:
        limits:
          memory: 64M
  it-tools:
    deploy:
      resources:
        limits:
          memory: 64M

  # Application servers — moderate memory
  n8n:
    deploy:
      resources:
        limits:
          memory: 512M
  gitea:
    deploy:
      resources:
        limits:
          memory: 384M

  # Databases — controlled allocation
  postgres:
    deploy:
      resources:
        limits:
          memory: 1G

  # Monitoring — Prometheus is the hungriest
  prometheus:
    deploy:
      resources:
        limits:
          memory: 768M
Without limits, a single misbehaving container can exhaust host memory and trigger the kernel OOM killer, which may take down unrelated services. With limits, the misbehaving container gets killed while everything else keeps running.
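Limits only protect you if every service actually has one, so it's worth linting the compose file before deploying. A crude sketch (a text scan, not a YAML parser) that assumes the consistent two-space indentation used throughout; the sample file stands in for the real one:

```shell
# Flag services that never set a "memory:" limit. Crude but good
# enough for a consistently indented file.
cat > /tmp/limits-check.yml <<'EOF'
services:
  redis:
    deploy:
      resources:
        limits:
          memory: 192M
  excalidraw:
    image: excalidraw/excalidraw:latest
EOF

missing=$(awk '
  /^  [A-Za-z0-9_-]+:$/ {                 # a new service starts
    if (svc != "" && !has_limit) print svc
    svc = $1; sub(/:$/, "", svc); has_limit = 0
  }
  /memory:/ { has_limit = 1 }
  END { if (svc != "" && !has_limit) print svc }
' /tmp/limits-check.yml)

echo "services without memory limits: $missing"
```

Note the scan will also be satisfied by any line containing `memory:`, such as an environment variable; for a stricter check, parse the YAML properly.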
Strategy 3: Restart Policies
Different restart policies for different service types:
# Stateless services: always restart (even on exit code 0)
excalidraw:
  restart: always          # Nginx-based, no state, just restart it

# Stateful services: unless-stopped (respect manual stops)
postgres:
  restart: unless-stopped

# Batch jobs: no restart (run once and done)
backup-worker:
  restart: "no"
A clarification I wish I'd internalized earlier: both always and unless-stopped restart a container regardless of its exit code; the policy that ignores clean exits (code 0) is on-failure. The practical difference between the first two is that unless-stopped will not bring a container back after a manual docker stop, which is what you want for databases you occasionally take down for maintenance. For stateless nginx containers that occasionally exit cleanly due to config reload edge cases, either always or unless-stopped works; on-failure would leave them dead.
Strategy 4: Shared PostgreSQL with Tuned Settings
# PostgreSQL tuning for constrained memory
# (notes below the command — a "#" comment after "\" would break
# line continuation in a real shell)
postgres \
  -c shared_buffers=256MB \
  -c effective_cache_size=512MB \
  -c work_mem=4MB \
  -c maintenance_work_mem=64MB \
  -c max_connections=200 \
  -c random_page_cost=1.1 \
  -c wal_buffers=8MB \
  -c checkpoint_completion_target=0.9

# shared_buffers:       25% of the postgres memory limit
# effective_cache_size: total expected cache
# work_mem:             per-sort operation (low!)
# maintenance_work_mem: for VACUUM, index builds
# max_connections:      15 apps × ~10 connections each
# random_page_cost:     SSD storage
# wal_buffers:          write-ahead log buffer
work_mem=4MB matches the PostgreSQL default, and here that low value is deliberate. Each sort or hash operation can allocate up to work_mem, so with 200 connections the worst case is at least 200 × 4MB = 800MB for sort operations alone. Raising it to the 64MB some tuning guides recommend would risk 200 × 64MB = 12.8 GB, nearly all available RAM.
Strategy 5: Swap as Safety Net
# 8GB swap file
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Tuning: prefer RAM, use swap only under pressure
vm.swappiness=10
vm.vfs_cache_pressure=50
Swap is not a substitute for RAM. It's a safety net that prevents OOM kills during traffic spikes. With swappiness=10, the kernel strongly prefers RAM and only swaps under real pressure.
On a typical day, swap usage sits at 2-4 GB. During peak loads (all services active, multiple builds running, monitoring ingesting), it can climb toward the 8 GB limit. The system remains responsive because only inactive memory pages get swapped.
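One footnote: sysctl values set at runtime vanish on reboot. A drop-in file keeps the tuning persistent (the filename is conventional, not mandated):

```ini
# /etc/sysctl.d/99-padc-swap.conf
vm.swappiness=10
vm.vfs_cache_pressure=50

# Apply without rebooting:
#   sudo sysctl --system
```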
Traefik: The Single Entry Point
All HTTP traffic enters through Traefik, which handles SSL termination, routing, and load balancing:
# traefik.yml
entryPoints:
  web:
    address: ":80"
    http:
      redirections:
        entryPoint:
          to: websecure
          scheme: https
  websecure:
    address: ":443"
    http:
      tls:
        certResolver: letsencrypt

certificatesResolvers:
  letsencrypt:
    acme:
      email: [email protected]
      storage: /acme/acme.json
      httpChallenge:
        entryPoint: web

providers:
  docker:
    exposedByDefault: false
    network: web
Services register themselves via Docker labels:
# Any service becomes publicly accessible with 3 labels
my-app:
  labels:
    - traefik.enable=true
    - traefik.http.routers.my-app.rule=Host(`app.techsaas.cloud`)
    - traefik.http.routers.my-app.tls.certresolver=letsencrypt
Traefik automatically discovers services, obtains Let's Encrypt certificates, and routes traffic. Adding a new public service takes 30 seconds.
For services that shouldn't be public, I use Cloudflare Tunnels:
cloudflared:
  image: cloudflare/cloudflared:latest
  restart: unless-stopped
  command: tunnel run
  environment:
    TUNNEL_TOKEN: ${CF_TUNNEL_TOKEN}
  networks:
    - web
This exposes internal services through Cloudflare's network without opening any inbound ports. Grafana, Gitea, and admin panels are accessible through the tunnel with Cloudflare Access providing authentication.
Backup Strategy
#!/bin/bash
# backup.sh — runs daily at 3 AM via cron
set -euo pipefail

BACKUP_DIR=/mnt/backups/$(date +%Y-%m-%d)
mkdir -p "$BACKUP_DIR"

# PostgreSQL: dump all databases
docker exec postgres pg_dumpall -U postgres | gzip > "$BACKUP_DIR/postgres.sql.gz"

# Docker volumes: selective backup
for vol in gitea-data directus-uploads n8n-data bookstack-data; do
  docker run --rm -v "${vol}":/data -v "$BACKUP_DIR":/backup \
    alpine tar czf "/backup/${vol}.tar.gz" -C /data .
done

# Configuration files
tar czf "$BACKUP_DIR/config.tar.gz" \
  /mnt/projects/infra/docker-compose.yml \
  /mnt/projects/infra/traefik/ \
  /mnt/projects/infra/.env

# Retention: drop local backups older than 30 days
find /mnt/backups -maxdepth 1 -mtime +30 -type d -exec rm -rf {} \;

# Upload to S3-compatible storage
rclone sync "$BACKUP_DIR" remote:padc-backups/$(date +%Y-%m-%d)/
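A backup you have never test-read is a hope, not a backup. At minimum I'd verify each gzip archive right after the run; a self-contained sketch that uses a throwaway file in place of a real pg_dumpall dump:

```shell
# Verify every .gz in a backup dir is readable end-to-end;
# gzip -t decompresses to /dev/null and fails on corruption.
BACKUP_DIR=$(mktemp -d)
echo "fake dump" | gzip > "$BACKUP_DIR/postgres.sql.gz"   # stand-in for the real dump

status=ok
for f in "$BACKUP_DIR"/*.gz; do
  gzip -t "$f" || { echo "CORRUPT: $f"; status=bad; }
done
echo "backup check: $status"
```

Periodically restoring a dump into a scratch database is the stronger test, but integrity checking catches the most common failure (a truncated dump) for free.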
The Daily Reality
Here's what a typical day looks like resource-wise:
$ docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.CPUPerc}}"
NAME          MEM USAGE        CPU %
postgres      812MB / 1GB      2.3%
prometheus    624MB / 768MB    1.1%
loki          445MB / 512MB    0.8%
directus      387MB / 512MB    1.5%
postiz        356MB / 512MB    3.2%
n8n           298MB / 512MB    0.9%
gitea         267MB / 384MB    0.4%
grafana       198MB / 256MB    0.3%
traefik       142MB / 256MB    0.5%
redis         98MB / 192MB     0.1%
... (80+ more containers between 20-150MB each)
Total memory: ~11-12 GB used out of 14 GB, with 2-4 GB in swap. CPU averages 15-25% utilization with spikes to 60-70% during builds or heavy API usage.
What I'd Do Differently
Start with memory limits from day one. I added them retroactively after the first OOM incident. Some containers had been silently consuming 2 GB.
Use Loki from the start, not ELK. I initially ran Elasticsearch for logs. It consumed 2 GB of RAM by itself. Loki does the same job in 400 MB.
Invest in proper secret management earlier. I started with .env files. Moving to proper secrets management after 60+ services was painful.
Don't run databases without connection pooling. PgBouncer should have been there from the start. Without it, idle connections from 15 applications consumed significant PostgreSQL memory.
The Bottom Line
Running 90+ containers on 14 GB of RAM is possible, educational, and occasionally stressful. The constraints force good engineering habits: memory limits, efficient base images, shared infrastructure, aggressive monitoring.
Is this the right architecture for a team? Probably not — you'd want Kubernetes for multi-node scaling and proper high availability. But for a personal infrastructure that needs to run dozens of services reliably, Docker Compose on a single well-monitored server is surprisingly effective.
The key insight: constrained resources don't limit what you can build. They limit what you can waste.