How We Monitor 90+ Docker Containers with Prometheus, Grafana, and Loki

A production-tested guide to monitoring 90+ Docker containers on constrained hardware. Covers Prometheus metric collection, Grafana dashboards, Loki log aggregation, alerting via Alertmanager, and the specific optimizations that keep our monitoring stack under 1.5 GB of RAM.

TechSaaS Team
15 min read

The Monitoring Problem at Scale on Constrained Hardware

When you run 90+ Docker containers on a single server with 14 GB of RAM, monitoring becomes both critically important and resource-constrained. You can't afford to not monitor — a single runaway container can OOM-kill your entire stack. But you also can't afford to dedicate 4 GB to monitoring when your applications need that memory to run.

Our monitoring stack — Prometheus, Grafana, Loki, Promtail, cAdvisor, and node-exporter — runs on less than 1.5 GB total. It monitors every container, collects 47,000+ active time series, aggregates logs from all 90+ services, and alerts on 23 conditions.

This is the exact configuration we run in production.

Architecture

┌────────────────────────────────────────────────────────────┐
│                     Docker Host (14 GB RAM)                  │
│                                                              │
│  ┌─────────────────────────────────────────────────────┐    │
│  │  90+ Application Containers                          │    │
│  │  (postgres, redis, directus, gitea, n8n, ...)       │    │
│  └──────────────────┬──────────────────────────────────┘    │
│                     │ Docker socket + log files              │
│  ┌──────────────────┼──────────────────────────────────┐    │
│  │  Monitoring Stack│(~1.5 GB total)                    │    │
│  │                  │                                    │    │
│  │  ┌───────────┐  │  ┌───────────┐  ┌──────────────┐  │    │
│  │  │ cAdvisor  │──┘  │ Promtail  │  │ node_exporter│  │    │
│  │  │ (150 MB)  │     │ (80 MB)   │  │ (20 MB)      │  │    │
│  │  └─────┬─────┘     └─────┬─────┘  └──────┬───────┘  │    │
│  │        │                 │               │           │    │
│  │        ▼                 ▼               │           │    │
│  │  ┌───────────┐     ┌──────────┐          │           │    │
│  │  │Prometheus │◀────┤          │◀─────────┘           │    │
│  │  │ (650 MB)  │     │   Loki   │                      │    │
│  │  └─────┬─────┘     │ (400 MB) │                      │    │
│  │        │           └─────┬────┘                      │    │
│  │        ▼                 │                           │    │
│  │  ┌───────────┐          │                            │    │
│  │  │  Grafana  │◀─────────┘                            │    │
│  │  │ (200 MB)  │                                       │    │
│  │  └─────┬─────┘                                       │    │
│  │        │                                             │    │
│  │        ▼                                             │    │
│  │  ┌──────────────┐                                    │    │
│  │  │ Alertmanager │──▶ ntfy (push notifications)       │    │
│  │  │ (30 MB)      │                                    │    │
│  │  └──────────────┘                                    │    │
│  └──────────────────────────────────────────────────────┘    │
└────────────────────────────────────────────────────────────┘

Docker Compose: The Monitoring Stack

# Monitoring services within the main docker-compose.yml
services:
  # ---- Metrics Collection ----
  prometheus:
    image: prom/prometheus:v2.51.0
    restart: unless-stopped
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./monitoring/alert-rules.yml:/etc/prometheus/alert-rules.yml:ro
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d'
      - '--storage.tsdb.retention.size=2GB'
      - '--storage.tsdb.wal-compression'
      - '--web.enable-lifecycle'
      - '--web.enable-admin-api'
    networks:
      - monitoring
      - backend
    deploy:
      resources:
        limits:
          memory: 768M
        reservations:
          memory: 256M

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    restart: unless-stopped
    privileged: true
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    command:
      - '--housekeeping_interval=30s'
      - '--docker_only=true'
      - '--disable_metrics=percpu,sched,tcp,udp,disk,diskIO,hugetlb,referenced_memory,resctrl,process'
      - '--store_container_labels=false'
    networks:
      - monitoring
    deploy:
      resources:
        limits:
          memory: 192M
        reservations:
          memory: 64M

  node-exporter:
    image: prom/node-exporter:v1.7.0
    restart: unless-stopped
    pid: host
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    networks:
      - monitoring
    deploy:
      resources:
        limits:
          memory: 64M

  # ---- Visualization ----
  grafana:
    image: grafana/grafana:10.4.0
    restart: unless-stopped
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}
      GF_INSTALL_PLUGINS: grafana-clock-panel
      GF_SERVER_ROOT_URL: https://grafana.techsaas.cloud
      # Performance tuning
      GF_DATABASE_WAL: 'true'
      GF_DASHBOARDS_MIN_REFRESH_INTERVAL: 10s
      GF_UNIFIED_ALERTING_EVALUATION_TIMEOUT: 30s
    volumes:
      - grafana-data:/var/lib/grafana
      - ./monitoring/grafana/provisioning:/etc/grafana/provisioning:ro
    networks:
      - monitoring
      - web
    labels:
      - traefik.enable=true
      - traefik.http.routers.grafana.rule=Host(`grafana.techsaas.cloud`)
      - traefik.http.routers.grafana.tls.certresolver=letsencrypt
    deploy:
      resources:
        limits:
          memory: 256M
        reservations:
          memory: 128M

  # ---- Log Aggregation ----
  loki:
    image: grafana/loki:2.9.5
    restart: unless-stopped
    volumes:
      - ./monitoring/loki.yml:/etc/loki/local-config.yaml:ro
      - loki-data:/loki
    command: -config.file=/etc/loki/local-config.yaml
    networks:
      - monitoring
    deploy:
      resources:
        limits:
          memory: 512M
        reservations:
          memory: 128M

  promtail:
    image: grafana/promtail:2.9.5
    restart: unless-stopped
    volumes:
      - ./monitoring/promtail.yml:/etc/promtail/config.yml:ro
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
    command: -config.file=/etc/promtail/config.yml
    networks:
      - monitoring
    deploy:
      resources:
        limits:
          memory: 128M
        reservations:
          memory: 32M

  # ---- Alerting ----
  alertmanager:
    image: prom/alertmanager:v0.27.0
    restart: unless-stopped
    volumes:
      - ./monitoring/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    networks:
      - monitoring
    deploy:
      resources:
        limits:
          memory: 64M

Total memory allocation: 768 + 192 + 64 + 256 + 512 + 128 + 64 = 1,984 MB limits / ~1,400 MB actual usage. That's ~10% of our 14 GB budget for complete observability.
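A trivial script keeps that arithmetic honest as services get added to the stack (the numbers mirror the limits above):

```shell
# Sum the per-service memory limits (MiB) from the compose file above:
# prometheus cadvisor node-exporter grafana loki promtail alertmanager
limits="768 192 64 256 512 128 64"
total=0
for m in $limits; do
  total=$((total + m))
done
echo "total limit: ${total} MiB"
```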

Prometheus Configuration: The Key to Efficiency

The default Prometheus setup would consume 2-3 GB scraping 90 containers. Our configuration keeps it under 650 MB through aggressive metric filtering:

# monitoring/prometheus.yml
global:
  scrape_interval: 30s       # Default 15s — halving frequency halves memory
  evaluation_interval: 30s
  scrape_timeout: 10s

rule_files:
  - /etc/prometheus/alert-rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  # ---- Container Metrics via cAdvisor ----
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
    scrape_interval: 30s
    # THIS IS THE KEY: only keep metrics we actually use
    metric_relabel_configs:
      # Keep only essential container metrics (oom_events_total feeds the
      # ContainerOOMKilled alert)
      - source_labels: [__name__]
        regex: 'container_(cpu_usage_seconds_total|cpu_system_seconds_total|memory_usage_bytes|memory_working_set_bytes|memory_cache|network_receive_bytes_total|network_transmit_bytes_total|fs_usage_bytes|fs_limit_bytes|last_seen|oom_events_total|spec_memory_limit_bytes|spec_cpu_quota)'
        action: keep
      # Drop metrics from pause containers and system containers
      - source_labels: [name]
        regex: '(POD|)'
        action: drop
      # Drop high-cardinality labels we don't need
      - regex: '(container_label_.+|id|image)'
        action: labeldrop

  # ---- Host Metrics ----
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
    scrape_interval: 30s
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'node_(cpu_seconds_total|memory_.+|filesystem_.+|disk_.+|network_.+|load.+|boot_time_seconds)'
        action: keep

  # ---- Traefik Metrics ----
  - job_name: 'traefik'
    static_configs:
      - targets: ['traefik:8082']
    scrape_interval: 30s
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'traefik_(entrypoint|router|service)_.+'
        action: keep

  # ---- PostgreSQL Metrics ----
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']
    scrape_interval: 60s    # Database metrics change slowly
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'pg_(stat_user_tables_.+|stat_database_.+|settings_.+|up|stat_bgwriter_.+|replication_.+)'
        action: keep

  # ---- Redis Metrics ----
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
    scrape_interval: 30s

Why Metric Filtering Matters

Without filtering, cAdvisor exposes ~500 metrics per container. With 90 containers, that's 45,000 time series from cAdvisor alone. Each time series consumes ~3 KB of Prometheus memory. That's 135 MB just for cAdvisor — before node-exporter, Traefik, PostgreSQL, or any other targets.

With our filtering, cAdvisor exposes ~12 metrics per container = 1,080 time series = ~3.2 MB. We reduced cAdvisor's Prometheus memory footprint by 97% while keeping every metric we actually alert on or dashboard.

The --disable_metrics flag on cAdvisor itself prevents the metrics from even being generated, which also reduces cAdvisor's own CPU and memory usage.
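The cardinality arithmetic above is worth re-running whenever the keep regex changes; a quick sketch, assuming ~3 KB of head memory per series as described:

```shell
containers=90
bytes_per_series=3000              # ~3 KB of Prometheus head memory per series

unfiltered=$((containers * 500))   # ~500 metrics/container without filtering
filtered=$((containers * 12))      # ~12 metrics/container with the keep regex

echo "unfiltered: ${unfiltered} series, ~$((unfiltered * bytes_per_series / 1000000)) MB"
echo "filtered:   ${filtered} series, ~$((filtered * bytes_per_series / 1000000)) MB"
```

To check the real series count after a config change, query Prometheus itself: the `/api/v1/status/tsdb` endpoint reports the head block's active series.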

Alert Rules: 23 Conditions That Matter

# monitoring/alert-rules.yml (excerpt — the rules we reach for most)
groups:
  - name: container-health
    interval: 30s
    rules:
      # ---- Critical: Things that page me immediately ----
      - alert: ContainerOOMKilled
        expr: |
          increase(container_oom_events_total[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "OOM kill detected in {{ $labels.name }}"
          description: "Container {{ $labels.name }} was OOM killed. Check memory limits."

      - alert: ContainerRestarting
        expr: |
          increase(container_last_seen{name=~".+"}[5m]) == 0
          and on(name) container_last_seen{name=~".+"} > 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} appears to be crash-looping"

      - alert: HostMemoryCritical
        expr: |
          (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Host memory below 5% — OOM kills imminent"
          description: "Available: {{ $value | humanizePercentage }}. Immediate action required."

      - alert: PostgresDown
        expr: pg_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL is down — multiple services affected"

      # ---- Warning: Things I should check soon ----
      - alert: ContainerHighMemory
        expr: |
          (
            container_memory_working_set_bytes{name=~".+"}
            / on(name)
            (container_spec_memory_limit_bytes{name=~".+"} > 0)
          ) > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.name }} at {{ $value | humanizePercentage }} memory"

      - alert: ContainerHighCPU
        expr: |
          rate(container_cpu_usage_seconds_total{name=~".+"}[5m]) > 0.8
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.name }} sustained CPU > 80%"

      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"}
           / node_filesystem_size_bytes{mountpoint="/"}) < 0.15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Root filesystem {{ $value | humanizePercentage }} free"

      - alert: SwapHigh
        expr: |
          (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) > 0.7
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Swap usage at {{ $value | humanizePercentage }}"

      - alert: TraefikHighErrorRate
        expr: |
          sum(rate(traefik_service_requests_total{code=~"5.."}[5m]))
          /
          sum(rate(traefik_service_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Traefik 5xx error rate above 5%"

      - alert: PostgresConnectionsHigh
        expr: |
          sum(pg_stat_database_numbackends{datname!~"template.*"})
          / scalar(pg_settings_max_connections) > 0.75
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL connections at {{ $value | humanizePercentage }} of max"

      - alert: PrometheusStorageHigh
        expr: |
          prometheus_tsdb_storage_blocks_bytes / (2 * 1024 * 1024 * 1024) > 0.85
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Prometheus storage at {{ $value | humanizePercentage }} of 2GB limit"


Alertmanager: Routing to ntfy

# monitoring/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'ntfy-all'
  routes:
    - match:
        severity: critical
      receiver: 'ntfy-critical'
      repeat_interval: 15m      # Repeat critical alerts every 15 min
    - match:
        severity: warning
      receiver: 'ntfy-all'
      repeat_interval: 4h

receivers:
  - name: 'ntfy-critical'
    webhook_configs:
      - url: 'http://ntfy:80/padc-atlas'
        send_resolved: true
        http_config:
          basic_auth:
            username: atlas
            password_file: /etc/alertmanager/ntfy-password

  - name: 'ntfy-all'
    webhook_configs:
      - url: 'http://ntfy:80/padc-monitoring'
        send_resolved: true

Critical alerts (OOM, PostgreSQL down, host memory critical) go to the padc-atlas topic with high priority — my phone buzzes immediately. Warnings go to padc-monitoring at normal priority.
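Alertmanager speaks its own webhook JSON, which ntfy doesn't consume directly, so a small bridge sits between them in our setup. The core of the translation — mapping alert severity to an ntfy priority — can be sketched in shell (the payload is a trimmed sample in Alertmanager's webhook format; the mapping logic is our own glue, not a standard tool):

```shell
# Map Alertmanager severity -> ntfy priority (hypothetical bridge logic)
payload='{"status":"firing","commonLabels":{"alertname":"PostgresDown","severity":"critical"}}'

# crude extraction for illustration; a real bridge parses the JSON properly
severity=$(printf '%s' "$payload" | sed -n 's/.*"severity":"\([^"]*\)".*/\1/p')

case "$severity" in
  critical) priority=urgent ;;    # buzzes the phone immediately
  warning)  priority=default ;;
  *)        priority=low ;;
esac

echo "severity=${severity} -> ntfy priority=${priority}"
```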

Loki Configuration: Logs Under Control

# monitoring/loki.yml
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: warn        # Loki's own logs — keep quiet

ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
  chunk_idle_period: 5m
  chunk_retain_period: 30s
  max_transfer_retries: 0
  wal:
    enabled: true
    dir: /loki/wal

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/cache
  filesystem:
    directory: /loki/chunks

limits_config:
  retention_period: 168h              # 7-day retention
  max_query_series: 5000
  ingestion_rate_mb: 4
  ingestion_burst_size_mb: 8
  max_entries_limit_per_query: 5000
  max_streams_per_user: 10000
  reject_old_samples: true
  reject_old_samples_max_age: 168h

compactor:
  working_directory: /loki/compactor
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
  delete_request_store: filesystem

chunk_store_config:
  max_look_back_period: 168h

table_manager:
  retention_deletes_enabled: true
  retention_period: 168h

Promtail: Smart Log Shipping

# monitoring/promtail.yml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push
    batchwait: 1s
    batchsize: 1048576     # 1 MB batches

scrape_configs:
  - job_name: docker
    static_configs:
      - targets:
          - localhost
        labels:
          job: docker
          __path__: /var/lib/docker/containers/*/*-json.log

    pipeline_stages:
      # Parse Docker JSON log format
      - docker: {}

      # Extract container name from path
      - regex:
          source: filename
          expression: '/var/lib/docker/containers/(?P<container_id>[^/]+)/.*'

      # Add container labels
      - labels:
          container_id:

      # Drop noisy health check logs (saves ~40% log volume).
      # Note: a match stage with `action: drop` cannot contain nested
      # stages, so the line filter lives in the selector itself; dropped
      # lines are counted by reason in logentry_dropped_lines_total.
      - match:
          selector: '{job="docker"} |~ `(?i)GET\s+/(health|ready|metrics|ping)\s+HTTP`'
          action: drop
          drop_counter_reason: health_check

      # Drop debug-level logs in production
      - match:
          selector: '{job="docker"} |~ `(?i)^(DEBUG|TRACE)`'
          action: drop
          drop_counter_reason: debug_level

      # Rate-limit extremely chatty containers
      - limit:
          rate: 100          # Max 100 lines/sec (global; use by_label_name to limit per container)
          burst: 200
          drop: true

The pipeline stages are where the real savings happen:

Optimization                        Log Volume Reduction
Drop health check logs              ~40%
Drop DEBUG/TRACE logs               ~15%
Rate limiting chatty containers     ~5-10%
Total reduction                     ~55-60%

Without these filters, Loki would need 1+ GB of RAM and significantly more disk. With them, it runs comfortably in 400 MB.
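Before trusting a drop filter with production logs, check the pattern against sample lines; grep -E approximates Promtail's RE2 matching closely enough for a smoke test (the log lines here are made up):

```shell
# The health-check pattern from the pipeline, in grep -E form
# (-i replaces the (?i) flag, " +" replaces \s+)
re='GET +/(health|ready|metrics|ping) +HTTP'

echo 'nginx | 10.0.0.5 "GET /health HTTP/1.1" 200' | grep -Eiq "$re" \
  && echo "health check line: dropped"
echo 'nginx | 10.0.0.5 "GET /api/users HTTP/1.1" 200' | grep -Eiq "$re" \
  || echo "normal request: kept"
```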

Grafana Dashboards

The Container Overview Dashboard

We provision dashboards as code:

# monitoring/grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
  - name: 'PADC Dashboards'
    orgId: 1
    folder: 'PADC'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /etc/grafana/provisioning/dashboards/json
      foldersFromFilesStructure: true
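Dashboards are only half of it — we provision datasources the same way. A minimal sketch (the URLs assume the compose service names used throughout this post):

```yaml
# monitoring/grafana/provisioning/datasources/datasources.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
```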

Key PromQL Queries

These are the queries powering our most-used panels:

# Top 10 containers by memory (working set, not cache)
topk(10,
  container_memory_working_set_bytes{name=~".+"}
)

# Memory usage as % of limit ("> 0" is a value filter — a label matcher
# like {metric!="0"} cannot filter on sample values)
(
  container_memory_working_set_bytes{name=~".+"}
  / on(name)
  (container_spec_memory_limit_bytes{name=~".+"} > 0)
) * 100

# CPU usage per container (percentage of one core)
rate(container_cpu_usage_seconds_total{name=~".+"}[5m]) * 100

# Network I/O per container
rate(container_network_receive_bytes_total{name=~".+"}[5m])
rate(container_network_transmit_bytes_total{name=~".+"}[5m])

# Host memory breakdown
node_memory_MemTotal_bytes
  - node_memory_MemAvailable_bytes  # Used
node_memory_Cached_bytes             # File cache
node_memory_Buffers_bytes            # Buffers
node_memory_SwapTotal_bytes
  - node_memory_SwapFree_bytes       # Swap used

# Disk I/O saturation
rate(node_disk_io_time_seconds_total[5m])  # approaching 1.0 means the device is saturated

# Traefik request rate by service
sum by (service) (
  rate(traefik_service_requests_total[5m])
)

# Traefik error rate by service
sum by (service) (
  rate(traefik_service_requests_total{code=~"5.."}[5m])
)
/
sum by (service) (
  rate(traefik_service_requests_total[5m])
)

# PostgreSQL active connections per database
pg_stat_database_numbackends{datname!~"template.*|postgres"}

# PostgreSQL cache hit ratio (should be >99%)
pg_stat_database_blks_hit{datname!~"template.*"}
/
(
  pg_stat_database_blks_hit{datname!~"template.*"}
  + pg_stat_database_blks_read{datname!~"template.*"}
)

LogQL Queries for Loki

Note: the container_name label below assumes your Promtail config maps container IDs to names (e.g. via docker_sd_configs relabeling); the minimal config above only attaches container_id.

# Error logs across all containers in last hour
{job="docker"} |~ "(?i)(error|exception|fatal|panic)" | line_format "{{.container_name}}: {{.message}}"

# Logs from a specific container
{job="docker", container_name="directus"}

# Count errors per container over time (for graph panel)
sum by (container_name) (
  count_over_time(
    {job="docker"} |~ "(?i)error" [5m]
  )
)

# PostgreSQL slow queries
{job="docker", container_name="postgres"} |~ "duration:" | regexp `duration: (?P<duration>[\d.]+) ms` | duration > 100

# OOM events
{job="docker"} |~ "(?i)(out of memory|oom|killed process)"

The Resource Optimization Playbook

Optimization 1: Scrape Interval

Default: 15s scrape interval
→ 90 containers × 12 metrics × 4 samples/min = 4,320 samples/min
→ Prometheus memory: ~900 MB

Optimized: 30s scrape interval
→ 90 containers × 12 metrics × 2 samples/min = 2,160 samples/min
→ Prometheus memory: ~500 MB

Saved: ~400 MB (44% reduction)

30-second resolution is perfectly adequate for container monitoring. If a container OOMs, you'll know within 30 seconds. That's fast enough.

Optimization 2: Retention Limits

# Prometheus: size-based + time-based retention
--storage.tsdb.retention.time=15d
--storage.tsdb.retention.size=2GB

# Loki: 7-day retention with compaction
retention_period: 168h

15 days of metrics and 7 days of logs covers most debugging scenarios. Anything older goes to long-term storage (optional S3/MinIO backup) or simply ages out.

Optimization 3: WAL Compression

# Prometheus
--storage.tsdb.wal-compression

# Loki
ingester:
  wal:
    enabled: true

WAL compression reduces Prometheus disk I/O by 30-50% with negligible CPU overhead.

Optimization 4: cAdvisor Metric Disabling

command:
  - '--disable_metrics=percpu,sched,tcp,udp,disk,diskIO,hugetlb,referenced_memory,resctrl,process'
  - '--housekeeping_interval=30s'
  - '--docker_only=true'
  - '--store_container_labels=false'

This reduces cAdvisor's own memory from 300+ MB to under 150 MB, and reduces the metrics it exposes by 80%.

Operational Runbook

"Something Is Slow" Investigation

# 1. Check host resources
curl -s 'localhost:9090/api/v1/query?query=node_memory_MemAvailable_bytes' | jq
curl -s 'localhost:9090/api/v1/query?query=node_load1' | jq

# 2. Find top memory consumers
curl -s 'localhost:9090/api/v1/query?query=topk(5,container_memory_working_set_bytes)' | jq '.data.result[] | {name: .metric.name, bytes: .value[1]}'

# 3. Find high-CPU containers
curl -s 'localhost:9090/api/v1/query?query=topk(5,rate(container_cpu_usage_seconds_total[5m]))' | jq

# 4. Check PostgreSQL
curl -s 'localhost:9090/api/v1/query?query=pg_stat_database_numbackends' | jq

# 5. Check Traefik for error spikes
curl -s 'localhost:9090/api/v1/query?query=sum+by+(service)(rate(traefik_service_requests_total{code=~"5.."}[5m]))' | jq
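Typing those URLs by hand gets old. A tiny helper builds the query URL — naive escaping that only handles spaces, which covers the runbook queries above (promq_url is our own name, not a standard tool):

```shell
# Build a Prometheus instant-query URL (spaces-only escaping)
promq_url() {
  printf 'http://localhost:9090/api/v1/query?query=%s\n' \
    "$(printf '%s' "$1" | sed 's/ /%20/g')"
}

promq_url 'topk(5, container_memory_working_set_bytes)'
```

Usage: `curl -s "$(promq_url 'node_load1')" | jq`.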

"Container Keeps Dying" Investigation

# Check if it's OOM
docker inspect <container> --format '{{.State.OOMKilled}}'

# Check recent logs
curl -s 'localhost:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={container_name="<name>"}' \
  --data-urlencode 'limit=50' | jq '.data.result[0].values[][1]'

# Check memory trajectory (was it creeping up?)
curl -s 'localhost:9090/api/v1/query_range?query=container_memory_working_set_bytes{name="<name>"}&start=2026-03-19T00:00:00Z&end=2026-03-19T23:59:59Z&step=5m' | jq

What We've Caught

Real incidents detected by this monitoring stack:

  1. Postiz memory leak — Memory grew linearly from 200 MB to 512 MB limit over 3 days, then OOM. Alert fired, we added a daily restart as a stopgap while investigating the root cause.

  2. PostgreSQL connection exhaustion — n8n was leaking connections during workflow failures. The PostgresConnectionsHigh alert caught it at 75% before it hit the hard limit.

  3. Loki filling disk — Before we added retention and the Promtail filters, Loki grew 2 GB/day. The DiskSpaceLow alert saved us from a full root filesystem.

  4. Swap death spiral — During a traffic spike, swap grew to 15 GB and the system became unresponsive. We added the SwapHigh alert and now catch this before it cascades.

  5. Silent container exits — Excalidraw's nginx process would occasionally exit cleanly (code 0) and the container stayed down. The unless-stopped policy, unlike always, won't bring back a container that is already down when the Docker daemon restarts. The ContainerRestarting alert caught it, and we switched to restart: always.

The Bottom Line

Monitoring 90+ containers on 14 GB of RAM is a constraint that forces good engineering. You can't throw Datadog at it and forget it — every metric, every log line, every alert has to earn its resource footprint.

The stack described here — Prometheus, Grafana, Loki, Promtail, cAdvisor, node-exporter, and Alertmanager — runs on 1.5 GB total while providing complete visibility into every container, every host metric, and every log line.

The key optimizations: aggressive metric filtering (keep only what you dashboard and alert on), scrape interval tuning (30s is fine), log pipeline filtering (drop health checks and debug logs), and memory limits on every monitoring container.

Don't monitor everything. Monitor what matters. And make sure your monitoring stack doesn't consume more resources than the services it's watching.

#prometheus #grafana #loki #docker #monitoring #observability #alerting
