How We Monitor 90+ Docker Containers with Prometheus, Grafana, and Loki
A production-tested guide to monitoring 90+ Docker containers on constrained hardware. Covers Prometheus metric collection, Grafana dashboards, Loki log aggregation, alerting via Alertmanager, and the specific optimizations that keep our monitoring stack under 1.5 GB of RAM.
The Monitoring Problem at Scale on Constrained Hardware
When you run 90+ Docker containers on a single server with 14 GB of RAM, monitoring becomes both critically important and resource-constrained. You can't afford to not monitor — a single runaway container can OOM-kill your entire stack. But you also can't afford to dedicate 4 GB to monitoring when your applications need that memory to run.
Our monitoring stack — Prometheus, Grafana, Loki, Promtail, cAdvisor, and node-exporter — runs on less than 1.5 GB total. It monitors every container, collects 47,000+ active time series, aggregates logs from all 90+ services, and alerts on 23 conditions.
This is the exact configuration we run in production.
Architecture
Docker Host (14 GB RAM)
│
├─ 90+ Application Containers
│    (postgres, redis, directus, gitea, n8n, ...)
│        │  Docker socket + log files
│        ▼
└─ Monitoring Stack (~1.5 GB total)

     cAdvisor (150 MB) ──────▶ Prometheus (650 MB) ──▶ Grafana (200 MB)
     node_exporter (20 MB) ──▶ Prometheus
     Promtail (80 MB) ───────▶ Loki (400 MB) ────────▶ Grafana
     Prometheus ─────────────▶ Alertmanager (30 MB) ─▶ ntfy (push notifications)
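Before trusting the picture above, it helps to confirm that every collector in the diagram is actually being scraped. A quick check, assuming Prometheus is reachable on localhost:9090 as in the runbook commands later in this post:

# List every scrape target and its health; each job should report up
curl -s localhost:9090/api/v1/targets | \
  jq '.data.activeTargets[] | {job: .labels.job, health: .health}'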
Docker Compose: The Monitoring Stack
# Monitoring services within the main docker-compose.yml
services:
  # ---- Metrics Collection ----
  prometheus:
    image: prom/prometheus:v2.51.0
    restart: unless-stopped
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./monitoring/alert-rules.yml:/etc/prometheus/alert-rules.yml:ro
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d'
      - '--storage.tsdb.retention.size=2GB'
      - '--storage.tsdb.wal-compression'
      - '--web.enable-lifecycle'
      - '--web.enable-admin-api'
    networks:
      - monitoring
      - backend
    deploy:
      resources:
        limits:
          memory: 768M
        reservations:
          memory: 256M

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    restart: unless-stopped
    privileged: true
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    command:
      - '--housekeeping_interval=30s'
      - '--docker_only=true'
      - '--disable_metrics=percpu,sched,tcp,udp,disk,diskIO,hugetlb,referenced_memory,resctrl,process'
      - '--store_container_labels=false'
    networks:
      - monitoring
    deploy:
      resources:
        limits:
          memory: 192M
        reservations:
          memory: 64M

  node-exporter:
    image: prom/node-exporter:v1.7.0
    restart: unless-stopped
    pid: host
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    networks:
      - monitoring
    deploy:
      resources:
        limits:
          memory: 64M

  # ---- Visualization ----
  grafana:
    image: grafana/grafana:10.4.0
    restart: unless-stopped
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}
      GF_INSTALL_PLUGINS: grafana-clock-panel
      GF_SERVER_ROOT_URL: https://grafana.techsaas.cloud
      # Performance tuning
      GF_DATABASE_WAL: 'true'
      GF_DASHBOARDS_MIN_REFRESH_INTERVAL: 10s
      GF_UNIFIED_ALERTING_EVALUATION_TIMEOUT: 30s
    volumes:
      - grafana-data:/var/lib/grafana
      - ./monitoring/grafana/provisioning:/etc/grafana/provisioning:ro
    networks:
      - monitoring
      - web
    labels:
      - traefik.enable=true
      - traefik.http.routers.grafana.rule=Host(`grafana.techsaas.cloud`)
      - traefik.http.routers.grafana.tls.certresolver=letsencrypt
    deploy:
      resources:
        limits:
          memory: 256M
        reservations:
          memory: 128M

  # ---- Log Aggregation ----
  loki:
    image: grafana/loki:2.9.5
    restart: unless-stopped
    volumes:
      - ./monitoring/loki.yml:/etc/loki/local-config.yaml:ro
      - loki-data:/loki
    command: -config.file=/etc/loki/local-config.yaml
    networks:
      - monitoring
    deploy:
      resources:
        limits:
          memory: 512M
        reservations:
          memory: 128M

  promtail:
    image: grafana/promtail:2.9.5
    restart: unless-stopped
    volumes:
      - ./monitoring/promtail.yml:/etc/promtail/config.yml:ro
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
    command: -config.file=/etc/promtail/config.yml
    networks:
      - monitoring
    deploy:
      resources:
        limits:
          memory: 128M
        reservations:
          memory: 32M

  # ---- Alerting ----
  alertmanager:
    image: prom/alertmanager:v0.27.0
    restart: unless-stopped
    volumes:
      - ./monitoring/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    networks:
      - monitoring
    deploy:
      resources:
        limits:
          memory: 64M
Total memory allocation: 768 + 192 + 64 + 256 + 512 + 128 + 64 = 1,984 MB limits / ~1,400 MB actual usage. That's ~10% of our 14 GB budget for complete observability.
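To check that these limits match reality, a docker stats snapshot of just the monitoring containers is enough. A minimal sketch, assuming default Compose container naming (adjust the grep pattern to your project's prefix):

# Snapshot memory usage of the monitoring containers only
docker stats --no-stream --format 'table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}' \
  $(docker ps --format '{{.Names}}' | grep -E 'prometheus|cadvisor|node-exporter|grafana|loki|promtail|alertmanager')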
Prometheus Configuration: The Key to Efficiency
The default Prometheus setup would consume 2-3 GB scraping 90 containers. Our configuration keeps it under 650 MB through aggressive metric filtering:
# monitoring/prometheus.yml
global:
  scrape_interval: 30s        # Default 15s — halving frequency halves memory
  evaluation_interval: 30s
  scrape_timeout: 10s

rule_files:
  - /etc/prometheus/alert-rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  # ---- Container Metrics via cAdvisor ----
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
    scrape_interval: 30s
    # THIS IS THE KEY: only keep metrics we actually use
    metric_relabel_configs:
      # Keep only essential container metrics
      - source_labels: [__name__]
        regex: 'container_(cpu_usage_seconds_total|cpu_system_seconds_total|memory_usage_bytes|memory_working_set_bytes|memory_cache|network_receive_bytes_total|network_transmit_bytes_total|fs_usage_bytes|fs_limit_bytes|last_seen|spec_memory_limit_bytes|spec_cpu_quota)'
        action: keep
      # Drop metrics from pause containers and system containers
      - source_labels: [name]
        regex: '(POD|)'
        action: drop
      # Drop high-cardinality labels we don't need
      - regex: '(container_label_.+|id|image)'
        action: labeldrop

  # ---- Host Metrics ----
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
    scrape_interval: 30s
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'node_(cpu_seconds_total|memory_.+|filesystem_.+|disk_.+|network_.+|load.+|boot_time_seconds)'
        action: keep

  # ---- Traefik Metrics ----
  - job_name: 'traefik'
    static_configs:
      - targets: ['traefik:8082']
    scrape_interval: 30s
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'traefik_(entrypoint|router|service)_.+'
        action: keep

  # ---- PostgreSQL Metrics ----
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']
    scrape_interval: 60s        # Database metrics change slowly
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'pg_(stat_user_tables_.+|stat_database_.+|settings_.+|up|stat_bgwriter_.+|replication_.+)'
        action: keep

  # ---- Redis Metrics ----
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
    scrape_interval: 30s
Why Metric Filtering Matters
Without filtering, cAdvisor exposes ~500 metrics per container. With 90 containers, that's 45,000 time series from cAdvisor alone. Each time series consumes ~3 KB of Prometheus memory. That's 135 MB just for cAdvisor — before node-exporter, Traefik, PostgreSQL, or any other targets.
With our filtering, cAdvisor exposes ~12 metrics per container = 1,080 time series = ~3.2 MB. We reduced cAdvisor's Prometheus memory footprint by 97% while keeping every metric we actually alert on or dashboard.
The --disable_metrics flag on cAdvisor itself prevents the metrics from even being generated, which also reduces cAdvisor's own CPU and memory usage.
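The cardinality math is easy to verify against the live TSDB. A rough check, assuming Prometheus on localhost:9090:

# Total active series (should sit in the low tens of thousands, not hundreds)
curl -s 'localhost:9090/api/v1/query' \
  --data-urlencode 'query=prometheus_tsdb_head_series' | jq -r '.data.result[0].value[1]'
# Series coming from cAdvisor specifically
curl -s 'localhost:9090/api/v1/query' \
  --data-urlencode 'query=count({__name__=~"container_.*"})' | jq -r '.data.result[0].value[1]'
# Top metric names by series count: the first place to look for a cardinality leak
curl -s localhost:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName[:10]'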
Alert Rules: 23 Conditions That Matter
# monitoring/alert-rules.yml
groups:
  - name: container-health
    interval: 30s
    rules:
      # ---- Critical: Things that page me immediately ----
      - alert: ContainerOOMKilled
        expr: |
          increase(container_oom_kills_total[5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "OOM kill detected in {{ $labels.name }}"
          description: "Container {{ $labels.name }} was OOM killed. Check memory limits."

      - alert: ContainerRestarting
        expr: |
          increase(container_last_seen{name=~".+"}[5m]) == 0
          and on(name) container_last_seen{name=~".+"} > 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} appears to be crash-looping"

      - alert: HostMemoryCritical
        expr: |
          (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Host memory below 5% — OOM kills imminent"
          description: "Available: {{ $value | humanizePercentage }}. Immediate action required."

      - alert: PostgresDown
        expr: pg_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL is down — multiple services affected"

      # ---- Warning: Things I should check soon ----
      - alert: ContainerHighMemory
        expr: |
          (
            container_memory_working_set_bytes{name=~".+"}
            / on(name)
            (container_spec_memory_limit_bytes{name=~".+"} != 0)
          ) > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.name }} at {{ $value | humanizePercentage }} memory"
      - alert: ContainerHighCPU
        expr: |
          rate(container_cpu_usage_seconds_total{name=~".+"}[5m]) > 0.8
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.name }} sustained CPU > 80%"

      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"}
            / node_filesystem_size_bytes{mountpoint="/"}) < 0.15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Root filesystem {{ $value | humanizePercentage }} free"

      - alert: SwapHigh
        expr: |
          (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) > 0.7
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Swap usage at {{ $value | humanizePercentage }}"

      - alert: TraefikHighErrorRate
        expr: |
          sum(rate(traefik_service_requests_total{code=~"5.."}[5m]))
            /
          sum(rate(traefik_service_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Traefik 5xx error rate above 5%"
      - alert: PostgresConnectionsHigh
        expr: |
          sum(pg_stat_database_numbackends{datname!~"template.*"})
            / max(pg_settings_max_connections) > 0.75
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "PostgreSQL connections at {{ $value | humanizePercentage }} of max"
      - alert: PrometheusStorageHigh
        expr: |
          prometheus_tsdb_storage_blocks_bytes / (2 * 1024 * 1024 * 1024) > 0.85
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Prometheus storage at {{ $value | humanizePercentage }} of 2GB limit"
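Two habits keep rule changes safe: validate the file with promtool before reloading, and use the lifecycle endpoint (enabled above via --web.enable-lifecycle) rather than restarting the container. A sketch, assuming the Compose service names from this article:

# Validate rule syntax inside the Prometheus container (promtool ships in the image)
docker compose exec prometheus promtool check rules /etc/prometheus/alert-rules.yml
# Hot-reload the configuration without restarting Prometheus
curl -X POST localhost:9090/-/reload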
Alertmanager: Routing to ntfy
# monitoring/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'ntfy-all'
  routes:
    - match:
        severity: critical
      receiver: 'ntfy-critical'
      repeat_interval: 15m      # Repeat critical alerts every 15 min
    - match:
        severity: warning
      receiver: 'ntfy-all'
      repeat_interval: 4h

receivers:
  - name: 'ntfy-critical'
    webhook_configs:
      - url: 'http://ntfy:80/padc-atlas'
        send_resolved: true
        http_config:
          basic_auth:
            username: atlas
            password_file: /etc/alertmanager/ntfy-password
  - name: 'ntfy-all'
    webhook_configs:
      - url: 'http://ntfy:80/padc-monitoring'
        send_resolved: true
Critical alerts (OOM, PostgreSQL down, host memory critical) go to the padc-atlas topic with high priority — my phone buzzes immediately. Warnings go to padc-monitoring at normal priority.
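Routing is worth testing before a real incident does it for you. A sketch, assuming Alertmanager is reachable on localhost:9093; the alert name is a throwaway:

# Fire a synthetic critical alert; it should arrive on the padc-atlas topic
curl -s -X POST localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"RoutingTest","severity":"critical"},"annotations":{"summary":"Alertmanager routing test"}}]'
# Ask amtool which receiver a given label set would hit
docker compose exec alertmanager amtool config routes test \
  --config.file=/etc/alertmanager/alertmanager.yml severity=critical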
Loki Configuration: Logs Under Control
# monitoring/loki.yml
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: warn               # Loki's own logs — keep quiet

ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
  chunk_idle_period: 5m
  chunk_retain_period: 30s
  max_transfer_retries: 0
  wal:
    enabled: true
    dir: /loki/wal

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/cache
  filesystem:
    directory: /loki/chunks

limits_config:
  retention_period: 168h        # 7-day retention
  max_query_series: 5000
  ingestion_rate_mb: 4
  ingestion_burst_size_mb: 8
  max_entries_limit_per_query: 5000
  max_streams_per_user: 10000
  reject_old_samples: true
  reject_old_samples_max_age: 168h

compactor:
  working_directory: /loki/compactor
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
  delete_request_store: filesystem

chunk_store_config:
  max_look_back_period: 168h

table_manager:
  retention_deletes_enabled: true
  retention_period: 168h
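A couple of quick checks confirm that Loki is healthy and ingesting. A sketch, assuming Loki on localhost:3100; exact metric names can vary between Loki versions:

# Readiness and the label set Loki currently knows about
curl -s localhost:3100/ready
curl -s localhost:3100/loki/api/v1/labels | jq
# Loki's own metrics show ingestion pressure and flush activity
curl -s localhost:3100/metrics | grep -E 'loki_ingester_(memory_streams|chunks_flushed_total)'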
Promtail: Smart Log Shipping
# monitoring/promtail.yml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push
    batchwait: 1s
    batchsize: 1048576          # 1 MB batches

scrape_configs:
  - job_name: docker
    static_configs:
      - targets:
          - localhost
        labels:
          job: docker
          __path__: /var/lib/docker/containers/*/*-json.log
    pipeline_stages:
      # Parse Docker JSON log format
      - docker: {}
      # Extract container name from path
      - regex:
          source: filename
          expression: '/var/lib/docker/containers/(?P<container_id>[^/]+)/.*'
      # Add container labels
      - labels:
          container_id:
      # Drop noisy health check logs (saves ~40% log volume)
      - match:
          selector: '{job="docker"}'
          stages:
            - regex:
                expression: '(?i)(GET\s+/(health|ready|metrics|ping)\s+HTTP)'
            - metrics:
                health_check_lines_dropped:
                  type: Counter
                  description: "Health check log lines dropped"
                  source: ""
                  config:
                    action: inc
          action: drop
      # Drop debug-level logs in production
      - match:
          selector: '{job="docker"}'
          stages:
            - regex:
                expression: '(?i)^(DEBUG|TRACE)'
          action: drop
      # Rate-limit extremely chatty containers
      - limit:
          rate: 100             # Max 100 lines/sec per stream
          burst: 200
          drop: true
The pipeline stages are where the real savings happen:
| Optimization | Log Volume Reduction |
|---|---|
| Drop health check logs | ~40% |
| Drop DEBUG/TRACE logs | ~15% |
| Rate limiting chatty containers | ~5-10% |
| Total reduction | ~55-60% |
Without these filters, Loki would need 1+ GB of RAM and significantly more disk. With them, it runs comfortably in 400 MB.
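Promtail exports its own counters, so you can watch the filters pay off. A sketch, assuming Promtail's HTTP port (9080) is reachable from the host; counters defined in a metrics stage are exported with a promtail_custom_ prefix:

# How many health-check lines the pipeline has dropped so far
curl -s localhost:9080/metrics | grep promtail_custom_health_check_lines_dropped
# Bytes read from container logs vs. bytes actually shipped to Loki
curl -s localhost:9080/metrics | grep -E 'promtail_(read_bytes_total|sent_bytes_total)'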
Grafana Dashboards
The Container Overview Dashboard
We provision dashboards as code:
# monitoring/grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1
providers:
  - name: 'PADC Dashboards'
    orgId: 1
    folder: 'PADC'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /etc/grafana/provisioning/dashboards/json
      foldersFromFilesStructure: true
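Once the container is up, the Grafana HTTP API is the quickest way to confirm provisioning worked. A sketch using the admin credentials from the Compose file; the hostname is ours, so substitute your own:

# List provisioned datasources and dashboards
curl -s -u "admin:${GRAFANA_PASSWORD}" https://grafana.techsaas.cloud/api/datasources | jq '.[].name'
curl -s -u "admin:${GRAFANA_PASSWORD}" 'https://grafana.techsaas.cloud/api/search?type=dash-db' | jq '.[].title'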
Key PromQL Queries
These are the queries powering our most-used panels:
# Top 10 containers by memory (working set, not cache)
topk(10,
container_memory_working_set_bytes{name=~".+"}
)
# Memory usage as % of limit (skip containers that report no limit)
(
  container_memory_working_set_bytes{name=~".+"}
  / on(name)
  (container_spec_memory_limit_bytes{name=~".+"} != 0)
) * 100
# CPU usage per container (percentage of one core)
rate(container_cpu_usage_seconds_total{name=~".+"}[5m]) * 100
# Network I/O per container
rate(container_network_receive_bytes_total{name=~".+"}[5m])
rate(container_network_transmit_bytes_total{name=~".+"}[5m])
# Host memory breakdown
node_memory_MemTotal_bytes
- node_memory_MemAvailable_bytes # Used
node_memory_Cached_bytes # File cache
node_memory_Buffers_bytes # Buffers
node_memory_SwapTotal_bytes
- node_memory_SwapFree_bytes # Swap used
# Disk I/O saturation
rate(node_disk_io_time_seconds_total[5m]) # >1.0 means saturated
# Traefik request rate by service
sum by (service) (
rate(traefik_service_requests_total[5m])
)
# Traefik error rate by service
sum by (service) (
rate(traefik_service_requests_total{code=~"5.."}[5m])
)
/
sum by (service) (
rate(traefik_service_requests_total[5m])
)
# PostgreSQL active connections per database
pg_stat_database_numbackends{datname!~"template.*|postgres"}
# PostgreSQL cache hit ratio (should be >99%)
pg_stat_database_blks_hit{datname!~"template.*"}
/
(
pg_stat_database_blks_hit{datname!~"template.*"}
+ pg_stat_database_blks_read{datname!~"template.*"}
)
LogQL Queries for Loki
# Error logs across all containers in last hour
{job="docker"} |~ "(?i)(error|exception|fatal|panic)" | line_format "{{.container_name}}: {{.message}}"
# Logs from a specific container
{job="docker", container_name="directus"}
# Count errors per container over time (for graph panel)
sum by (container_name) (
count_over_time(
{job="docker"} |~ "(?i)error" [5m]
)
)
# PostgreSQL slow queries
{job="docker", container_name="postgres"} |~ "duration:" | regexp `duration: (?P<duration>[\d.]+) ms` | duration > 100
# OOM events
{job="docker"} |~ "(?i)(out of memory|oom|killed process)"
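The same LogQL works from a terminal via logcli, which helps when Grafana itself is the thing misbehaving. A sketch, assuming logcli is installed on the host and Loki listens on localhost:3100:

# Tail error lines across every container without opening Grafana
export LOKI_ADDR=http://localhost:3100
logcli query --limit=50 --since=1h '{job="docker"} |~ "(?i)(error|exception|fatal|panic)"'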
The Resource Optimization Playbook
Optimization 1: Scrape Interval
Default: 15s scrape interval
→ 90 containers × 12 metrics × 4 samples/min = 4,320 samples/min
→ Prometheus memory: ~900 MB
Optimized: 30s scrape interval
→ 90 containers × 12 metrics × 2 samples/min = 2,160 samples/min
→ Prometheus memory: ~500 MB
Saved: ~400 MB (44% reduction)
30-second resolution is perfectly adequate for container monitoring. If a container OOMs, you'll know within 30 seconds. That's fast enough.
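The numbers are easy to sanity-check after changing the interval. A sketch, assuming Prometheus on localhost:9090; the container name regex is an assumption, adjust it to your naming:

# Samples ingested per second after the interval change
curl -s 'localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(prometheus_tsdb_head_samples_appended_total[5m])' | jq -r '.data.result[0].value[1]'
# Prometheus's own working-set memory, as seen by cAdvisor
curl -s 'localhost:9090/api/v1/query' \
  --data-urlencode 'query=container_memory_working_set_bytes{name=~".*prometheus.*"}' | jq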
Optimization 2: Retention Limits
# Prometheus: size-based + time-based retention
--storage.tsdb.retention.time=15d
--storage.tsdb.retention.size=2GB
# Loki: 7-day retention with compaction
retention_period: 168h
15 days of metrics and 7 days of logs covers most debugging scenarios. Anything older goes to long-term storage (optional S3/MinIO backup) or simply ages out.
Optimization 3: WAL Compression
# Prometheus
--storage.tsdb.wal-compression
# Loki
ingester:
  wal:
    enabled: true
WAL compression reduces Prometheus disk I/O by 30-50% with negligible CPU overhead.
Optimization 4: cAdvisor Metric Disabling
command:
  - '--disable_metrics=percpu,sched,tcp,udp,disk,diskIO,hugetlb,referenced_memory,resctrl,process'
  - '--housekeeping_interval=30s'
  - '--docker_only=true'
  - '--store_container_labels=false'
This reduces cAdvisor's own memory from 300+ MB to under 150 MB, and reduces the metrics it exposes by 80%.
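To see the effect, count what cAdvisor actually exposes. cAdvisor only lives on the internal monitoring network here, so query it from a neighbouring container; a sketch using busybox wget from the Prometheus container (any container on that network with curl or wget works):

# Number of container_* samples cAdvisor currently exposes per scrape
docker compose exec prometheus wget -qO- http://cadvisor:8080/metrics | grep -c '^container_'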
Operational Runbook
"Something Is Slow" Investigation
# 1. Check host resources
curl -s 'localhost:9090/api/v1/query?query=node_memory_MemAvailable_bytes' | jq
curl -s 'localhost:9090/api/v1/query?query=node_load1' | jq
# 2. Find top memory consumers
curl -s 'localhost:9090/api/v1/query?query=topk(5,container_memory_working_set_bytes)' | jq '.data.result[] | {name: .metric.name, bytes: .value[1]}'
# 3. Find high-CPU containers
curl -s 'localhost:9090/api/v1/query?query=topk(5,rate(container_cpu_usage_seconds_total[5m]))' | jq
# 4. Check PostgreSQL
curl -s 'localhost:9090/api/v1/query?query=pg_stat_database_numbackends' | jq
# 5. Check Traefik for error spikes
curl -s 'localhost:9090/api/v1/query?query=sum+by+(service)(rate(traefik_service_requests_total{code=~"5.."}[5m]))' | jq
"Container Keeps Dying" Investigation
# Check if it's OOM
docker inspect <container> --format '{{.State.OOMKilled}}'
# Check recent logs
curl -s 'localhost:3100/loki/api/v1/query_range' \
--data-urlencode 'query={container_name="<name>"}' \
--data-urlencode 'limit=50' | jq '.data.result[0].values[][1]'
# Check memory trajectory (was it creeping up?)
curl -s 'localhost:9090/api/v1/query_range?query=container_memory_working_set_bytes{name="<name>"}&start=2026-03-19T00:00:00Z&end=2026-03-19T23:59:59Z&step=5m' | jq
What We've Caught
Real incidents detected by this monitoring stack:
Postiz memory leak — Memory grew linearly from 200 MB to the 512 MB limit over 3 days, then OOM. The alert fired, and we added a daily restart as a stopgap while investigating the root cause.

PostgreSQL connection exhaustion — n8n was leaking connections during workflow failures. The PostgresConnectionsHigh alert caught it at 75% before it hit the hard limit.

Loki filling disk — Before we added retention and the Promtail filters, Loki grew 2 GB/day. The DiskSpaceLow alert saved us from a full root filesystem.

Swap death spiral — During a traffic spike, swap grew to 15 GB and the system became unresponsive. We added the SwapHigh alert and now catch this before it cascades.

Silent container exits — Excalidraw's nginx process would occasionally exit cleanly (code 0). The unless-stopped restart policy didn't restart it. The ContainerRestarting alert caught it, and we switched to restart: always.
The Bottom Line
Monitoring 90+ containers on 14 GB of RAM is a constraint that forces good engineering. You can't throw Datadog at it and forget it — every metric, every log line, every alert has to earn its resource footprint.
The stack described here — Prometheus, Grafana, Loki, Promtail, cAdvisor, node-exporter, and Alertmanager — runs on 1.5 GB total while providing complete visibility into every container, every host metric, and every log line.
The key optimizations: aggressive metric filtering (keep only what you dashboard and alert on), scrape interval tuning (30s is fine), log pipeline filtering (drop health checks and debug logs), and memory limits on every monitoring container.
Don't monitor everything. Monitor what matters. And make sure your monitoring stack doesn't consume more resources than the services it's watching.