# 5 Grafana Alerts That Actually Prevent Outages
Your team gets 500 alerts a day. Maybe more. PagerDuty buzzes at 3 AM because CPU hit 91% on a box that autoscales anyway. Slack channels fill with disk usage warnings on ephemeral volumes. Your on-call engineer mutes the channel, goes back to sleep, and misses the one alert that actually mattered — the connection pool that silently filled up and brought down your payment gateway at 9 AM when traffic peaked.
This is alert fatigue, and it is the single biggest operational risk in modern infrastructure. Not the lack of monitoring — the excess of it. Research from the Cloud Native Computing Foundation shows that 95% of alerts in a typical Prometheus + Grafana stack are noise. They fire on lagging indicators, on thresholds that made sense three years ago, on metrics that tell you what already happened instead of what is about to happen.
If you are running infrastructure for a SaaS product — whether you are competing with Zoho's suite in the productivity space, building fintech rails like Razorpay, or scaling a customer platform like Freshworks — the question is not "do we have enough alerts?" It is "do we have the right five?"
Here are the five Grafana alerts that actually predict outages 15 to 30 minutes before they happen, along with the PromQL expressions, implementation YAML, and the default alerts you should delete today.
---
## Alert 1: Error Budget Burn Rate
SLO-based alerting is the single most important shift in modern operations. Instead of alerting on individual failures, you alert on the rate at which you are consuming your error budget. If your SLO is 99.9% availability over 30 days, you have a budget of 43.2 minutes of downtime. The question becomes: how fast are you burning through that budget right now?
A burn rate of 1x means you will exactly exhaust your budget by the end of the window. A burn rate of 14.4x means you will exhaust it in roughly two days; a burn rate of 6x, in five. The magic threshold for a page-worthy alert: a burn rate above 6x sustained for 5 minutes. That puts you on pace to burn the entire monthly budget within days unless someone intervenes.
PromQL Expression:
```promql
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
    /
  sum(rate(http_requests_total[5m]))
) / (1 - 0.999) > 6
```

This calculates the current error rate as a fraction of your error budget (1 - SLO target). When the ratio exceeds 6, your burn rate is dangerously high.
Why this works: Traditional error rate alerts (">1% errors") fire constantly during deploys, during traffic spikes, during test runs. Burn rate contextualizes errors against your actual reliability target. A 2% error rate during a 10-minute deploy window barely dents a monthly budget. The same 2% sustained for an hour is catastrophic.
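If burn-rate pages still flap around deploys, a common refinement is multi-window alerting in the style of the Google SRE Workbook. This is a hedged sketch, not part of the YAML later in this post: it assumes the same http_requests_total metric and requires both a short and a long window to exceed the threshold before paging.

```promql
# Sketch: two-window burn rate. The 5m window catches the spike;
# the 1h window confirms it is sustained, suppressing short
# deploy-window blips that self-resolve.
(
  (sum(rate(http_requests_total{status=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m]))) / (1 - 0.999) > 6
)
and
(
  (sum(rate(http_requests_total{status=~"5.."}[1h]))
    / sum(rate(http_requests_total[1h]))) / (1 - 0.999) > 6
)
```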
For teams operating under GDPR — and if you serve EU customers, you do — SLO breaches are not just operational incidents. Article 32 requires "appropriate technical and organisational measures" to ensure availability and resilience. Demonstrating SLO-based alerting with documented burn rate thresholds is concrete evidence of compliance during a DPA audit.
---
## Alert 2: Connection Pool Saturation
Every production outage post-mortem I have read in the last three years follows the same pattern: the application did not crash because of a bug. It crashed because it ran out of database connections. PostgreSQL's default max_connections is 100. Your connection pooler — PgBouncer, pgpool, or the built-in pool in your ORM — has a limit. When that limit is reached, new requests queue, then time out, then fail.
The critical insight: connection pool usage is a leading indicator. By the time you see "FATAL: too many connections" in your PostgreSQL logs, your users have been experiencing degraded performance for 15 to 20 minutes. The connection pool fills gradually, and the last 10% of capacity is where latency explodes.
PromQL for PostgreSQL (via postgres_exporter):
```promql
(
  sum by (instance) (pg_stat_activity_count{state!="idle"})
    /
  sum by (instance) (pg_settings_max_connections)
) > 0.75
```

PromQL for Redis:

```promql
(
  redis_connected_clients
    /
  redis_config_maxclients
) > 0.80
```

Alert when PostgreSQL active connections exceed 75% of max, or Redis clients exceed 80% of maxclients. At these thresholds, you have a 15 to 20 minute window to act — scale horizontally, kill idle connections, or identify the runaway query.
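When the pool alert fires, the usual culprit is connections parked inside an open transaction. A hedged companion query, assuming the same postgres_exporter metric as above (its state label mirrors the pg_stat_activity.state values):

```promql
# Connections stuck "idle in transaction" hold pool slots and locks
# without doing work; more than a handful usually points at application
# code that opened a transaction and never committed. The threshold of
# 5 is illustrative -- tune it to your baseline.
sum by (instance, datname) (
  pg_stat_activity_count{state="idle in transaction"}
) > 5
```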
Why this matters for your stack: If you are using Razorpay's payment gateway or similar transactional APIs, a connection pool exhaustion on your side means failed payment callbacks. Your customers see "payment failed" while Razorpay's webhook retries pile up. You do not just lose the transaction — you lose trust. Connection pool monitoring is the cheapest insurance against payment infrastructure failures.
---
## Alert 3: Disk I/O Latency Spike
Everyone monitors disk usage. Disk usage is a lagging indicator — by the time your disk is 90% full, you have been degraded for days. The leading indicator is I/O latency: how long each read or write operation takes to complete.
When I/O latency spikes, it means the kernel's I/O scheduler is overwhelmed. This happens before OOM kills, before swap thrashing, before the "disk full" alert. On cloud volumes with burst I/O credits (AWS gp2, Azure Premium SSD), a latency spike often means you have exhausted your burst allowance and are now running at baseline IOPS. Your application does not know this — it just gets slower.
PromQL (via node_exporter):
```promql
rate(node_disk_read_time_seconds_total{device!~"dm-.*"}[5m])
  /
rate(node_disk_reads_completed_total{device!~"dm-.*"}[5m])
> 0.05
```

This calculates the average read latency per operation. When it exceeds 50ms (0.05 seconds), something is seriously wrong. Normal SSD latency is under 1ms. Even under heavy load, anything above 10ms warrants investigation. At 50ms, your database queries are already timing out.
Write latency variant:
```promql
rate(node_disk_write_time_seconds_total{device!~"dm-.*"}[5m])
  /
rate(node_disk_writes_completed_total{device!~"dm-.*"}[5m])
> 0.1
```

Write latency tolerance is slightly higher (100ms threshold) because writes are often batched and fsync'd. But sustained write latency above 100ms means your WAL (write-ahead log) is falling behind, and data durability is at risk.
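A useful companion signal from the same exporter is device utilization: the fraction of each second the disk spent busy. A minimal sketch (the 0.9 threshold is illustrative):

```promql
# node_disk_io_time_seconds_total counts seconds spent doing I/O, so its
# rate is the device's busy fraction (0 to 1). Sustained values above
# 0.9 mean the disk is saturated even if per-operation latency has not
# yet exploded.
rate(node_disk_io_time_seconds_total{device!~"dm-.*"}[5m]) > 0.9
```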
Why not disk usage? A 500GB disk at 45% usage with 200ms I/O latency is a ticking bomb. The same disk at 85% usage with 0.5ms latency is fine — you have time to expand. Latency tells you about the current health of the I/O subsystem; usage tells you about future capacity. You need both, but only latency is a leading indicator of imminent outage.
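If you do want a capacity alert, make it trend-based like Alert 4 rather than a static percentage. A minimal sketch using node_exporter's filesystem metrics (the fstype filter is an assumption — adjust it to your fleet):

```promql
# Fires only if the current 6-hour growth trend would fill the
# filesystem within 24 hours. Whether the disk is 45% or 85% full is
# irrelevant; time-to-full is what matters.
predict_linear(
  node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h],
  24 * 3600
) < 0
```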
---
## Alert 4: Memory Pressure Trend
Current memory usage is nearly useless as an alert metric. A Java application sitting at 85% heap usage might be perfectly healthy — the garbage collector will reclaim memory when needed. The same application at 60% but growing at 3% per hour has a memory leak and will OOM in 13 hours.
The correct metric is the rate of change. If available memory is decreasing linearly, you can predict exactly when you will hit zero and alert accordingly.
PromQL Expression:
```promql
predict_linear(
  node_memory_MemAvailable_bytes[1h], 4 * 3600
) < 0
```

This uses Prometheus's predict_linear function to project the current memory trend 4 hours into the future. If the projection says you will have zero available memory within 4 hours, the alert fires. This gives your team a comfortable window to investigate and remediate — restart a leaking process, trigger a deployment rollback, or scale the instance.
For container environments:
```promql
predict_linear(
  container_memory_working_set_bytes{container!="POD", container!=""}[1h],
  2 * 3600
) > on(namespace, pod, container)
  kube_pod_container_resource_limits{resource="memory"}
```

This predicts whether a container's working set will exceed its memory limit within 2 hours. (The join matches on namespace, pod, and container; matching on container and namespace alone goes many-to-many as soon as two pods in a namespace run same-named containers.) In Kubernetes, exceeding the memory limit means an OOM kill — no graceful shutdown, no connection draining, just a dead pod. Predicting this 2 hours ahead lets you adjust limits, fix the leak, or at least prepare for the restart.
The growth rate calculation:
```promql
deriv(node_memory_MemAvailable_bytes[30m]) < (-50 * 1024 * 1024 / 1800)
```

deriv returns a per-second slope, so the threshold expresses "50MB lost per 30 minutes" in bytes per second (about -29KB/s). This fires when available memory is falling faster than 50MB per 30 minutes. Adjust the threshold based on your instance size — 50MB/30min is aggressive for a 64GB server but critical for a 4GB instance.
---
## Alert 5: TLS Certificate Expiry
This sounds mundane until your wildcard certificate expires at 2 AM on a Saturday and every customer sees "Your connection is not private" in their browser. Let's Encrypt certificates expire every 90 days. Automated renewal usually works — until it does not. A DNS provider API change, a firewall rule update, a misconfigured certbot cron — any of these silently break renewal, and you will not know until the certificate expires.
PromQL (via blackbox_exporter or ssl_exporter):
```promql
(probe_ssl_earliest_cert_expiry - time()) / 86400 < 14
```

This calculates days until certificate expiry. Alert at 14 days (warning), 7 days (high), and 3 days (critical). Three escalating thresholds give your team multiple chances to catch renewal failures before they become customer-facing incidents.
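probe_ssl_earliest_cert_expiry does not exist until something probes your endpoints. A minimal Prometheus scrape job for blackbox_exporter might look like this sketch (the exporter address and target hostname are placeholders):

```yaml
scrape_configs:
  - job_name: "blackbox-tls"
    metrics_path: /probe
    params:
      module: [http_2xx]          # any module that completes a TLS handshake
    static_configs:
      - targets:
          - https://example.com   # endpoints whose certs you care about
    relabel_configs:
      # Standard blackbox pattern: the target becomes a URL parameter,
      # and the scrape itself goes to the exporter.
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
```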
Multi-tier alert configuration:
```promql
# Warning - 14 days
(probe_ssl_earliest_cert_expiry - time()) / 86400 < 14

# High - 7 days
(probe_ssl_earliest_cert_expiry - time()) / 86400 < 7

# Critical - 3 days
(probe_ssl_earliest_cert_expiry - time()) / 86400 < 3
```

GDPR and compliance angle: For EU-facing services, an expired TLS certificate is not just a UX problem. Under GDPR Article 32(1)(b), you are required to ensure "the ongoing confidentiality of processing systems and services." An expired certificate means traffic could be intercepted. This is a reportable security incident if it affects personal data in transit. The cost of a 14-day-ahead alert is zero. The cost of a GDPR breach notification is not.
---
## Implementation: Grafana Alert Rules YAML
Here is the complete Grafana provisioning configuration for all five alerts. Drop this into your provisioning/alerting/ directory:
```yaml
apiVersion: 1
groups:
  - orgId: 1
    name: proactive-outage-prevention
    folder: Infrastructure
    interval: 60s
    rules:
      - uid: error-budget-burn
        title: "Error Budget Burn Rate Critical"
        condition: burn_rate
        data:
          - refId: burn_rate
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: prometheus
            model:
              expr: |
                (sum(rate(http_requests_total{status=~"5.."}[5m]))
                  / sum(rate(http_requests_total[5m])))
                / (1 - 0.999) > 6
              intervalMs: 15000
              maxDataPoints: 43200
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Error budget burn rate exceeds 6x - SLO breach imminent"
          runbook_url: "https://wiki.internal/runbooks/slo-breach"

      - uid: conn-pool-postgres
        title: "PostgreSQL Connection Pool > 75%"
        condition: pg_pool
        data:
          - refId: pg_pool
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: prometheus
            model:
              expr: |
                (sum by (instance) (pg_stat_activity_count{state!="idle"})
                  / sum by (instance) (pg_settings_max_connections)) > 0.75
              intervalMs: 15000
              maxDataPoints: 43200
        for: 3m
        labels:
          severity: warning
          team: database
        annotations:
          summary: "PostgreSQL active connections at {{ $value | humanizePercentage }}"

      - uid: disk-io-latency
        title: "Disk I/O Read Latency > 50ms"
        condition: io_lat
        data:
          - refId: io_lat
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: prometheus
            model:
              expr: |
                rate(node_disk_read_time_seconds_total{device!~"dm-.*"}[5m])
                  / rate(node_disk_reads_completed_total{device!~"dm-.*"}[5m]) > 0.05
              intervalMs: 15000
              maxDataPoints: 43200
        for: 5m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "Disk read latency elevated on {{ $labels.instance }}"

      - uid: memory-pressure
        title: "Memory Exhaustion Predicted Within 4h"
        condition: mem_predict
        data:
          - refId: mem_predict
            relativeTimeRange:
              from: 3600
              to: 0
            datasourceUid: prometheus
            model:
              expr: |
                predict_linear(node_memory_MemAvailable_bytes[1h], 4 * 3600) < 0
              intervalMs: 60000
              maxDataPoints: 43200
        for: 10m
        labels:
          severity: critical
          team: infrastructure
        annotations:
          summary: "Memory on {{ $labels.instance }} projected to exhaust within 4 hours"

      - uid: tls-expiry-warning
        title: "TLS Certificate Expiring Within 14 Days"
        condition: tls_exp
        data:
          - refId: tls_exp
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: prometheus
            model:
              expr: |
                (probe_ssl_earliest_cert_expiry - time()) / 86400 < 14
              intervalMs: 60000
              maxDataPoints: 43200
        for: 5m
        labels:
          severity: warning
          team: security
        annotations:
          summary: "TLS cert for {{ $labels.instance }} expires in {{ $value | humanize }} days"

      - uid: tls-expiry-critical
        title: "TLS Certificate Expiring Within 3 Days"
        condition: tls_crit
        data:
          - refId: tls_crit
            relativeTimeRange:
              from: 300
              to: 0
            datasourceUid: prometheus
            model:
              expr: |
                (probe_ssl_earliest_cert_expiry - time()) / 86400 < 3
              intervalMs: 60000
              maxDataPoints: 43200
        for: 5m
        labels:
          severity: critical
          team: security
        annotations:
          summary: "URGENT: TLS cert for {{ $labels.instance }} expires in {{ $value | humanize }} days"
```

---
## What to Remove: 10 Default Alerts That Cause Fatigue
These are the alerts most teams inherit from starter dashboards, Helm chart defaults, or "best practice" templates. They fire constantly and prevent nothing:
1. CPU > 90% for 5 minutes — On autoscaling infrastructure, this is normal. On single-instance workloads, CPU spikes are transient and self-resolving 90% of the time.
2. Disk usage > 80% — Tells you nothing about urgency. 80% on a 2TB volume means 400GB free. Alert on growth rate instead.
3. Memory usage > 85% — Java, Go, and .NET all aggressively use available memory. This fires constantly on healthy JVMs.
4. Pod restart count > 0 — CrashLoopBackOff is visible in your deployment dashboard. A single restart from a liveness probe is not an incident.
5. HTTP 5xx count > 0 — A single 5xx from a health check timeout is not an outage. Use error budget burn rate instead.
6. Container CPU throttled — If you set CPU limits (which you should reconsider), throttling is expected behavior, not an incident.
7. Node not ready — In a cluster with autoscaling node pools, nodes come and go. Alert on sustained unavailability, not transient states.
8. etcd leader changes — In a healthy cluster, leader elections happen. Alert on election failures or sustained leaderlessness, not the election itself.
9. Ingress 4xx rate > 10% — Bots, scanners, and misconfigured clients generate 404s and 401s constantly. This is internet background noise.
10. Deployment replicas mismatch — During rolling updates, replica counts intentionally mismatch. Alert only if the mismatch persists beyond the rollout timeout.
The rule: If an alert fired more than 20 times in the last 30 days and was never actioned, delete it. It is training your team to ignore alerts.
---
## Common Mistakes
1. Alerting on symptoms instead of causes. "Response time > 2s" tells you something is slow. It does not tell you what is slow or why. Your five core alerts target the root causes — resource exhaustion, error budget depletion, certificate expiry. Symptom-based alerts belong in dashboards, not in PagerDuty.
2. Setting thresholds without historical analysis. A 75% connection pool threshold assumes your normal baseline is well below that. If your application normally runs at 70% pool utilization, you will get constant alerts. Before setting any threshold, query at least 30 days of historical data (for example, quantile_over_time(0.99, your_metric[30d])) and set your alert threshold at that 99th percentile of normal operation plus a 15% buffer.
3. No runbook linked to the alert. An alert without a runbook is a notification, not an alert. Every alert rule in the YAML above includes a runbook_url annotation. When the alert fires at 3 AM, the on-call engineer should not have to think — they should follow the documented remediation steps. Invest the 30 minutes to write the runbook when you create the alert, not during an incident.
4. Identical thresholds across environments. Your staging environment has 4GB RAM and 2 CPUs. Your production environment has 64GB and 16 CPUs. The predict_linear memory alert works the same way in both, but connection pool thresholds, I/O latency baselines, and error budget windows need to be environment-specific. Use Prometheus label selectors or Grafana alert rule templates to differentiate.
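As a concrete sketch of that differentiation, assuming an env label is attached at scrape time (and with illustrative thresholds), the connection pool alert from Alert 2 splits into two rules:

```promql
# Production: page early -- capacity is expensive to add under load.
(sum by (instance) (pg_stat_activity_count{env="prod", state!="idle"})
  / sum by (instance) (pg_settings_max_connections{env="prod"})) > 0.75

# Staging: warn only, and tolerate a higher watermark.
(sum by (instance) (pg_stat_activity_count{env="staging", state!="idle"})
  / sum by (instance) (pg_settings_max_connections{env="staging"})) > 0.90
```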
---
## FAQ
Q: We use Datadog/New Relic instead of Grafana. Do these alerts still apply?
The concepts — burn rate, connection saturation, I/O latency, memory prediction, cert expiry — are universal. The PromQL expressions translate readily to Datadog's query syntax or New Relic's NRQL, and both platforms offer forecasting functions comparable to predict_linear (Datadog's forecast, for example). The alert philosophy matters more than the syntax. If you are evaluating a move to open-source observability to reduce costs, Grafana with Prometheus and Loki gives you all of this at infrastructure cost only — no per-host licensing.
Q: How do we handle alert routing for a team spread across EU, India, and the Middle East?
Grafana handles this with notification policies combined with mute timings (or Grafana OnCall schedules for a full follow-the-sun rotation). Configure timezone-aware coverage: EU on-call handles alerts from 06:00-14:00 UTC, India from 14:00-22:00 UTC, and your ME team covers the overnight window. For GDPR-specific alerts (like TLS expiry on EU-facing services), route exclusively to your EU team regardless of time — they have the regulatory context to assess impact. Use Grafana's label matchers to route team: security alerts to the appropriate regional security lead.
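A hedged sketch of what that routing looks like as a provisioning file (the contact point names are placeholders; Grafana matches on the same labels the alert rules above attach):

```yaml
apiVersion: 1
policies:
  - orgId: 1
    receiver: default-oncall          # fallback contact point
    routes:
      # Security alerts (e.g. TLS expiry) always go to the EU lead,
      # regardless of time of day.
      - receiver: eu-security-lead
        object_matchers:
          - ["team", "=", "security"]
      # Everything critical pages the follow-the-sun rotation.
      - receiver: platform-oncall
        object_matchers:
          - ["severity", "=", "critical"]
```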
Q: We are a small team (3-5 engineers) and cannot maintain complex alerting. Where do we start?
Start with Alert 5 (TLS expiry) and Alert 1 (error budget burn rate). These two cover the highest-impact failure modes — complete service unavailability from expired certs, and gradual degradation from accumulating errors. They require minimal tuning (certificate expiry is binary, and your SLO target is a business decision, not a technical one). Add the remaining three as you build operational maturity. The Grafana provisioning YAML above is copy-paste ready — deploy it, set your notification channel, and iterate.
---
## Build Alerting That Prevents Instead of Reports
The difference between a mature operations team and a reactive one is not the number of alerts — it is the predictive power of each alert. Five well-tuned alerts that fire 15 minutes before an outage are worth more than 500 that fire 15 minutes after.
If your team is struggling with alert fatigue, noisy dashboards, or incident response that always feels like it starts too late, [we help engineering teams design and implement observability stacks that actually prevent outages](https://www.techsaas.cloud/services/). From Prometheus and Grafana architecture to SLO framework design, our approach is built on the same principles outlined in this post — leading indicators, predictive alerting, and zero-noise operations.
Subscribe to our newsletter for weekly deep-dives into infrastructure, observability, and platform engineering for scaling teams.
Need hands-on help?
TechSaaS provides expert consulting and managed services for cloud infrastructure, DevOps, and AI/ML operations.