# Alerting

Centralized alerting architecture using Grafana as the single source of truth.
## Overview

All alerting flows through Grafana, which has unified access to:

- Prometheus - Metrics (CPU, memory, containers)
- Loki - Logs (errors, exceptions, patterns)
- Tempo - Traces (latency, errors)

Alerts are sent to Vidar for incident management and SRE agent automation.
## Architecture

```
┌─────────────────────────────────────────────────────────┐
│                         Grafana                         │
│  ┌───────────┐   ┌───────────┐   ┌───────────┐          │
│  │ Prometheus│   │   Loki    │   │   Tempo   │          │
│  │  Metrics  │   │   Logs    │   │  Traces   │          │
│  └─────┬─────┘   └─────┬─────┘   └─────┬─────┘          │
│        └───────────────┼───────────────┘                │
│                        ▼                                │
│                 Grafana Alerting                        │
│              (Unified Rules Engine)                     │
└────────────────────────┬────────────────────────────────┘
                         │ Webhook
                         ▼
┌───────────────────────────────────────────────────────┐
│                         Vidar                         │
│  ┌─────────────┐  ┌─────────────┐  ┌───────────────┐  │
│  │   Alerts    │  │  Incidents  │  │   SRE Agent   │  │
│  │   Engine    │  │   Tracker   │  │  (Auto-fix)   │  │
│  └─────────────┘  └─────────────┘  └───────────────┘  │
└───────────────────────────────────────────────────────┘
```
## Configuration

### Contact Point

Grafana sends alerts to Vidar via webhook. The `{source_id}` in the webhook URL is a UUID from Vidar's `aiops_alert_sources` table.

Current `source_id`: `204873fc-cd07-47c7-b74f-be61cfbb9584`
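To confirm the value (or find the `source_id` for another alert source), it can be read from that table. The command below is a minimal sketch: the `vidar-db` container name, database name, user, and column names are assumptions and may differ in your deployment.

```bash
# Hypothetical lookup of webhook source IDs in Vidar's database.
# Container name, database, user, and columns are assumptions, not documented values.
docker exec vidar-db \
  psql -U vidar -d vidar -c "SELECT id, name FROM aiops_alert_sources;"
```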
File: `services/grafana/provisioning/alerting/contactpoints.yml`
```yaml
apiVersion: 1

contactPoints:
  - orgId: 1
    name: vidar
    receivers:
      - uid: vidar-webhook
        type: webhook
        settings:
          url: http://vidar-api:8000/webhooks/grafana/{source_id}
          httpMethod: POST
        disableResolveMessage: false

policies:
  - orgId: 1
    receiver: vidar
    group_by:
      - grafana_folder
      - alertname
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 4h
```
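To verify that the contact point was actually picked up, Grafana's alerting provisioning API can be queried. This is a sketch: it assumes Grafana is reachable at `grafana:3000` on the Docker network and that admin basic-auth credentials are available in `GRAFANA_ADMIN_PASSWORD`.

```bash
# List provisioned contact points and filter for the vidar webhook receiver.
# Hostname and credential handling are assumptions for this environment.
curl -s -u "admin:${GRAFANA_ADMIN_PASSWORD}" \
  http://grafana:3000/api/v1/provisioning/contact-points \
  | jq '.[] | select(.name == "vidar")'
```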
### Alert Rules

Alert rules are provisioned from:

File: `services/grafana/provisioning/alerting/rules.yml`

See the Alert Rule Catalog below.
## Alert Rule Catalog

### Host Alerts (Infrastructure folder)
| Alert | Condition | Severity | Runbook |
|---|---|---|---|
| Host High CPU | CPU > 80% for 5m | Warning | Host-High-CPU |
| Host Critical CPU | CPU > 95% for 2m | Critical | Host-High-CPU |
| Host High Memory | Memory > 85% for 5m | Warning | Host-High-Memory |
| Host Critical Memory | Memory > 95% for 2m | Critical | Host-High-Memory |
| Host High Disk | Disk > 80% for 5m | Warning | Host-High-Disk |
| Host Critical Disk | Disk > 90% for 2m | Critical | Host-High-Disk |
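These thresholds map onto standard node_exporter metrics. As a quick sanity check, a condition such as Host High CPU can be evaluated directly against Prometheus; the query below is a sketch that assumes node_exporter's `node_cpu_seconds_total` metric and a Prometheus instance reachable at `prometheus:9090`.

```bash
# Ad hoc evaluation of the "CPU > 80%" condition via the Prometheus HTTP API.
# Metric name assumes node_exporter; the Prometheus hostname is an assumption.
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode \
  'query=100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80'
```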
### Container Alerts
| Alert | Condition | Severity |
|---|---|---|
| Container Down | Container not running | Critical |
| Container High CPU | CPU > 80% for 5m | Warning |
| Container High Memory | Memory > 80% for 5m | Warning |
| Container Restart Loop | > 5 restarts in 10m | Warning |
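The container conditions can be spot-checked the same way, assuming cAdvisor metrics are scraped into the same Prometheus instance. The expression below sketches the Container High Memory check; containers without a memory limit report a limit of 0 and should be filtered out if they show up as `+Inf`.

```bash
# Ad hoc check of the "Container Memory > 80%" condition against cAdvisor metrics (assumed present).
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode \
  'query=(container_memory_usage_bytes{name!=""} / container_spec_memory_limit_bytes{name!=""}) * 100 > 80'
```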
### Service Alerts
| Alert | Condition | Severity |
|---|---|---|
| Service Unavailable | Health check failed for 2m | Critical |
| High Error Rate | 5xx > 5% for 5m | Warning |
| High Latency | P99 > 2s for 5m | Warning |
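For High Latency, P99 is typically computed from a request-duration histogram (or an equivalent TraceQL query against Tempo). The expression below is a sketch; `http_request_duration_seconds_bucket` is a placeholder metric name to be replaced with whatever histogram your services actually export.

```bash
# Ad hoc check of the "P99 > 2s" condition; the histogram metric name is a placeholder.
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode \
  'query=histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2'
```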
## Creating New Alerts

### Via Grafana UI

- Go to Alerting → Alert rules
- Click + New alert rule
- Configure the query using Prometheus, Loki, or Tempo
- Set the threshold condition
- Add labels:
  - `severity: warning` or `severity: critical`
- Add annotations:
  - `summary`: Brief description
  - `description`: Detailed info with `{{ $labels.xxx }}`
  - `runbook_url`: Link to runbook
### Via Provisioning

Add to `services/grafana/provisioning/alerting/rules.yml`:
```yaml
groups:
  - orgId: 1
    name: My Alert Group
    folder: Infrastructure
    interval: 1m
    rules:
      - uid: my-alert-id
        title: My Alert Name
        condition: C  # refId of the condition query; the reduce/threshold expressions (B, C) are omitted here for brevity
        data:
          - refId: A
            datasourceUid: prometheus
            model:
              expr: my_metric > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Alert summary"
          runbook_url: "https://..."
```
## Vidar Integration

### Webhook Payload

Grafana sends alerts in this format:
```json
{
  "receiver": "vidar",
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "HostHighCpuUsage",
        "severity": "warning",
        "instance": "node-exporter:9100"
      },
      "annotations": {
        "summary": "High CPU usage on node-exporter:9100",
        "runbook_url": "https://..."
      },
      "startsAt": "2024-01-05T10:00:00Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "https://grafana.ravenhelm.dev/..."
    }
  ]
}
```
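For end-to-end testing, the same payload can be posted directly to the Vidar webhook, exactly as Grafana would on a firing alert. A minimal example using the documented `source_id` and the sample payload above (it assumes you are on the same Docker network as `vidar-api`):

```bash
# Send a minimal firing alert to Vidar in the Grafana webhook format.
curl -s -X POST \
  "http://vidar-api:8000/webhooks/grafana/204873fc-cd07-47c7-b74f-be61cfbb9584" \
  -H "Content-Type: application/json" \
  -d '{
    "receiver": "vidar",
    "status": "firing",
    "alerts": [{
      "status": "firing",
      "labels": {"alertname": "HostHighCpuUsage", "severity": "warning", "instance": "node-exporter:9100"},
      "annotations": {"summary": "High CPU usage on node-exporter:9100"},
      "startsAt": "2024-01-05T10:00:00Z",
      "endsAt": "0001-01-01T00:00:00Z"
    }]
  }'
```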
### Vidar Actions

When alerts arrive, Vidar:

- Creates or updates alert instances
- Creates an incident if the alert is new and critical
- Spawns an SRE agent for automated diagnosis
- The agent can auto-remediate using runbooks
## Troubleshooting

### Alerts Not Reaching Vidar

- Check the Grafana contact point:

  ```bash
  docker logs grafana 2>&1 | grep -i alert
  ```

- Test the webhook manually (use the `source_id` from above):

  ```bash
  curl -X POST http://vidar-api:8000/webhooks/grafana/204873fc-cd07-47c7-b74f-be61cfbb9584 \
    -H "Content-Type: application/json" \
    -d '{"status":"firing","alerts":[...]}'
  ```

- Check the Vidar logs:

  ```bash
  docker logs vidar-api 2>&1 | grep -i grafana
  ```
### Alert Not Firing

- Check the Grafana alerting state:
  - Go to Alerting → Alert rules
  - Check the rule status (Normal, Pending, Firing)
- Verify the data source returns data:
  - Go to Explore
  - Run the alert query manually