
Alerting

Centralized alerting architecture using Grafana as the single source of truth.


Overview

All alerting flows through Grafana, which has unified access to:

  • Prometheus - Metrics (CPU, memory, containers)
  • Loki - Logs (errors, exceptions, patterns)
  • Tempo - Traces (latency, errors)

Alerts are sent to Vidar for incident management and SRE agent automation.


Architecture

┌──────────────────────────────────────────────────────────┐
│                         Grafana                          │
│  ┌───────────┐    ┌───────────┐    ┌───────────┐         │
│  │ Prometheus│    │   Loki    │    │   Tempo   │         │
│  │  Metrics  │    │   Logs    │    │  Traces   │         │
│  └─────┬─────┘    └─────┬─────┘    └─────┬─────┘         │
│        └────────────────┼────────────────┘               │
│                         ▼                                │
│                 Grafana Alerting                         │
│              (Unified Rules Engine)                      │
└─────────────────────────┬────────────────────────────────┘
                          │ Webhook
                          ▼
┌──────────────────────────────────────────────────────────┐
│                          Vidar                           │
│  ┌─────────────┐   ┌─────────────┐   ┌───────────────┐   │
│  │   Alerts    │   │  Incidents  │   │   SRE Agent   │   │
│  │   Engine    │   │   Tracker   │   │  (Auto-fix)   │   │
│  └─────────────┘   └─────────────┘   └───────────────┘   │
└──────────────────────────────────────────────────────────┘

Configuration

Contact Point

Grafana sends alerts to Vidar via webhook. The {source_id} is a UUID from Vidar's aiops_alert_sources table.

Current source_id: 204873fc-cd07-47c7-b74f-be61cfbb9584

File: services/grafana/provisioning/alerting/contactpoints.yml

apiVersion: 1

contactPoints:
  - orgId: 1
    name: vidar
    receivers:
      - uid: vidar-webhook
        type: webhook
        settings:
          url: http://vidar-api:8000/webhooks/grafana/{source_id}
          httpMethod: POST
        disableResolveMessage: false

policies:
  - orgId: 1
    receiver: vidar
    group_by:
      - grafana_folder
      - alertname
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 4h

Alert Rules

Alert rules are provisioned from:

File: services/grafana/provisioning/alerting/rules.yml

See Alert Rule Catalog below.


Alert Rule Catalog

Host Alerts (Infrastructure folder)

Alert                  Condition             Severity   Runbook
Host High CPU          CPU > 80% for 5m      Warning    Host-High-CPU
Host Critical CPU      CPU > 95% for 2m      Critical   Host-High-CPU
Host High Memory       Memory > 85% for 5m   Warning    Host-High-Memory
Host Critical Memory   Memory > 95% for 2m   Critical   Host-High-Memory
Host High Disk         Disk > 80% for 5m     Warning    Host-High-Disk
Host Critical Disk     Disk > 90% for 2m     Critical   Host-High-Disk

Container Alerts

Alert                    Condition               Severity
Container Down           Container not running   Critical
Container High CPU       CPU > 80% for 5m        Warning
Container High Memory    Memory > 80% for 5m     Warning
Container Restart Loop   > 5 restarts in 10m     Warning

Service Alerts

Alert                 Condition                    Severity
Service Unavailable   Health check failed for 2m   Critical
High Error Rate       5xx > 5% for 5m              Warning
High Latency          P99 > 2s for 5m              Warning

Creating New Alerts

Via Grafana UI

  1. Go to Alerting → Alert rules
  2. Click + New alert rule
  3. Configure query using Prometheus, Loki, or Tempo
  4. Set threshold condition
  5. Add labels: severity: warning or severity: critical
  6. Add annotations:
    • summary: Brief description
    • description: Detailed info with {{ $labels.xxx }}
    • runbook_url: Link to runbook
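
Once the rule is saved, you can confirm it is registered through Grafana's alert provisioning HTTP API (available in Grafana 9+). The sketch below is an optional illustration, not part of the stack's tooling; it assumes a service account token exported as GRAFANA_TOKEN and uses the Grafana host that appears elsewhere in this document.

# Lists alert rules via Grafana's alert provisioning API.
# Assumes a service-account token with alerting read access in GRAFANA_TOKEN;
# adjust the host if yours differs.
import json
import os
import urllib.request

req = urllib.request.Request(
    "https://grafana.ravenhelm.dev/api/v1/provisioning/alert-rules",
    headers={"Authorization": f"Bearer {os.environ['GRAFANA_TOKEN']}"},
)
with urllib.request.urlopen(req) as resp:
    rules = json.load(resp)

for rule in rules:
    print(rule.get("uid"), rule.get("title"), rule.get("labels", {}).get("severity"))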

Via Provisioning

Add to services/grafana/provisioning/alerting/rules.yml:

groups:
  - orgId: 1
    name: My Alert Group
    folder: Infrastructure
    interval: 1m
    rules:
      - uid: my-alert-id
        title: My Alert Name
        condition: C
        data:
          - refId: A
            datasourceUid: prometheus
            model:
              expr: my_metric > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Alert summary"
          runbook_url: "https://..."
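
Before committing changes to rules.yml, a quick local sanity check can catch missing fields. This is an optional helper sketch, not part of the repository; it assumes PyYAML is installed and only checks the fields used in the example above.

# Sanity-checks provisioned alert rules for the fields used above.
# Optional helper sketch; assumes PyYAML (pip install pyyaml).
import yaml

with open("services/grafana/provisioning/alerting/rules.yml") as f:
    doc = yaml.safe_load(f)

for group in doc.get("groups", []):
    for rule in group.get("rules", []):
        title = rule.get("title", "<untitled>")
        missing = [k for k in ("uid", "title", "condition", "data") if k not in rule]
        if missing:
            print(f"{group.get('name')}/{title}: missing {missing}")
        if rule.get("labels", {}).get("severity") not in ("warning", "critical"):
            print(f"{title}: severity should be 'warning' or 'critical'")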

Vidar Integration

Webhook Payload

Grafana sends alerts in this format:

{
  "receiver": "vidar",
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "HostHighCpuUsage",
        "severity": "warning",
        "instance": "node-exporter:9100"
      },
      "annotations": {
        "summary": "High CPU usage on node-exporter:9100",
        "runbook_url": "https://..."
      },
      "startsAt": "2024-01-05T10:00:00Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "https://grafana.ravenhelm.dev/..."
    }
  ]
}
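
For reference, the sketch below shows how a receiver could parse this payload. It is a minimal illustration assuming a FastAPI-style service; it is not Vidar's actual implementation.

# Minimal webhook receiver for the payload format above.
# Illustrative only; this is not Vidar's actual implementation.
from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/webhooks/grafana/{source_id}")
async def grafana_webhook(source_id: str, request: Request):
    payload = await request.json()
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        print(
            f"[{alert.get('status')}] {labels.get('alertname')} "
            f"(severity={labels.get('severity')}) on {labels.get('instance')}: "
            f"{annotations.get('summary')}"
        )
    return {"received": len(payload.get("alerts", []))}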

Vidar Actions

When alerts arrive, Vidar:

  1. Creates/updates alert instances
  2. Creates incident if new critical alert
  3. Spawns SRE agent for automated diagnosis
  4. Agent can auto-remediate using runbooks
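
The logic is roughly the following. This is a simplified, hypothetical sketch; function and field names are illustrative and do not reflect Vidar's internals.

# Simplified, hypothetical sketch of the alert-handling flow above.
# Names are illustrative only and do not reflect Vidar's internals.
def handle_alerts(payload: dict, open_incidents: set) -> list:
    actions = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        name = labels.get("alertname", "unknown")
        actions.append(f"upsert alert instance: {name}")              # 1. create/update
        if labels.get("severity") == "critical" and name not in open_incidents:
            open_incidents.add(name)
            actions.append(f"create incident: {name}")                # 2. new critical alert
            actions.append(f"spawn SRE agent for: {name}")            # 3. automated diagnosis
            runbook = alert.get("annotations", {}).get("runbook_url")
            if runbook:
                actions.append(f"attempt remediation via {runbook}")  # 4. runbook auto-fix
    return actions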

Troubleshooting

Alerts Not Reaching Vidar

  1. Check Grafana contact point:

    docker logs grafana 2>&1 | grep -i alert
  2. Test webhook manually (use the source_id from above; a Python alternative follows this list):

    curl -X POST http://vidar-api:8000/webhooks/grafana/204873fc-cd07-47c7-b74f-be61cfbb9584 \
    -H "Content-Type: application/json" \
    -d '{"status":"firing","alerts":[...]}'
  3. Check Vidar logs:

    docker logs vidar-api 2>&1 | grep -i grafana
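
If you prefer a complete payload over the abbreviated curl body, the script below sends a minimal test alert in the documented format using only the Python standard library. It uses the source_id shown above and assumes the Vidar API is reachable at http://vidar-api:8000 (true inside the Docker network; adjust the host otherwise).

# Sends a minimal test alert in the documented Grafana webhook format.
# Assumes the Vidar webhook is reachable at this URL (inside the Docker
# network it is http://vidar-api:8000; adjust the host otherwise).
import json
import urllib.request

SOURCE_ID = "204873fc-cd07-47c7-b74f-be61cfbb9584"
URL = f"http://vidar-api:8000/webhooks/grafana/{SOURCE_ID}"

payload = {
    "receiver": "vidar",
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "HostHighCpuUsage",
                "severity": "warning",
                "instance": "node-exporter:9100",
            },
            "annotations": {"summary": "Test alert from troubleshooting script"},
            "startsAt": "2024-01-05T10:00:00Z",
            "endsAt": "0001-01-01T00:00:00Z",
        }
    ],
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())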

Alert Not Firing

  1. Check Grafana alerting state:

    • Go to Alerting → Alert rules
    • Check rule status (Normal, Pending, Firing)
  2. Verify data source returns data:

    • Go to Explore
    • Run the alert query manually

Related Documentation

  • Vidar - SRE automation platform
  • Grafana - Visualization platform
  • Runbooks - Operational procedures