# Alerting

Centralized alerting architecture using Grafana as the single source of truth.
## Overview

All alerting flows through Grafana, which has unified access to:

- Prometheus - Metrics (CPU, memory, containers)
- Loki - Logs (errors, exceptions, patterns)
- Tempo - Traces (latency, errors)

Alerts are sent to Vidar for incident management and SRE agent automation.
## Architecture

```
┌─────────────────────────────────────────────────────────┐
│                         Grafana                         │
│  ┌───────────┐   ┌───────────┐   ┌───────────┐          │
│  │ Prometheus│   │   Loki    │   │   Tempo   │          │
│  │  Metrics  │   │   Logs    │   │  Traces   │          │
│  └─────┬─────┘   └─────┬─────┘   └─────┬─────┘          │
│        └───────────────┼───────────────┘                │
│                        ▼                                │
│                 Grafana Alerting                        │
│              (Unified Rules Engine)                     │
└────────────────────────┬────────────────────────────────┘
                         │ Webhook
                         ▼
┌───────────────────────────────────────────────────────┐
│                         Vidar                         │
│  ┌─────────────┐  ┌─────────────┐  ┌───────────────┐  │
│  │   Alerts    │  │  Incidents  │  │   SRE Agent   │  │
│  │   Engine    │  │   Tracker   │  │  (Auto-fix)   │  │
│  └─────────────┘  └─────────────┘  └───────────────┘  │
└───────────────────────────────────────────────────────┘
```
## Configuration

### Contact Point

Grafana sends alerts to Vidar via webhook. The `{source_id}` in the webhook URL is a UUID from Vidar's `aiops_alert_sources` table.

Current `source_id`: `204873fc-cd07-47c7-b74f-be61cfbb9584`
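To confirm the value (or find the `source_id` for another alert source), it can be read from that table. The command below is a minimal sketch: the `vidar-db` container name, database name, user, and column names are assumptions and may differ in your deployment.

```bash
# Hypothetical lookup of webhook source IDs in Vidar's database.
# Container name, database, user, and columns are assumptions, not documented values.
docker exec vidar-db \
  psql -U vidar -d vidar -c "SELECT id, name FROM aiops_alert_sources;"
```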
File: `services/grafana/provisioning/alerting/contactpoints.yml`
```yaml
apiVersion: 1

contactPoints:
  - orgId: 1
    name: vidar
    receivers:
      - uid: vidar-webhook
        type: webhook
        settings:
          url: http://vidar-api:8000/webhooks/grafana/{source_id}
          httpMethod: POST
        disableResolveMessage: false

policies:
  - orgId: 1
    receiver: vidar
    group_by:
      - grafana_folder
      - alertname
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 4h
```
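To verify that the contact point was actually picked up, Grafana's alerting provisioning API can be queried. This is a sketch: it assumes Grafana is reachable at `grafana:3000` on the Docker network and that admin basic-auth credentials are available in `GRAFANA_ADMIN_PASSWORD`.

```bash
# List provisioned contact points and filter for the vidar webhook receiver.
# Hostname and credential handling are assumptions for this environment.
curl -s -u "admin:${GRAFANA_ADMIN_PASSWORD}" \
  http://grafana:3000/api/v1/provisioning/contact-points \
  | jq '.[] | select(.name == "vidar")'
```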
### Alert Rules

Alert rules are provisioned from:

File: `services/grafana/provisioning/alerting/rules.yml`

See the Alert Rule Catalog below.
## Alert Rule Catalog

### Host Alerts (Infrastructure folder)
| Alert | Condition | Severity | Runbook |
|---|---|---|---|
| Host High CPU | CPU > 80% for 5m | Warning | Host-High-CPU |
| Host Critical CPU | CPU > 95% for 2m | Critical | Host-High-CPU |
| Host High Memory | Memory > 85% for 5m | Warning | Host-High-Memory |
| Host Critical Memory | Memory > 95% for 2m | Critical | Host-High-Memory |
| Host High Disk | Disk > 80% for 5m | Warning | Host-High-Disk |
| Host Critical Disk | Disk > 90% for 2m | Critical | Host-High-Disk |
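These thresholds map onto standard node_exporter metrics. As a quick sanity check, a condition such as Host High CPU can be evaluated directly against Prometheus; the query below is a sketch that assumes node_exporter's `node_cpu_seconds_total` metric and a Prometheus instance reachable at `prometheus:9090`.

```bash
# Ad hoc evaluation of the "CPU > 80%" condition via the Prometheus HTTP API.
# Metric name assumes node_exporter; the Prometheus hostname is an assumption.
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode \
  'query=100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80'
```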
### Container Alerts
| Alert | Condition | Severity |
|---|---|---|
| Container Down | Container not running | Critical |
| Container High CPU | CPU > 80% for 5m | Warning |
| Container High Memory | Memory > 80% for 5m | Warning |
| Container Restart Loop | > 5 restarts in 10m | Warning |
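The container conditions can be spot-checked the same way, assuming cAdvisor metrics are scraped into the same Prometheus instance. The expression below sketches the Container High Memory check; containers without a memory limit report a limit of 0 and should be filtered out if they show up as `+Inf`.

```bash
# Ad hoc check of the "Container Memory > 80%" condition against cAdvisor metrics (assumed present).
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode \
  'query=(container_memory_usage_bytes{name!=""} / container_spec_memory_limit_bytes{name!=""}) * 100 > 80'
```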
### Service Alerts
| Alert | Condition | Severity |
|---|---|---|
| Service Unavailable | Health check failed for 2m | Critical |
| High Error Rate | 5xx > 5% for 5m | Warning |
| High Latency | P99 > 2s for 5m | Warning |
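For High Latency, P99 is typically computed from a request-duration histogram (or an equivalent TraceQL query against Tempo). The expression below is a sketch; `http_request_duration_seconds_bucket` is a placeholder metric name to be replaced with whatever histogram your services actually export.

```bash
# Ad hoc check of the "P99 > 2s" condition; the histogram metric name is a placeholder.
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode \
  'query=histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2'
```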
## Creating New Alerts

### Via Grafana UI

- Go to Alerting → Alert rules
- Click + New alert rule
- Configure the query using Prometheus, Loki, or Tempo
- Set the threshold condition
- Add labels:
  - `severity: warning` or `severity: critical`
- Add annotations:
  - `summary`: Brief description
  - `description`: Detailed info with `{{ $labels.xxx }}`
  - `runbook_url`: Link to runbook
### Via Provisioning

Add to `services/grafana/provisioning/alerting/rules.yml`:
```yaml
groups:
  - orgId: 1
    name: My Alert Group
    folder: Infrastructure
    interval: 1m
    rules:
      - uid: my-alert-id
        title: My Alert Name
        condition: C  # refId of the condition query; the reduce/threshold expressions (B, C) are omitted here for brevity
        data:
          - refId: A
            datasourceUid: prometheus
            model:
              expr: my_metric > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Alert summary"
          runbook_url: "https://..."
```
## Vidar Integration

### Webhook Payload

Grafana sends alerts in this format:
```json
{
  "receiver": "vidar",
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "HostHighCpuUsage",
        "severity": "warning",
        "instance": "node-exporter:9100"
      },
      "annotations": {
        "summary": "High CPU usage on node-exporter:9100",
        "runbook_url": "https://..."
      },
      "startsAt": "2024-01-05T10:00:00Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "https://grafana.ravenhelm.dev/..."
    }
  ]
}
```
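For end-to-end testing, the same payload can be posted directly to the Vidar webhook, exactly as Grafana would on a firing alert. A minimal example using the documented `source_id` and the sample payload above (it assumes you are on the same Docker network as `vidar-api`):

```bash
# Send a minimal firing alert to Vidar in the Grafana webhook format.
curl -s -X POST \
  "http://vidar-api:8000/webhooks/grafana/204873fc-cd07-47c7-b74f-be61cfbb9584" \
  -H "Content-Type: application/json" \
  -d '{
    "receiver": "vidar",
    "status": "firing",
    "alerts": [{
      "status": "firing",
      "labels": {"alertname": "HostHighCpuUsage", "severity": "warning", "instance": "node-exporter:9100"},
      "annotations": {"summary": "High CPU usage on node-exporter:9100"},
      "startsAt": "2024-01-05T10:00:00Z",
      "endsAt": "0001-01-01T00:00:00Z"
    }]
  }'
```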
### Vidar Actions

When alerts arrive, Vidar:

- Creates or updates alert instances
- Creates an incident if the alert is new and critical
- Spawns an SRE agent for automated diagnosis
- The agent can auto-remediate using runbooks
## Troubleshooting

### Alerts Not Reaching Vidar

- Check the Grafana contact point:

  ```bash
  docker logs grafana 2>&1 | grep -i alert
  ```

- Test the webhook manually (use the `source_id` from above):

  ```bash
  curl -X POST http://vidar-api:8000/webhooks/grafana/204873fc-cd07-47c7-b74f-be61cfbb9584 \
    -H "Content-Type: application/json" \
    -d '{"status":"firing","alerts":[...]}'
  ```

- Check the Vidar logs:

  ```bash
  docker logs vidar-api 2>&1 | grep -i grafana
  ```
### Alert Not Firing

- Check the Grafana alerting state:
  - Go to Alerting → Alert rules
  - Check the rule status (Normal, Pending, Firing)
- Verify the data source returns data:
  - Go to Explore
  - Run the alert query manually