Alert Rules & Routing

[[TOC]]

Alert Philosophy

Principles:

  1. Alert on symptoms, not causes - User impact, not resource utilization (see the sketch after this list)
  2. Every alert is actionable - Must suggest a next step or auto-remediate
  3. Route by blast radius - Critical → immediate agent action, warning → agent investigation
  4. Test with chaos - If an alert can't be tested, delete it
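
A minimal sketch of principle 1, contrasting a symptom-based rule with a cause-based one. The rule names and thresholds here are illustrative only; the symptom expression mirrors the HighErrorRate rule defined later on this page, and node_cpu_seconds_total comes from node_exporter.

  # Symptom: users are receiving errors, regardless of the underlying cause (alert on this)
  - alert: APIErrorRatioHigh
    expr: |
      (sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
        / sum(rate(http_requests_total[5m])) by (service)) > 0.05

  # Cause: high CPU alone says nothing about user impact (avoid paging on this)
  - alert: HighCPU
    expr: |
      (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.9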

Alert Tiers

| Tier | Description | Agent Action | Human Notification |
| --- | --- | --- | --- |
| Tier 0 | Autonomous Remediation | Remediate silently | None |
| Tier 1 | Supervised Remediation | Attempt resolution | If confidence < threshold |
| Tier 2 | Human-Only | Escalate immediately | Always |

Tier 0: Autonomous Remediation

No human notification. Agent handles silently.

File: /etc/prometheus/rules/tier0_autonomous.yml

groups:
  - name: tier0_autonomous_remediation
    interval: 30s
    rules:
      - alert: ContainerRestartLoop
        expr: |
          rate(container_start_time_seconds[2m]) > 0.5
        for: 1m
        labels:
          severity: warning
          tier: autonomous
          agent_action: remediate
          runbook_id: container-restart-loop-v1
        annotations:
          summary: "Container {{ $labels.container }} in restart loop"
          description: "Restart rate of {{ $value | humanize }}/s over the last 2 minutes"
          context_needed: "container_logs,recent_events"

      - alert: HighMemoryUsage
        expr: |
          (container_memory_usage_bytes / container_spec_memory_limit_bytes) > 0.9
        for: 2m
        labels:
          severity: warning
          tier: autonomous
          agent_action: remediate
          runbook_id: oom-prevention-v1
        annotations:
          summary: "Container {{ $labels.container }} memory usage critical"
          description: "Memory usage at {{ $value | humanizePercentage }}"
          recommended_action: "Increase memory limit or restart container"

      - alert: RedisEvictionSpike
        expr: |
          rate(redis_evicted_keys_total[5m]) > 100
        for: 2m
        labels:
          severity: warning
          tier: autonomous
          agent_action: investigate
          runbook_id: redis-memory-pressure-v1
        annotations:
          summary: "Redis evicting keys at high rate"
          description: "{{ $value | humanize }} keys/sec being evicted"

      - alert: PostgresDeadlock
        expr: |
          rate(pg_stat_database_deadlocks[5m]) > 0
        labels:
          severity: warning
          tier: autonomous
          agent_action: investigate
          runbook_id: postgres-deadlock-analysis-v1
        annotations:
          summary: "PostgreSQL deadlock detected"
          description: "Deadlock rate of {{ $value | humanize }}/s over the last 5 minutes"

      - alert: TraefikBackendDown
        expr: |
          traefik_backend_server_up == 0
        for: 1m
        labels:
          severity: critical
          tier: autonomous
          agent_action: remediate
          runbook_id: backend-health-check-failure-v1
        annotations:
          summary: "Backend {{ $labels.backend }} is down"
          description: "Traefik health check failing"

      - alert: HighErrorRate
        expr: |
          (sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service)) > 0.05
        for: 2m
        labels:
          severity: warning
          tier: autonomous
          agent_action: investigate
          runbook_id: high-error-rate-investigation-v1
        annotations:
          summary: "Service {{ $labels.service }} error rate elevated"
          description: "{{ $value | humanizePercentage }} of requests failing"

Tier 1: Agent Investigates, Escalates if Uncertain

Agent attempts resolution. Escalates to a human if its confidence falls below the rule's escalate_threshold label.

File: /etc/prometheus/rules/tier1_supervised.yml

groups:
  - name: tier1_supervised_remediation
    interval: 1m
    rules:
      - alert: LangGraphExecutionFailure
        expr: |
          rate(langgraph_execution_errors_total[5m]) > 0.1
        labels:
          severity: warning
          tier: supervised
          agent_action: investigate
          escalate_threshold: "0.7"
          runbook_id: langgraph-execution-failure-v1
        annotations:
          summary: "LangGraph agent failures elevated"
          description: "{{ $value | humanize }} failures/sec"

      - alert: OllamaModelLoadFailure
        expr: |
          rate(ollama_model_load_errors_total[5m]) > 0
        labels:
          severity: critical
          tier: supervised
          agent_action: investigate
          escalate_threshold: "0.8"
          runbook_id: ollama-model-load-failure-v1
        annotations:
          summary: "Ollama failing to load models"
          description: "Check model availability and memory"

      - alert: ZitadelAuthFailureSpike
        expr: |
          (rate(zitadel_auth_failures_total[5m])
            / rate(zitadel_auth_attempts_total[5m])) > 0.1
        for: 3m
        labels:
          severity: warning
          tier: supervised
          agent_action: investigate
          escalate_threshold: "0.7"
          runbook_id: auth-failure-spike-v1
        annotations:
          summary: "Authentication failures spiking"
          description: "{{ $value | humanizePercentage }} of auth attempts failing"
          context_needed: "user_agents,ip_addresses,error_types"

      - alert: LiveKitRoomCreationFailure
        expr: |
          rate(livekit_room_creation_errors_total[5m]) > 0.1
        labels:
          severity: critical
          tier: supervised
          agent_action: investigate
          escalate_threshold: "0.8"
          runbook_id: livekit-room-failure-v1
        annotations:
          summary: "LiveKit room creation failing"
          description: "Voice calls may be impacted"

Tier 2: Human-Only Alerts

Immediate escalation. No agent attempt.

File: /etc/prometheus/rules/tier2_escalate.yml

groups:
  - name: tier2_immediate_escalation
    interval: 1m
    rules:
      - alert: DataLossRisk
        expr: |
          pg_stat_replication_replay_lag_seconds > 60
        for: 5m
        labels:
          severity: critical
          tier: human_only
          agent_action: escalate_immediately
          oncall_team: dba
        annotations:
          summary: "PostgreSQL replication lag critical"
          description: "Replication {{ $value | humanizeDuration }} behind"
          impact: "Data loss risk if primary fails"

      - alert: SecurityIncidentSuspected
        expr: |
          rate(spire_svid_invalid_requests_total[5m]) > 10
        labels:
          severity: critical
          tier: human_only
          agent_action: escalate_immediately
          oncall_team: security
        annotations:
          summary: "Abnormal SPIRE SVID requests"
          description: "Possible workload identity compromise"

      - alert: BillingAnomalyDetected
        expr: |
          increase(anthropic_api_requests_total[1h]) > 10000
        labels:
          severity: warning
          tier: human_only
          agent_action: escalate_immediately
          oncall_team: platform_lead
        annotations:
          summary: "Unexpected API usage spike"
          description: "{{ $value | humanize }} requests in last hour"
          impact: "Cost overrun risk"

      - alert: DiskSpaceCritical
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"}
            / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
        for: 5m
        labels:
          severity: critical
          tier: human_only
          agent_action: escalate_immediately
        annotations:
          summary: "Disk space critically low"
          description: "Only {{ $value | humanizePercentage }} available"
          impact: "System may become unresponsive"

Grafana Contact Points

File: /etc/grafana/provisioning/alerting/contactpoints.yaml

apiVersion: 1
contactPoints:
  # Primary: Kafka webhook for agent consumption
  - orgId: 1
    name: sre-agent-kafka
    receivers:
      - uid: kafka-webhook-primary
        type: webhook
        settings:
          url: http://kafka-bridge:8080/alerts
          httpMethod: POST
          maxAlerts: 0 # Send all alerts
        disableResolveMessage: false

  # Secondary: Slack for human escalations
  - orgId: 1
    name: platform-oncall-slack
    receivers:
      - uid: slack-platform
        type: slack
        settings:
          url: "${SLACK_WEBHOOK_URL}"
        disableResolveMessage: false

  # Tertiary: PagerDuty for critical escalations
  - orgId: 1
    name: pagerduty-critical
    receivers:
      - uid: pagerduty-sre
        type: pagerduty
        settings:
          integrationKey: "${PAGERDUTY_INTEGRATION_KEY}"
          severity: critical
        disableResolveMessage: false
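
For reference, the kafka-bridge endpoint above (a service this page assumes exists) receives Grafana's standard webhook notification body. A trimmed sketch of its shape, shown as YAML for readability; the wire format is JSON and the exact field set varies by Grafana version.

receiver: sre-agent-kafka
status: firing
groupLabels:
  alertname: HighMemoryUsage
commonLabels:
  tier: autonomous
  agent_action: remediate
alerts:
  - status: firing
    labels:
      alertname: HighMemoryUsage
      container: api
      severity: warning
      tier: autonomous
      agent_action: remediate
      runbook_id: oom-prevention-v1
    annotations:
      summary: "Container api memory usage critical"
      recommended_action: "Increase memory limit or restart container"
    startsAt: "2025-01-01T00:00:00Z" # illustrative timestamp
    fingerprint: "<alert fingerprint>"

The tier, agent_action, and runbook_id labels defined in the rule files arrive under alerts[].labels, which is what the downstream consumer keys on.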

Notification Policies

File: /etc/grafana/provisioning/alerting/policies.yaml

apiVersion: 1
policies:
  - orgId: 1
    receiver: sre-agent-kafka
    group_by: ['alertname', 'container', 'service']
    group_wait: 10s
    group_interval: 5m
    repeat_interval: 12h

    routes:
      # Tier 0: All autonomous alerts → Kafka only
      - receiver: sre-agent-kafka
        matchers:
          - tier = autonomous
        group_wait: 5s
        group_interval: 2m
        continue: false

      # Tier 1: Supervised → Kafka immediately, escalate to Slack if still firing
      - receiver: sre-agent-kafka
        matchers:
          - tier = supervised
        group_wait: 10s
        group_interval: 5m
        continue: true
        routes:
          # A matching nested policy takes over from its parent, so Kafka delivery
          # is repeated here with continue: true ahead of the delayed Slack hop.
          - receiver: sre-agent-kafka
            matchers:
              - tier = supervised
            continue: true
          - receiver: platform-oncall-slack
            matchers:
              - tier = supervised
            group_wait: 10m
            repeat_interval: 30m

      # Tier 2: Human-only → Slack immediately, PagerDuty if critical
      - receiver: platform-oncall-slack
        matchers:
          - tier = human_only
        group_wait: 30s
        continue: true
        routes:
          - receiver: platform-oncall-slack
            matchers:
              - tier = human_only
            continue: true
          - receiver: pagerduty-critical
            matchers:
              - tier = human_only
              - severity = critical
            group_wait: 1m
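
A worked trace of the tree above, assuming standard Alertmanager-style routing in which a matching nested policy takes over from its parent:

# {tier="autonomous", severity="warning"}    e.g. HighMemoryUsage
#   → "tier = autonomous" route → sre-agent-kafka after 5s; continue: false stops evaluation.
#
# {tier="supervised", severity="critical"}   e.g. OllamaModelLoadFailure
#   → "tier = supervised" route → nested routes: sre-agent-kafka after 10s (continue: true),
#     then platform-oncall-slack after a 10m group_wait, repeating every 30m while firing.
#
# {tier="human_only", severity="critical"}   e.g. DataLossRisk
#   → "tier = human_only" route → platform-oncall-slack after 30s (continue: true),
#     then pagerduty-critical after 1m because severity = critical also matches.
#
# {tier="human_only", severity="warning"}    e.g. BillingAnomalyDetected
#   → platform-oncall-slack only; the PagerDuty child requires severity = critical.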

Next: [[AIOps-Agent-Architecture]] - LangGraph state machine and Kafka consumer