Dashboard Strategy
[[TOC]]
Philosophy
Dashboards tell stories, not just display metrics.
Dashboard Hierarchy
Home Dashboard (default)
├─→ Drill-down: Infrastructure
├─→ Drill-down: Identity & Security
├─→ Drill-down: AI & Agents
├─→ Drill-down: Voice & Telephony
├─→ Drill-down: DevOps & CI/CD
├─→ Drill-down: Observability Stack
└─→ Drill-down: SRE Agent Performance (meta)
Dashboard: 01_Platform_Health
Purpose: Executive overview, demo entry point
Row 1: Key Performance Indicators
┌─────────────┬─────────────┬─────────────┬─────────────┐
│ Services Up │ Active      │ Incidents   │ MTTR        │
│             │ Incidents   │ Resolved    │             │
│ 26/26       │ 0           │ Today: 3    │ 43s avg     │
│ ✅ 100%     │             │             │             │
└─────────────┴─────────────┴─────────────┴─────────────┘
Queries (Services Up):
# numerator: services currently up
sum(up{job="docker"})
# denominator for the 26/26 stat: total services scraped
count(up{job="docker"})
Query (MTTR):
avg_over_time(incident_resolution_duration_seconds[24h])
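The MTTR stat assumes the agent itself exports incident_resolution_duration_seconds; Prometheus has no built-in notion of incident resolution. A minimal instrumentation sketch (using prometheus_client; the port and function name are illustrative, not part of this design):

# Hedged sketch: gauge behind the MTTR query, set once per resolved incident.
from prometheus_client import Gauge, start_http_server

resolution_duration = Gauge(
    "incident_resolution_duration_seconds",
    "Wall-clock seconds from alert received to health check passed",
)

def record_resolution(started_at: float, resolved_at: float) -> None:
    # avg_over_time(...[24h]) then averages the samples Prometheus scrapes.
    resolution_duration.set(resolved_at - started_at)

start_http_server(9109)  # hypothetical metrics port for Prometheus to scrape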
Row 2: Traffic Overview
┌────────────────────────────────────────────────────────────┐
│ Request Rate by Service (last 1h)            [Time Series]  │
│                                                             │
│ Legend:                                                     │
│   ━━━ Traefik     (450 req/min)                             │
│   ━━━ API Gateway (230 req/min)                             │
│   ━━━ Auth        (89 req/min)                              │
└────────────────────────────────────────────────────────────┘
Query:
sum(rate(http_requests_total[5m])) by (service) * 60
Row 3: Error Tracking
┌────────────────────────────────────────────────────────────┐
│ Error Rate by Service (last 15m)               [Bar Chart]  │
│                                                             │
│   Redis      ████    0.02%                                  │
│   Postgres   ▏       0.00%                                  │
│   Traefik    ██████  0.15%                                  │
│   LangGraph  ██      0.08%                                  │
└────────────────────────────────────────────────────────────┘
Query:
sum(rate({job="docker"} |= "error" [15m])) by (container)
/ sum(rate({job="docker"} [15m])) by (container)
Row 4: Recent Incidents (GitLab Integration)
┌────────────────────────────────────────────────────────────┐
│ Recent Incidents                                   [Table]  │
│                                                             │
│ ID   │ Service  │ Status       │ Duration │ Link            │
│ #847 │ Redis    │ ✅ Resolved  │ 43s      │ [View]          │
│ #846 │ Postgres │ ✅ Resolved  │ 2m 15s   │ [View]          │
│ #843 │ API GW   │ ⚠️ Escalated │ 12m 4s   │ [View]          │
└────────────────────────────────────────────────────────────┘
Data Source: GitLab API
Query:
GET /api/v4/projects/{project_id}/issues?labels=incident,aiops-managed&state=all&per_page=10
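A hedged sketch of that fetch (plain requests against the endpoint above; GITLAB_URL, PROJECT_ID and GITLAB_TOKEN are environment-variable assumptions, not values defined in this doc):

# Sketch: pull the incidents shown in the table via the GitLab issues API.
import os
import requests

GITLAB_URL = os.environ.get("GITLAB_URL", "https://gitlab.example.com")

resp = requests.get(
    f"{GITLAB_URL}/api/v4/projects/{os.environ['PROJECT_ID']}/issues",
    headers={"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]},
    params={"labels": "incident,aiops-managed", "state": "all", "per_page": 10},
    timeout=10,
)
resp.raise_for_status()

for issue in resp.json():
    # iid, state, title and web_url are standard fields in the issues API response.
    print(f"#{issue['iid']:<5} {issue['state']:<8} {issue['title']}  {issue['web_url']}")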
Row 5: Agent Activity Stream
┌────────────────────────────────────────────────────────────┐
│ Agent Reasoning (Live SSE Stream)             [Logs Panel]  │
│                                                             │
│ [11:43:21] 🔍 Alert received: Redis connection failures     │
│ [11:43:24] 📊 Querying Loki: 847 errors in 2 minutes        │
│ [11:43:26] 🧠 Root cause: Container OOM killed              │
│ [11:43:28] 📖 Matched runbook: redis-conn-failure-v1        │
│ [11:43:30] 🔧 Executing: docker restart redis               │
│ [11:44:04] ✅ Health check passed. Service restored.        │
└────────────────────────────────────────────────────────────┘
Data Source: Loki
Query:
# |= "" is a match-everything filter; the panel simply tails the full reasoning stream
{job="agent-thoughts"} |= ""
Dashboard: 02_Infrastructure
Traefik Panels
- Request rate (by route)
- Latency percentiles (p50, p95, p99)
- Error rate (4xx vs 5xx)
- Backend health status
Queries:
# Request rate
sum(rate(traefik_service_requests_total[5m])) by (service)
# Latency p99
histogram_quantile(0.99,
sum(rate(traefik_service_request_duration_seconds_bucket[5m])) by (le, service)
)
# Error rate
sum(rate(traefik_service_requests_total{code=~"5.."}[5m])) by (service)
/ sum(rate(traefik_service_requests_total[5m])) by (service)
PostgreSQL Panels
- Active connections
- Query duration (avg, p95)
- Deadlocks
- Replication lag
- Cache hit ratio
Queries:
# Active connections
pg_stat_database_numbackends
# Cache hit ratio
(sum(pg_stat_database_blks_hit)
/ (sum(pg_stat_database_blks_hit) + sum(pg_stat_database_blks_read)))
* 100
Redis Panels
- Operations per second
- Memory usage (used vs limit)
- Eviction rate
- Hit ratio
Queries:
# Ops/sec
sum(rate(redis_commands_processed_total[5m]))
# Memory usage %
(redis_memory_used_bytes / redis_memory_max_bytes) * 100
# Hit ratio
(redis_keyspace_hits_total / (redis_keyspace_hits_total + redis_keyspace_misses_total)) * 100
Dashboard: 04_AI_Agents
Purpose: Monitor agent performance and cost
LangGraph Executions
- Execution count (by agent type)
- Average duration
- Success rate
- Error breakdown
Queries:
# Execution count
sum(rate(langgraph_execution_total[5m])) by (agent_type)
# Success rate
sum(rate(langgraph_execution_total{status="success"}[5m]))
/ sum(rate(langgraph_execution_total[5m]))
* 100
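Both queries assume the agent service emits a langgraph_execution_total counter; a short instrumentation sketch (prometheus_client, with a hypothetical wrapper around a compiled LangGraph graph):

# Sketch: counter backing the execution-count and success-rate queries above.
# prometheus_client appends the _total suffix when the counter is exposed.
from prometheus_client import Counter

executions = Counter(
    "langgraph_execution",
    "LangGraph graph executions",
    ["agent_type", "status"],
)

def run_graph(agent_type, graph, state):
    # Hypothetical wrapper; graph.invoke(state) mirrors LangGraph's compiled-graph call.
    try:
        result = graph.invoke(state)
        executions.labels(agent_type=agent_type, status="success").inc()
        return result
    except Exception:
        executions.labels(agent_type=agent_type, status="error").inc()
        raise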
LLM Usage (Langfuse)
- Token usage (input/output)
- Cost per request
- Model distribution
- Trace duration
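No queries are listed here because these panels read from Langfuse rather than Prometheus. A heavily hedged sketch of pulling trace data from Langfuse's public REST API, to inspect token usage and cost fields before wiring panels (the endpoint, parameters and response shape should be checked against the Langfuse docs; the host and key names are assumptions):

# Sketch only: fetch recent traces from Langfuse. Authentication is Basic auth
# with the project's public/secret key pair.
import os
import requests

LANGFUSE_HOST = os.environ.get("LANGFUSE_HOST", "http://langfuse:3000")

resp = requests.get(
    f"{LANGFUSE_HOST}/api/public/traces",  # assumed endpoint
    auth=(os.environ["LANGFUSE_PUBLIC_KEY"], os.environ["LANGFUSE_SECRET_KEY"]),
    params={"limit": 20},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())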
Dashboard: 08_SRE_Agent_Performance
Purpose: Meta-monitoring of the AIOps system itself
Incident Lifecycle Metrics
┌────────────────────────────────────────────────────────────┐
│ Incident Detection Latency                       [Heatmap]  │
│ Time from service failure → alert fired                     │
│                                                             │
│ Avg: 8.2s | p95: 15s | p99: 23s                             │
└────────────────────────────────────────────────────────────┘
Query:
histogram_quantile(0.95,
sum(rate(incident_detection_latency_seconds_bucket[5m])) by (le)
)
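The heatmap and the p95 query above assume a Prometheus histogram; a hedged sketch of recording it (bucket edges are illustrative, chosen around the 8–23 s figures shown):

# Sketch: histogram behind incident_detection_latency_seconds_bucket.
from prometheus_client import Histogram

detection_latency = Histogram(
    "incident_detection_latency_seconds",
    "Time from service failure to alert fired",
    buckets=(1, 2, 5, 10, 15, 20, 30, 60),
)

def on_alert_fired(failure_ts: float, alert_ts: float) -> None:
    detection_latency.observe(alert_ts - failure_ts)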
Remediation Success Rate by Runbook
┌────────────────────────────────────────────────────────────┐
│ Remediation Success Rate by Runbook            [Bar Chart]  │
│                                                             │
│ redis-conn-failure-v1  ████████████████  94%                │
│ container-restart      █████████████     87%                │
│ oom-prevention         ██████████        78%                │
│ postgres-deadlock      ████████          65%                │
└────────────────────────────────────────────────────────────┘
Query:
sum(agent_remediation_success_total) by (runbook)
/ sum(agent_remediation_attempts_total) by (runbook)
* 100
Cost Metrics
┌────────────────────────────────────────────────────────────┐
│ Manual Hours Saved (estimated)                [Stat Panel]  │
│                                                             │
│ This Month: 47.2 hours                                      │
│ Cost Saved: $4,720 (@ $100/hr)                              │
└────────────────────────────────────────────────────────────┘
Formula:
hours_saved = incidents_resolved_autonomously * avg_human_response_time_hours
cost_saved  = hours_saved * hourly_rate
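As a sanity check, the stat panel's figures are consistent with this formula (a toy calculation; the incident count and average response time below are made-up inputs, and only the $100/hr rate and the 47.2 h / $4,720 outputs come from the panel):

# Toy check of the "Manual Hours Saved" panel.
INCIDENTS_RESOLVED_AUTONOMOUSLY = 118  # assumption
AVG_HUMAN_RESPONSE_TIME_HOURS = 0.4    # assumption
HOURLY_RATE_USD = 100                  # from the panel

hours_saved = INCIDENTS_RESOLVED_AUTONOMOUSLY * AVG_HUMAN_RESPONSE_TIME_HOURS
cost_saved = hours_saved * HOURLY_RATE_USD

print(f"Hours saved: {hours_saved:.1f}")   # 47.2
print(f"Cost saved:  ${cost_saved:,.0f}")  # 4,720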
Next: [[AIOps-Alert-Rules]] - Tier 0/1/2 alert configuration