Dashboard Strategy

[[TOC]]

Philosophy

Dashboards tell stories, not just display metrics.

Dashboard Hierarchy

Home Dashboard (default)
├─→ Drill-down: Infrastructure
├─→ Drill-down: Identity & Security
├─→ Drill-down: AI & Agents
├─→ Drill-down: Voice & Telephony
├─→ Drill-down: DevOps & CI/CD
├─→ Drill-down: Observability Stack
└─→ Drill-down: SRE Agent Performance (meta)
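
Drill-down navigation is wired up with Grafana dashboard links. Below is a minimal provisioning sketch against Grafana's HTTP API (POST /api/dashboards/db); the Grafana address, service-account token, dashboard UID, and tag names are assumptions, not part of this spec.

Example (Python):

import os
import requests

GRAFANA_URL = os.environ.get("GRAFANA_URL", "http://localhost:3000")   # assumed address
API_TOKEN = os.environ["GRAFANA_API_TOKEN"]                            # assumed service-account token

# Home dashboard that links to every drill-down dashboard tagged "aiops-drilldown".
home_dashboard = {
    "uid": "aiops-home",                        # hypothetical UID
    "title": "Home Dashboard",
    "tags": ["aiops", "overview"],
    "links": [
        {
            "type": "dashboards",               # link resolves dashboards by tag
            "title": "Drill-downs",
            "tags": ["aiops-drilldown"],
            "asDropdown": True,
        }
    ],
    "panels": [],                               # panel definitions omitted in this sketch
}

resp = requests.post(
    f"{GRAFANA_URL}/api/dashboards/db",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"dashboard": home_dashboard, "overwrite": True},
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["url"])                       # path of the saved dashboard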

Dashboard: 01_Platform_Health

Purpose: Executive overview, demo entry point

Row 1: Key Performance Indicators

┌──────────────┬──────────────┬──────────────┬──────────────┐
│ Services Up  │ Active       │ Incidents    │ MTTR         │
│              │ Incidents    │ Resolved     │              │
│ 26/26        │ 0            │ Today: 3     │ 43s avg      │
│ ✅ 100%      │              │              │              │
└──────────────┴──────────────┴──────────────┴──────────────┘

Query (Services Up):

# Targets up under job="docker" (total targets: count(up{job="docker"}))
sum(up{job="docker"})

Query (MTTR):

avg_over_time(incident_resolution_duration_seconds[24h])

Row 2: Traffic Overview

┌────────────────────────────────────────────────────────────┐
│ Request Rate by Service (last 1h)            [Time Series] │
│                                                            │
│ Legend:                                                    │
│   ━━━ Traefik      (450 req/min)                           │
│   ━━━ API Gateway  (230 req/min)                           │
│   ━━━ Auth         (89 req/min)                            │
└────────────────────────────────────────────────────────────┘

Query:

# Per-second rate scaled to requests per minute
sum(rate(http_requests_total[5m])) by (service) * 60

Row 3: Error Tracking

┌────────────────────────────────────────────────────────────┐
│ Error Rate by Service (last 15m)               [Bar Chart] │
│                                                            │
│   Redis      ████    0.02%                                 │
│   Postgres   ▏       0.00%                                 │
│   Traefik    ██████  0.15%                                 │
│   LangGraph  ██      0.08%                                 │
└────────────────────────────────────────────────────────────┘

Query:

sum(rate({job="docker"} |= "error" [15m])) by (container)
/ sum(rate({job="docker"} [15m])) by (container)

Row 4: Recent Incidents (GitLab Integration)

┌────────────────────────────────────────────────────────────┐
│ Recent Incidents                                   [Table] │
│                                                            │
│ ID   │ Service  │ Status       │ Duration │ Link           │
│ #847 │ Redis    │ ✅ Resolved  │ 43s      │ [View]         │
│ #846 │ Postgres │ ✅ Resolved  │ 2m 15s   │ [View]         │
│ #843 │ API GW   │ ⚠️ Escalated │ 12m 4s   │ [View]         │
└────────────────────────────────────────────────────────────┘

Data Source: GitLab API
Query:

GET /api/v4/projects/{project_id}/issues?labels=incident,aiops-managed&state=all&per_page=10
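
The same query can be exercised outside Grafana, e.g. from the agent or a smoke test. A minimal sketch with python-requests; the GitLab host, project ID, and token come from the environment and are assumptions.

Example (Python):

import os
import requests

GITLAB_URL = os.environ.get("GITLAB_URL", "https://gitlab.example.com")   # assumed host
PROJECT_ID = os.environ["GITLAB_PROJECT_ID"]           # numeric ID or URL-encoded path
TOKEN = os.environ["GITLAB_TOKEN"]                     # access token with read_api scope

resp = requests.get(
    f"{GITLAB_URL}/api/v4/projects/{PROJECT_ID}/issues",
    headers={"PRIVATE-TOKEN": TOKEN},
    params={"labels": "incident,aiops-managed", "state": "all", "per_page": 10},
    timeout=10,
)
resp.raise_for_status()

for issue in resp.json():
    # iid, state, title and web_url are standard Issues API fields
    print(f"#{issue['iid']:<5} {issue['state']:<8} {issue['title']}  {issue['web_url']}")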

Row 5: Agent Activity Stream

┌────────────────────────────────────────────────────────────┐
│ Agent Reasoning (Live SSE Stream)             [Logs Panel] │
│                                                            │
│ [11:43:21] 🔍 Alert received: Redis connection failures    │
│ [11:43:24] 📊 Querying Loki: 847 errors in 2 minutes       │
│ [11:43:26] 🧠 Root cause: Container OOM killed             │
│ [11:43:28] 📖 Matched runbook: redis-conn-failure-v1       │
│ [11:43:30] 🔧 Executing: docker restart redis              │
│ [11:44:04] ✅ Health check passed. Service restored.       │
└────────────────────────────────────────────────────────────┘

Data Source: Loki
Query:

{job="agent-thoughts"} |= ""
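
This panel assumes the agent pushes its reasoning steps to Loki under the job="agent-thoughts" label. A minimal sketch of emitting one such line through Loki's push API (/loki/api/v1/push); the Loki address is an assumption.

Example (Python):

import time
import requests

LOKI_URL = "http://loki:3100"       # assumed in-cluster address

def log_agent_thought(message: str) -> None:
    """Push a single reasoning step into the agent-thoughts stream."""
    payload = {
        "streams": [
            {
                "stream": {"job": "agent-thoughts"},
                # each value is [<unix epoch in nanoseconds, as a string>, <log line>]
                "values": [[str(time.time_ns()), message]],
            }
        ]
    }
    resp = requests.post(f"{LOKI_URL}/loki/api/v1/push", json=payload, timeout=5)
    resp.raise_for_status()

log_agent_thought("🔍 Alert received: Redis connection failures")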

Dashboard: 02_Infrastructure

Traefik Panels

  • Request rate (by route)
  • Latency percentiles (p50, p95, p99)
  • Error rate (4xx vs 5xx)
  • Backend health status

Queries:

# Request rate
sum(rate(traefik_service_requests_total[5m])) by (service)

# Latency p99
histogram_quantile(0.99,
  sum(rate(traefik_service_request_duration_seconds_bucket[5m])) by (le, service)
)

# Error rate
sum(rate(traefik_service_requests_total{code=~"5.."}[5m])) by (service)
/ sum(rate(traefik_service_requests_total[5m])) by (service)

PostgreSQL Panels

  • Active connections
  • Query duration (avg, p95)
  • Deadlocks
  • Replication lag
  • Cache hit ratio

Queries:

# Active connections
pg_stat_database_numbackends

# Cache hit ratio
(sum(pg_stat_database_blks_hit)
  / (sum(pg_stat_database_blks_hit) + sum(pg_stat_database_blks_read)))
  * 100

Redis Panels

  • Operations per second
  • Memory usage (used vs limit)
  • Eviction rate
  • Hit ratio

Queries:

# Ops/sec
sum(rate(redis_commands_processed_total[5m]))

# Memory usage %
(redis_memory_used_bytes / redis_memory_max_bytes) * 100

# Hit ratio
(redis_keyspace_hits_total / (redis_keyspace_hits_total + redis_keyspace_misses_total)) * 100

Dashboard: 04_AI_Agents

Purpose: Monitor agent performance and cost

LangGraph Executions

  • Execution count (by agent type)
  • Average duration
  • Success rate
  • Error breakdown

Queries:

# Execution count
sum(rate(langgraph_execution_total[5m])) by (agent_type)

# Success rate
sum(rate(langgraph_execution_total{status="success"}[5m]))
/ sum(rate(langgraph_execution_total[5m]))
* 100
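
Both queries assume the agent service exports a langgraph_execution_total counter with agent_type and status labels. A minimal sketch with prometheus_client; metric and label names are taken from the queries above, while the port and agent type are assumptions.

Example (Python):

from prometheus_client import Counter, start_http_server

# Counter behind the execution-count and success-rate queries above;
# prometheus_client exposes it as langgraph_execution_total.
LANGGRAPH_EXECUTIONS = Counter(
    "langgraph_execution",
    "LangGraph agent executions",
    ["agent_type", "status"],
)

def record_execution(agent_type: str, succeeded: bool) -> None:
    """Increment the counter once per finished agent run."""
    LANGGRAPH_EXECUTIONS.labels(
        agent_type=agent_type,
        status="success" if succeeded else "error",
    ).inc()

if __name__ == "__main__":
    start_http_server(9200)                         # assumed scrape port
    record_execution("incident_responder", True)    # hypothetical agent type
    # ... in a real service the process stays up so Prometheus can scrape /metrics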

LLM Usage (Langfuse)

  • Token usage (input/output)
  • Cost per request
  • Model distribution
  • Trace duration

Dashboard: 08_SRE_Agent_Performance

Purpose: Meta-monitoring of the AIOps system itself

Incident Lifecycle Metrics

┌────────────────────────────────────────────────────────────┐
│ Incident Detection Latency                       [Heatmap] │
│ Time from service failure → alert fired                    │
│                                                            │
│ Avg: 8.2s  |  p95: 15s  |  p99: 23s                        │
└────────────────────────────────────────────────────────────┘

Query:

histogram_quantile(0.95,
  sum(rate(incident_detection_latency_seconds_bucket[5m])) by (le)
)

Remediation Success Rate by Runbook

┌────────────────────────────────────────────────────────────┐
│ Remediation Success Rate by Runbook            [Bar Chart] │
│                                                            │
│   redis-conn-failure-v1  ████████████████  94%             │
│   container-restart      █████████████     87%             │
│   oom-prevention         ██████████        78%             │
│   postgres-deadlock      ████████          65%             │
└────────────────────────────────────────────────────────────┘

Query:

sum(agent_remediation_success_total) by (runbook)
/ sum(agent_remediation_attempts_total) by (runbook)
* 100

Cost Metrics

┌────────────────────────────────────────────────────────────┐
│ Manual Hours Saved (estimated)                [Stat Panel] │
│                                                            │
│ This Month:  47.2 hours                                    │
│ Cost Saved:  $4,720 (@ $100/hr)                            │
└────────────────────────────────────────────────────────────┘

Formula:

(incidents_resolved_autonomously * avg_human_response_time_hours) * hourly_rate
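
A worked example of the formula in Python; the incident count and average human response time are hypothetical inputs chosen to reproduce the stat panel's figures.

Example (Python):

def manual_cost_saved(incidents_resolved_autonomously: int,
                      avg_human_response_time_hours: float,
                      hourly_rate: float) -> tuple[float, float]:
    """Return (hours_saved, dollars_saved) per the formula above."""
    hours = incidents_resolved_autonomously * avg_human_response_time_hours
    return hours, hours * hourly_rate

# Hypothetical inputs that reproduce the stat panel: 59 incidents × 0.8 h = 47.2 h
hours, cost = manual_cost_saved(59, 0.8, 100.0)
print(f"{hours:.1f} hours saved, ${cost:,.0f} saved")   # 47.2 hours saved, $4,720 saved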

Next: [[AIOps-Alert-Rules]] - Tier 0/1/2 alert configuration