Dashboard Strategy

[[TOC]]

Philosophy

Dashboards tell stories, not just display metrics.

Dashboard Hierarchy

Home Dashboard (default)
├─→ Drill-down: Infrastructure
├─→ Drill-down: Identity & Security
├─→ Drill-down: AI & Agents
├─→ Drill-down: Voice & Telephony
├─→ Drill-down: DevOps & CI/CD
├─→ Drill-down: Observability Stack
└─→ Drill-down: SRE Agent Performance (meta)
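
Drill-down navigation is wired up with Grafana dashboard links. Below is a minimal provisioning sketch against Grafana's HTTP API (POST /api/dashboards/db); the Grafana address, service-account token, dashboard UID, and tag names are assumptions, not part of this spec.

Example (Python):

import os
import requests

GRAFANA_URL = os.environ.get("GRAFANA_URL", "http://localhost:3000")   # assumed address
API_TOKEN = os.environ["GRAFANA_API_TOKEN"]                            # assumed service-account token

# Home dashboard that links to every drill-down dashboard tagged "aiops-drilldown".
home_dashboard = {
    "uid": "aiops-home",                        # hypothetical UID
    "title": "Home Dashboard",
    "tags": ["aiops", "overview"],
    "links": [
        {
            "type": "dashboards",               # link resolves dashboards by tag
            "title": "Drill-downs",
            "tags": ["aiops-drilldown"],
            "asDropdown": True,
        }
    ],
    "panels": [],                               # panel definitions omitted in this sketch
}

resp = requests.post(
    f"{GRAFANA_URL}/api/dashboards/db",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"dashboard": home_dashboard, "overwrite": True},
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["url"])                       # path of the saved dashboard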

Dashboard: 01_Platform_Health

Purpose: Executive overview, demo entry point

Row 1: Key Performance Indicators

┌──────────────┬──────────────┬──────────────┬──────────────┐
│ Services Up  │ Active       │ Incidents    │ MTTR         │
│              │ Incidents    │ Resolved     │              │
│ 26/26        │ 0            │ Today: 3     │ 43s avg      │
│ ✅ 100%      │              │              │              │
└──────────────┴──────────────┴──────────────┴──────────────┘

Query (Services Up):

# Targets up under job="docker" (total targets: count(up{job="docker"}))
sum(up{job="docker"})

Query (MTTR):

avg_over_time(incident_resolution_duration_seconds[24h])

Row 2: Traffic Overview

┌────────────────────────────────────────────────────────────┐
│ Request Rate by Service (last 1h)            [Time Series] │
│                                                            │
│ Legend:                                                    │
│   ━━━ Traefik      (450 req/min)                           │
│   ━━━ API Gateway  (230 req/min)                           │
│   ━━━ Auth         (89 req/min)                            │
└────────────────────────────────────────────────────────────┘

Query:

# Per-second rate scaled to requests per minute
sum(rate(http_requests_total[5m])) by (service) * 60

Row 3: Error Tracking

┌────────────────────────────────────────────────────────────┐
│ Error Rate by Service (last 15m)               [Bar Chart] │
│                                                            │
│   Redis      ████    0.02%                                 │
│   Postgres   ▏       0.00%                                 │
│   Traefik    ██████  0.15%                                 │
│   LangGraph  ██      0.08%                                 │
└────────────────────────────────────────────────────────────┘

Query:

sum(rate({job="docker"} |= "error" [15m])) by (container)
/ sum(rate({job="docker"} [15m])) by (container)

Row 4: Recent Incidents (GitLab Integration)

┌────────────────────────────────────────────────────────────┐
│ Recent Incidents                                   [Table] │
│                                                            │
│ ID   │ Service  │ Status       │ Duration │ Link           │
│ #847 │ Redis    │ ✅ Resolved  │ 43s      │ [View]         │
│ #846 │ Postgres │ ✅ Resolved  │ 2m 15s   │ [View]         │
│ #843 │ API GW   │ ⚠️ Escalated │ 12m 4s   │ [View]         │
└────────────────────────────────────────────────────────────┘

Data Source: GitLab API
Query:

GET /api/v4/projects/{project_id}/issues?labels=incident,aiops-managed&state=all&per_page=10
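
The same query can be exercised outside Grafana, e.g. from the agent or a smoke test. A minimal sketch with python-requests; the GitLab host, project ID, and token come from the environment and are assumptions.

Example (Python):

import os
import requests

GITLAB_URL = os.environ.get("GITLAB_URL", "https://gitlab.example.com")   # assumed host
PROJECT_ID = os.environ["GITLAB_PROJECT_ID"]           # numeric ID or URL-encoded path
TOKEN = os.environ["GITLAB_TOKEN"]                     # access token with read_api scope

resp = requests.get(
    f"{GITLAB_URL}/api/v4/projects/{PROJECT_ID}/issues",
    headers={"PRIVATE-TOKEN": TOKEN},
    params={"labels": "incident,aiops-managed", "state": "all", "per_page": 10},
    timeout=10,
)
resp.raise_for_status()

for issue in resp.json():
    # iid, state, title and web_url are standard Issues API fields
    print(f"#{issue['iid']:<5} {issue['state']:<8} {issue['title']}  {issue['web_url']}")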

Row 5: Agent Activity Stream

┌────────────────────────────────────────────────────────────┐
│ Agent Reasoning (Live SSE Stream)             [Logs Panel] │
│                                                            │
│ [11:43:21] 🔍 Alert received: Redis connection failures    │
│ [11:43:24] 📊 Querying Loki: 847 errors in 2 minutes       │
│ [11:43:26] 🧠 Root cause: Container OOM killed             │
│ [11:43:28] 📖 Matched runbook: redis-conn-failure-v1       │
│ [11:43:30] 🔧 Executing: docker restart redis              │
│ [11:44:04] ✅ Health check passed. Service restored.       │
└────────────────────────────────────────────────────────────┘

Data Source: Loki
Query:

{job="agent-thoughts"} |= ""
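
This panel assumes the agent pushes its reasoning steps to Loki under the job="agent-thoughts" label. A minimal sketch of emitting one such line through Loki's push API (/loki/api/v1/push); the Loki address is an assumption.

Example (Python):

import time
import requests

LOKI_URL = "http://loki:3100"       # assumed in-cluster address

def log_agent_thought(message: str) -> None:
    """Push a single reasoning step into the agent-thoughts stream."""
    payload = {
        "streams": [
            {
                "stream": {"job": "agent-thoughts"},
                # each value is [<unix epoch in nanoseconds, as a string>, <log line>]
                "values": [[str(time.time_ns()), message]],
            }
        ]
    }
    resp = requests.post(f"{LOKI_URL}/loki/api/v1/push", json=payload, timeout=5)
    resp.raise_for_status()

log_agent_thought("🔍 Alert received: Redis connection failures")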

Dashboard: 02_Infrastructure

Traefik Panels

  • Request rate (by route)
  • Latency percentiles (p50, p95, p99)
  • Error rate (4xx vs 5xx)
  • Backend health status

Queries:

# Request rate
sum(rate(traefik_service_requests_total[5m])) by (service)

# Latency p99
histogram_quantile(0.99,
  sum(rate(traefik_service_request_duration_seconds_bucket[5m])) by (le, service)
)

# Error rate
sum(rate(traefik_service_requests_total{code=~"5.."}[5m])) by (service)
/ sum(rate(traefik_service_requests_total[5m])) by (service)

PostgreSQL Panels

  • Active connections
  • Query duration (avg, p95)
  • Deadlocks
  • Replication lag
  • Cache hit ratio

Queries:

# Active connections
pg_stat_database_numbackends

# Cache hit ratio
(sum(pg_stat_database_blks_hit)
  / (sum(pg_stat_database_blks_hit) + sum(pg_stat_database_blks_read)))
  * 100

Redis Panels

  • Operations per second
  • Memory usage (used vs limit)
  • Eviction rate
  • Hit ratio

Queries:

# Ops/sec
sum(rate(redis_commands_processed_total[5m]))

# Memory usage %
(redis_memory_used_bytes / redis_memory_max_bytes) * 100

# Hit ratio
(redis_keyspace_hits_total / (redis_keyspace_hits_total + redis_keyspace_misses_total)) * 100

Dashboard: 04_AI_Agents

Purpose: Monitor agent performance and cost

LangGraph Executions

  • Execution count (by agent type)
  • Average duration
  • Success rate
  • Error breakdown

Queries:

# Execution count
sum(rate(langgraph_execution_total[5m])) by (agent_type)

# Success rate
sum(rate(langgraph_execution_total{status="success"}[5m]))
/ sum(rate(langgraph_execution_total[5m]))
* 100
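
Both queries assume the agent service exports a langgraph_execution_total counter with agent_type and status labels. A minimal sketch with prometheus_client; metric and label names are taken from the queries above, while the port and agent type are assumptions.

Example (Python):

from prometheus_client import Counter, start_http_server

# Counter behind the execution-count and success-rate queries above;
# prometheus_client exposes it as langgraph_execution_total.
LANGGRAPH_EXECUTIONS = Counter(
    "langgraph_execution",
    "LangGraph agent executions",
    ["agent_type", "status"],
)

def record_execution(agent_type: str, succeeded: bool) -> None:
    """Increment the counter once per finished agent run."""
    LANGGRAPH_EXECUTIONS.labels(
        agent_type=agent_type,
        status="success" if succeeded else "error",
    ).inc()

if __name__ == "__main__":
    start_http_server(9200)                         # assumed scrape port
    record_execution("incident_responder", True)    # hypothetical agent type
    # ... in a real service the process stays up so Prometheus can scrape /metrics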

LLM Usage (Langfuse)

  • Token usage (input/output)
  • Cost per request
  • Model distribution
  • Trace duration

Dashboard: 08_SRE_Agent_Performance

Purpose: Meta-monitoring of the AIOps system itself

Incident Lifecycle Metrics

┌────────────────────────────────────────────────────────────┐
│ Incident Detection Latency                       [Heatmap] │
│ Time from service failure → alert fired                    │
│                                                            │
│ Avg: 8.2s  |  p95: 15s  |  p99: 23s                        │
└────────────────────────────────────────────────────────────┘

Query:

histogram_quantile(0.95,
  sum(rate(incident_detection_latency_seconds_bucket[5m])) by (le)
)

Remediation Success Rate by Runbook

┌────────────────────────────────────────────────────────────┐
│ Remediation Success Rate by Runbook            [Bar Chart] │
│                                                            │
│   redis-conn-failure-v1  ████████████████  94%             │
│   container-restart      █████████████     87%             │
│   oom-prevention         ██████████        78%             │
│   postgres-deadlock      ████████          65%             │
└────────────────────────────────────────────────────────────┘

Query:

sum(agent_remediation_success_total) by (runbook)
/ sum(agent_remediation_attempts_total) by (runbook)
* 100

Cost Metrics

┌────────────────────────────────────────────────────────────┐
│ Manual Hours Saved (estimated)                [Stat Panel] │
│                                                            │
│ This Month:  47.2 hours                                    │
│ Cost Saved:  $4,720 (@ $100/hr)                            │
└────────────────────────────────────────────────────────────┘

Formula:

(incidents_resolved_autonomously * avg_human_response_time_hours) * hourly_rate
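
A worked example of the formula in Python; the incident count and average human response time are hypothetical inputs chosen to reproduce the stat panel's figures.

Example (Python):

def manual_cost_saved(incidents_resolved_autonomously: int,
                      avg_human_response_time_hours: float,
                      hourly_rate: float) -> tuple[float, float]:
    """Return (hours_saved, dollars_saved) per the formula above."""
    hours = incidents_resolved_autonomously * avg_human_response_time_hours
    return hours, hours * hourly_rate

# Hypothetical inputs that reproduce the stat panel: 59 incidents × 0.8 h = 47.2 h
hours, cost = manual_cost_saved(59, 0.8, 100.0)
print(f"{hours:.1f} hours saved, ${cost:,.0f} saved")   # 47.2 hours saved, $4,720 saved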

Next: [[AIOps-Alert-Rules]] - Tier 0/1/2 alert configuration