# Demo Choreography & Implementation Roadmap
[[TOC]]
## Demo Script (2-Minute Full Cycle)
Objective: Show autonomous incident detection -> investigation -> remediation -> GitLab audit trail
### Prerequisites
- Grafana dashboard open on screen 1
- GitLab issues board open on screen 2
- Terminal with chaos injection API ready
### Act 1: Baseline (30 seconds)
[Screen 1: Platform Health Dashboard]

Presenter: "This is Ravenhelm, our AI-powered platform. Everything's green: 26 services running, zero incidents in the last hour."
[Show KPI panel]
- Services: 26/26 (100%)
- Active Incidents: 0
- MTTR: 43s avg
### Act 2: Chaos Injection (15 seconds)
[Terminal]
Presenter: "Let me introduce a realistic failure: Redis memory pressure."
Command:

```bash
curl -X POST http://localhost:8081/chaos/memory-fill/redis
```

Output:

```json
{
  "status": "injecting",
  "service": "redis",
  "type": "memory_fill",
  "target_usage": "95%"
}
```
[Screen 1: Dashboard updates]
- Redis memory usage spikes to 94%
- Error rate panel shows uptick
- Alert fires: "HighMemoryUsage"
### Act 3: Agent Detection & Investigation (45 seconds)
[Screen 1: Agent Activity Panel]
Agent reasoning stream appears:

```
[11:43:21] Alert received: High memory usage - Redis
[11:43:24] Querying Loki for Redis logs (last 5 minutes)...
[11:43:27] Querying Prometheus for memory metrics...
[11:43:32] Analysis complete
           Root cause: Memory limit insufficient for current load
           Confidence: 92%
           Recommended action: Increase memory limit
[11:43:35] Matched runbook: oom-prevention-v1
[11:43:37] Executing: Increase Redis memory limit (512MB -> 768MB)
```
[Screen 2: GitLab]

New issue appears:

```
#849 - Incident: HighMemoryUsage - Redis
Status: Investigating
Labels: incident, severity::warning, service::redis, aiops-managed
```
Timeline updates in real-time with agent actions.
### Act 4: Remediation & Resolution (30 seconds)
[Screen 1: Agent Activity Panel]

```
[11:43:42] Waiting for container to stabilize...
[11:43:52] Health check passed
[11:43:54] Verifying success: Memory usage now at 68%
[11:43:56] Incident resolved autonomously
```
[Screen 1: Dashboard]
- Services: 26/26 (still 100%)
- Active Incidents: 0
- Recent Incidents: +1 (auto-resolved in 35s)
[Screen 2: GitLab]

Issue #849 updated:
- Status: Resolved

Comment added:

```
Incident Resolved
Total Duration: 35 seconds
Recovery Method: Automated (Runbook oom-prevention-v1)
Root Cause: Memory limit insufficient
Preventive Measure: Increased limit 512MB -> 768MB
```

Issue closed automatically.
### Act 5: Proof & Differentiation (15 seconds)
Presenter: "That's it. 35 seconds from failure to fix, with no human intervention. Let's look at the audit trail."
[Screen 2: Click into GitLab issue #849]
Show:
- Complete timeline of agent actions
- Root cause analysis
- Remediation steps with outputs
- Success verification
Presenter: "Every action logged, every decision explained. This is what
governance-ready AIOps looks like."
## Chaos Injection API

File: `chaos-api/main.py`
```python
from fastapi import FastAPI
import subprocess

app = FastAPI(title="Chaos Engineering API")


@app.post("/chaos/memory-fill/{service}")
async def chaos_memory_fill(service: str, target_usage: float = 0.95):
    """Fill container memory to trigger an OOM alert."""
    # MemAvailable in /proc/meminfo is reported in kB, so the awk program
    # multiplies by 1024 and scales by target_usage to get a byte count.
    # The \\" and \\$ escapes survive the Python string so the inner shell
    # (not the outer single-quoted one) resolves them for awk.
    cmd = f"""
    docker exec {service} sh -c '
    apt-get update && apt-get install -y stress
    stress --vm 1 --vm-bytes $(awk "/MemAvailable/ {{printf \\"%d\\", \\$2 * 1024 * {target_usage}}}" /proc/meminfo) --timeout 60s &
    '
    """
    subprocess.Popen(cmd, shell=True)
    return {
        "status": "injecting",
        "service": service,
        "type": "memory_fill",
        "target_usage": f"{target_usage * 100:.0f}%",
    }


@app.post("/chaos/kill/{service}")
async def chaos_kill(service: str):
    """Stop container to trigger a restart alert."""
    subprocess.run(["docker", "stop", service])
    return {"status": "killed", "service": service}


@app.post("/chaos/latency/{service}")
async def chaos_latency(service: str, delay_ms: int = 1000):
    """Inject network latency via tc/netem."""
    cmd = f"docker exec {service} tc qdisc add dev eth0 root netem delay {delay_ms}ms"
    subprocess.run(cmd, shell=True)
    return {"status": "latency_injected", "delay_ms": delay_ms}


@app.post("/chaos/clear/{service}")
async def chaos_clear(service: str):
    """Clear all chaos injections."""
    subprocess.run(f"docker exec {service} pkill stress", shell=True)
    subprocess.run(f"docker exec {service} tc qdisc del dev eth0 root", shell=True)
    return {"status": "cleared", "service": service}
```
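The awk one-liner in the memory-fill endpoint packs the whole byte calculation into shell quoting, which makes it easy to get wrong. As a sanity check, here is the same arithmetic as a standalone Python sketch (the `target_bytes` helper is illustrative, not part of the API):

```python
def target_bytes(meminfo_line: str, target_usage: float = 0.95) -> int:
    """Compute the stress --vm-bytes value from a /proc/meminfo MemAvailable line."""
    # Line format: "MemAvailable:    1048576 kB" -> field 2 is a kB count,
    # so bytes = kB * 1024, scaled by the requested usage fraction.
    kb = int(meminfo_line.split()[1])
    return int(kb * 1024 * target_usage)

print(target_bytes("MemAvailable:    1000 kB", 0.5))  # -> 512000
```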
## Implementation Roadmap

### Phase 1: Foundation (Week 1)
Objective: Core observability + event pipeline working
Tasks:
- Deploy Redpanda (Kafka)
- Build Kafka bridge (FastAPI webhook -> Redpanda)
- Configure Grafana contact points (webhook to bridge)
- Create 5 baseline alert rules (tier 0)
- Test end-to-end: Alert fires -> Kafka topic populated
Deliverable: Alert can trigger Kafka message
Validation:

```bash
# Fire test alert
curl -X POST http://grafana:3000/api/v1/alerts/test

# Verify Kafka message
kafka-console-consumer --bootstrap-server redpanda:9092 --topic sre-critical --from-beginning
```
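The bridge's core job is mapping a Grafana webhook alert onto a Kafka topic. A minimal sketch of that routing, assuming a severity-label scheme (the `TIER_TOPICS` mapping and the `topic_for` name are illustrative; only `sre-critical` appears in the validation above):

```python
# Assumed tier-to-topic mapping; adjust to match the real alert rules.
TIER_TOPICS = {
    "critical": "sre-critical",  # tier 0: agent may auto-remediate
    "warning": "sre-warning",
    "info": "sre-info",
}

def topic_for(alert: dict) -> str:
    """Pick the Kafka topic from a Grafana webhook alert's severity label."""
    severity = alert.get("labels", {}).get("severity", "info")
    return TIER_TOPICS.get(severity, "sre-info")

alert = {"labels": {"alertname": "HighMemoryUsage", "severity": "critical", "service": "redis"}}
print(topic_for(alert))  # -> sre-critical
```

The actual producer call (e.g. via confluent-kafka or aiokafka) is omitted; the point is that routing stays a pure, testable function.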
### Phase 2: Agent Core (Week 2)
Objective: Minimal agent can consume events and log to GitLab
Tasks:
- Set up GitLab project: `aiops-incidents`
- Implement GitLab MCP client
- Build LangGraph agent skeleton
- Kafka consumer loop
- Deploy agent container
Deliverable: Alert -> GitLab issue created
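The Phase 2 deliverable boils down to translating an alert into a GitLab issue. A hedged sketch of a payload builder (the `title`/`labels`/`description` field names follow GitLab's create-issue REST API; the label scheme mirrors issue #849 from the demo script; `issue_payload` itself is a hypothetical helper):

```python
def issue_payload(alert: dict) -> dict:
    """Build a GitLab create-issue payload from a Grafana-style alert."""
    labels = alert["labels"]
    return {
        "title": f"Incident: {labels['alertname']} - {labels['service'].capitalize()}",
        # GitLab's API accepts labels as a comma-separated string.
        "labels": ",".join([
            "incident",
            f"severity::{labels['severity']}",
            f"service::{labels['service']}",
            "aiops-managed",
        ]),
        "description": "Status: Investigating\n\nAuto-created by the SRE agent.",
    }

payload = issue_payload({"labels": {"alertname": "HighMemoryUsage", "service": "redis", "severity": "warning"}})
print(payload["title"])  # -> Incident: HighMemoryUsage - Redis
```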
### Phase 3: Context Enrichment (Weeks 2-3)
Objective: Agent queries Grafana for logs/metrics
Tasks:
- Implement Grafana MCP client (Loki + Prometheus)
- Add `enrich_context` node to agent
- Add `analyze_root_cause` node (LLM integration)
- Update GitLab issues with analysis
Deliverable: Agent posts analysis to GitLab
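Between analysis and action the agent needs a guardrail: act autonomously only when the model is confident. A sketch of that gate (the 0.8 threshold and the `next_step` name are assumptions for illustration; in the demo, the 92% confidence clears it):

```python
# Assumed cutoff below which the agent escalates instead of acting.
CONFIDENCE_THRESHOLD = 0.8

def next_step(analysis: dict) -> str:
    """Route to autonomous remediation or human escalation."""
    if analysis["confidence"] >= CONFIDENCE_THRESHOLD and analysis.get("recommended_action"):
        return "execute_remediation"
    return "escalate_to_human"

print(next_step({"confidence": 0.92, "recommended_action": "increase_memory_limit"}))
# -> execute_remediation
```

In a LangGraph agent this would be the conditional edge out of the analysis node; keeping it a pure function makes the escalation policy easy to unit-test and audit.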
### Phase 4: Runbook System (Week 3)
Objective: Agent can execute runbooks
Tasks:
- Define runbook schema (DSL)
- Create 3 runbooks
- Implement RunbookRegistry
- Add `match_runbook` node
- Add `execute_remediation` node (Docker MCP)
- Implement Docker MCP client
Deliverable: Agent executes runbook, remediates issue
### Phase 5: Demo Polish (Week 4)
Objective: Production-ready demo
Tasks:
- Build chaos injection API
- Create demo dashboards
- Add SSE stream for agent thoughts
- Build simple UI for agent activity feed
- Implement escalation flow
- Add Slack notifications (tier 2 alerts)
- Record demo video (2-minute loop)
- Create sales deck with screenshots
Deliverable: Repeatable 2-minute demo
Validation Checklist:
- Can inject chaos via API
- Alert fires within 10 seconds
- GitLab issue created within 15 seconds
- Remediation completes within 60 seconds
- Issue auto-closed with RCA
- Dashboard shows incident in history
- Can repeat demo 3x without manual cleanup
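Among the Phase 5 tasks, the SSE stream for agent thoughts is worth pinning down early since the UI depends on it. A minimal sketch of the frame format (the `thought` event name and JSON payload shape are assumptions):

```python
import json

def sse_event(timestamp: str, message: str) -> str:
    """Encode one agent reasoning line as a Server-Sent Events frame."""
    payload = json.dumps({"ts": timestamp, "message": message})
    # SSE frames are "event: <name>\ndata: <payload>\n\n".
    return f"event: thought\ndata: {payload}\n\n"

frame = sse_event("11:43:21", "Alert received: High memory usage - Redis")
print(frame, end="")
```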
### Phase 6: Voice Integration (Week 5)
Objective: Add voice-triggered incidents (bonus)
Tasks:
- Implement Voice Gateway service
- Integrate with LiveKit
- Add Whisper STT for voice commands
- Voice demo: "Hey Ravenhelm, kill Redis"
Deliverable: Voice-controlled chaos demo
## Success Metrics

### Technical Metrics
| Metric | Target |
|---|---|
| MTTR | <60 seconds for tier 0 alerts |
| Detection Latency | <10 seconds |
| Investigation Duration | <30 seconds |
| Remediation Success Rate | >85% for known patterns |
| Escalation Rate | <20% of incidents |
### Demo Metrics
| Metric | Target |
|---|---|
| Time to Wow | <30 seconds |
| Demo Success Rate | 95% (no failures in front of prospects) |
| Repeatability | 5x in a row without intervention |
## Docker Compose (AIOps Stack)

File: `docker-compose.aiops.yml`
```yaml
version: '3.8'

services:
  redpanda:
    image: vectorized/redpanda:latest
    container_name: redpanda
    command:
      - redpanda start
      - --smp 1
      - --memory 1G
      - --overprovisioned
      - --kafka-addr PLAINTEXT://0.0.0.0:9092
      - --advertise-kafka-addr PLAINTEXT://redpanda:9092
    ports:
      - "9092:9092"
      - "9644:9644"
    volumes:
      - redpanda-data:/var/lib/redpanda/data
    networks:
      - ravenhelm

  kafka-bridge:
    build:
      context: ./kafka-bridge
      dockerfile: Dockerfile
    container_name: kafka-bridge
    environment:
      KAFKA_BROKER: redpanda:9092
    ports:
      - "8080:8080"
    depends_on:
      - redpanda
    networks:
      - ravenhelm

  sre-agent:
    build:
      context: ./sre-agent
      dockerfile: Dockerfile
    container_name: sre-agent
    environment:
      KAFKA_BROKER: redpanda:9092
      GRAFANA_URL: http://grafana:3000
      GRAFANA_TOKEN: ${GRAFANA_API_TOKEN}
      GITLAB_URL: https://gitlab.ravenhelm.dev
      GITLAB_TOKEN: ${GITLAB_API_TOKEN}
      GITLAB_PROJECT_ID: aiops-incidents
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - ./runbooks:/app/runbooks:ro
    depends_on:
      - redpanda
      - kafka-bridge
    networks:
      - ravenhelm

  chaos-api:
    build:
      context: ./chaos-api
      dockerfile: Dockerfile
    container_name: chaos-api
    ports:
      - "8081:8080"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    networks:
      - ravenhelm

volumes:
  redpanda-data:

networks:
  ravenhelm:
    external: true
```
## Environment Variables

```bash
# Grafana
GRAFANA_API_TOKEN=your_grafana_token_here

# GitLab
GITLAB_API_TOKEN=your_gitlab_token_here
GITLAB_PROJECT_ID=aiops-incidents

# Anthropic
ANTHROPIC_API_KEY=your_anthropic_key_here

# Slack (optional)
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL

# PagerDuty (optional)
PAGERDUTY_INTEGRATION_KEY=your_pagerduty_key_here
```
Return to: [[AIOps-Platform]] - Main documentation index