Demo Choreography & Implementation Roadmap

[[TOC]]

Demo Script (2-Minute Full Cycle)

Objective: Show autonomous incident detection -> investigation -> remediation -> GitLab audit trail

Prerequisites

  • Grafana dashboard open on screen 1
  • GitLab issues board open on screen 2
  • Terminal with chaos injection API ready

Act 1: Baseline (30 seconds)

[Screen 1: Platform Health Dashboard]
Presenter: "This is Ravenhelm, our AI-powered platform. Everything's green:
26 services running, zero incidents in the last hour."

[Show KPI panel]
- Services: 26/26 (100%)
- Active Incidents: 0
- MTTR: 43s avg

Act 2: Chaos Injection (15 seconds)

[Terminal]
Presenter: "Let me introduce a realistic failure: Redis memory pressure."

Command:
$ curl -X POST http://localhost:8081/chaos/memory-fill/redis

Output:
{
  "status": "injecting",
  "service": "redis",
  "type": "memory_fill",
  "target_usage": "95%"
}

[Screen 1: Dashboard updates]
- Redis memory usage spikes to 94%
- Error rate panel shows uptick
- Alert fires: "HighMemoryUsage"

Act 3: Agent Detection & Investigation (45 seconds)

[Screen 1: Agent Activity Panel]
Agent reasoning stream appears:

[11:43:21] Alert received: High memory usage - Redis
[11:43:24] Querying Loki for Redis logs (last 5 minutes)...
[11:43:27] Querying Prometheus for memory metrics...
[11:43:32] Analysis complete
Root cause: Memory limit insufficient for current load
Confidence: 92%
Recommended action: Increase memory limit
[11:43:35] Matched runbook: oom-prevention-v1
[11:43:37] Executing: Increase Redis memory limit (512MB -> 768MB)

[Screen 2: GitLab]
New issue appears:
#849 - Incident: HighMemoryUsage - Redis
Status: Investigating
Labels: incident, severity::warning, service::redis, aiops-managed

Timeline updates in real-time with agent actions.

Act 4: Remediation & Resolution (30 seconds)

[Screen 1: Agent Activity Panel]
[11:43:42] Waiting for container to stabilize...
[11:43:52] Health check passed
[11:43:54] Verifying success: Memory usage now at 68%
[11:43:56] Incident resolved autonomously

[Screen 1: Dashboard]
- Services: 26/26 (still 100%)
- Active Incidents: 0
- Recent Incidents: +1 (auto-resolved in 35s)

[Screen 2: GitLab]
Issue #849 updated:
Status: Resolved
Comment added:
"Incident Resolved
Total Duration: 35 seconds
Recovery Method: Automated (Runbook oom-prevention-v1)
Root Cause: Memory limit insufficient
Preventive Measure: Increased limit 512MB -> 768MB"

Issue closed automatically.

Act 5: Proof & Differentiation (15 seconds)

Presenter: "That's it. 35 seconds from failure to fix, with no human intervention.
Let's look at the audit trail."

[Screen 2: Click into GitLab issue #849]

Show:
- Complete timeline of agent actions
- Root cause analysis
- Remediation steps with outputs
- Success verification

Presenter: "Every action logged, every decision explained. This is what
governance-ready AIOps looks like."

Chaos Injection API

File: chaos-api/main.py

from fastapi import FastAPI
import subprocess

app = FastAPI(title="Chaos Engineering API")

@app.post("/chaos/memory-fill/{service}")
async def chaos_memory_fill(service: str, target_usage: float = 0.95):
    """Fill container memory to trigger OOM alert."""
    # Install stress in the target container, then allocate a fraction of
    # MemAvailable (kB * 1024 = bytes) for 60 seconds in the background.
    cmd = f"""
docker exec {service} sh -c '
  apt-get update && apt-get install -y stress
  stress --vm 1 --vm-bytes $(awk "/MemAvailable/ {{printf \\"%d\\", \\$2 * 1024 * {target_usage}}}" < /proc/meminfo) --timeout 60s &
'
"""
    subprocess.Popen(cmd, shell=True)

    return {
        "status": "injecting",
        "service": service,
        "type": "memory_fill",
        "target_usage": f"{target_usage * 100:.0f}%",
    }

@app.post("/chaos/kill/{service}")
async def chaos_kill(service: str):
    """Stop container to trigger restart alert."""
    subprocess.run(["docker", "stop", service])
    return {"status": "killed", "service": service}

@app.post("/chaos/latency/{service}")
async def chaos_latency(service: str, delay_ms: int = 1000):
    """Inject network latency via tc/netem inside the container."""
    cmd = f"docker exec {service} tc qdisc add dev eth0 root netem delay {delay_ms}ms"
    subprocess.run(cmd, shell=True)
    return {"status": "latency_injected", "delay_ms": delay_ms}

@app.post("/chaos/clear/{service}")
async def chaos_clear(service: str):
    """Clear all chaos injections."""
    subprocess.run(f"docker exec {service} pkill stress", shell=True)
    subprocess.run(f"docker exec {service} tc qdisc del dev eth0 root", shell=True)
    return {"status": "cleared", "service": service}

Implementation Roadmap

Phase 1: Foundation (Week 1)

Objective: Core observability + event pipeline working

Tasks:

  • Deploy Redpanda (Kafka)
  • Build Kafka bridge (FastAPI webhook -> Redpanda)
  • Configure Grafana contact points (webhook to bridge)
  • Create 5 baseline alert rules (tier 0)
  • Test end-to-end: Alert fires -> Kafka topic populated

Deliverable: Alert can trigger Kafka message

Validation:

# Fire test alert
curl -X POST http://grafana:3000/api/v1/alerts/test

# Verify Kafka message
kafka-console-consumer --bootstrap-server redpanda:9092 --topic sre-critical --from-beginning
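The bridge's core job is mapping incoming Grafana webhook payloads to Redpanda topics. A minimal sketch of that routing logic follows; only the `sre-critical` topic appears above, so the other topic names and the tier-to-severity mapping are assumptions for illustration:

```python
# Sketch of the kafka-bridge routing step: pick a Kafka topic for a
# Grafana webhook alert based on its severity label. Topic names other
# than "sre-critical" (used in the validation above) are assumed.
SEVERITY_TOPICS = {
    "critical": "sre-critical",
    "warning": "sre-warning",
}
DEFAULT_TOPIC = "sre-info"  # fallback for unlabeled alerts

def route_alert(alert: dict) -> str:
    """Return the destination topic for one alert from Grafana's webhook payload."""
    severity = alert.get("labels", {}).get("severity", "").lower()
    return SEVERITY_TOPICS.get(severity, DEFAULT_TOPIC)
```

The FastAPI webhook handler would call this per alert and publish each payload to the returned topic with a Kafka producer.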

Phase 2: Agent Core (Week 2)

Objective: Minimal agent can consume events and log to GitLab

Tasks:

  • Set up GitLab project: aiops-incidents
  • Implement GitLab MCP client
  • Build LangGraph agent skeleton
  • Kafka consumer loop
  • Deploy agent container

Deliverable: Alert -> GitLab issue created
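The GitLab MCP client ultimately needs to turn an alert into an issue payload matching the conventions in the demo (title and labels of issue #849 above). A hedged sketch of that mapping, assuming the alert carries `alertname`, `service`, and `severity` labels:

```python
def build_incident_issue(alert: dict) -> dict:
    """Build a GitLab issue payload for a firing alert, following the
    demo's conventions ("Incident: <alert> - <Service>" plus scoped labels).
    The label names on the incoming alert are assumptions."""
    labels = alert.get("labels", {})
    name = labels.get("alertname", "UnknownAlert")
    service = labels.get("service", "unknown")
    severity = labels.get("severity", "warning")
    return {
        "title": f"Incident: {name} - {service.capitalize()}",
        "labels": f"incident,severity::{severity},service::{service},aiops-managed",
        "description": alert.get("annotations", {}).get("summary", ""),
    }
```

The resulting dict can be POSTed to GitLab's `POST /projects/:id/issues` REST endpoint (`labels` is a comma-separated string in that API).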


Phase 3: Context Enrichment (Week 2-3)

Objective: Agent queries Grafana for logs/metrics

Tasks:

  • Implement Grafana MCP client (Loki + Prometheus)
  • Add enrich_context node to agent
  • Add analyze_root_cause node (LLM integration)
  • Update GitLab issues with analysis

Deliverable: Agent posts analysis to GitLab
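For the log side of enrichment, the agent can hit Loki's `/loki/api/v1/query_range` endpoint (the "last 5 minutes" query shown in Act 3). A sketch of the parameter construction; the `container` stream label is an assumption that depends on how log shipping is configured:

```python
import time

def loki_query_params(service: str, minutes: int = 5, limit: int = 100) -> dict:
    """Build query parameters for Loki's /loki/api/v1/query_range endpoint
    to fetch a service's recent logs. Timestamps are nanoseconds, as Loki
    expects; the 'container' label is an assumed stream label."""
    now_ns = time.time_ns()
    return {
        "query": f'{{container="{service}"}}',
        "start": now_ns - minutes * 60 * 1_000_000_000,
        "end": now_ns,
        "limit": limit,
    }
```

The enrich_context node would pass these params to the Grafana MCP client, then feed the returned log lines to the analyze_root_cause LLM call.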


Phase 4: Runbook System (Week 3)

Objective: Agent can execute runbooks

Tasks:

  • Define runbook schema (DIS)
  • Create 3 runbooks
  • Implement RunbookRegistry
  • Add match_runbook node
  • Add execute_remediation node (Docker MCP)
  • Implement Docker MCP client

Deliverable: Agent executes runbook, remediates issue
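One way to sketch the RunbookRegistry and match_runbook step: index runbooks by the alert they handle, and gate execution on the root-cause confidence (92% in the demo). The field names and threshold here are assumptions, not the actual DIS schema:

```python
from dataclasses import dataclass, field

@dataclass
class Runbook:
    name: str
    alertname: str          # alert this runbook handles
    min_confidence: float   # confidence required to auto-execute
    steps: list = field(default_factory=list)

class RunbookRegistry:
    """Minimal matching sketch: look up a runbook by alert name and only
    return it when the analysis confidence clears its threshold."""
    def __init__(self, runbooks):
        self._by_alert = {rb.alertname: rb for rb in runbooks}

    def match(self, alertname: str, confidence: float):
        rb = self._by_alert.get(alertname)
        if rb and confidence >= rb.min_confidence:
            return rb
        return None  # no safe match -> escalate to a human
```

Returning `None` is what feeds the escalation flow in Phase 5: anything the registry won't own goes to a person.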


Phase 5: Demo Polish (Week 4)

Objective: Production-ready demo

Tasks:

  • Build chaos injection API
  • Create demo dashboards
  • Add SSE stream for agent thoughts
  • Build simple UI for agent activity feed
  • Implement escalation flow
  • Add Slack notifications (tier 2 alerts)
  • Record demo video (2-minute loop)
  • Create sales deck with screenshots

Deliverable: Repeatable 2-minute demo

Validation Checklist:

  • Can inject chaos via API
  • Alert fires within 10 seconds
  • GitLab issue created within 15 seconds
  • Remediation completes within 60 seconds
  • Issue auto-closed with RCA
  • Dashboard shows incident in history
  • Can repeat demo 3x without manual cleanup
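The SSE stream for agent thoughts boils down to framing each update in the Server-Sent Events wire format. A small sketch of that framing (the event name "thought" is illustrative, not a fixed protocol):

```python
import json

def format_sse(event: str, data: dict) -> str:
    """Serialize one agent-thought update in SSE wire format: an 'event:'
    line, a 'data:' line, and a blank line terminating the message."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"
```

A FastAPI endpoint can yield these strings from a generator via a streaming response with the `text/event-stream` media type, and the activity-feed UI subscribes with a browser `EventSource`.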

Phase 6: Voice Integration (Week 5)

Objective: Add voice-triggered incidents (bonus)

Tasks:

  • Implement Voice Gateway service
  • Integrate with LiveKit
  • Add Whisper STT for voice commands
  • Voice demo: "Hey Ravenhelm, kill Redis"

Deliverable: Voice-controlled chaos demo


Success Metrics

Technical Metrics

Metric                      Target
MTTR                        <60 seconds for tier 0 alerts
Detection Latency           <10 seconds
Investigation Duration      <30 seconds
Remediation Success Rate    >85% for known patterns
Escalation Rate             <20% of incidents

Demo Metrics

Metric               Target
Time to Wow          <30 seconds
Demo Success Rate    95% (no failures in front of prospects)
Repeatability        5x in a row without intervention

Docker Compose (AIOps Stack)

File: docker-compose.aiops.yml

version: '3.8'

services:
  redpanda:
    image: vectorized/redpanda:latest
    container_name: redpanda
    command:
      - redpanda start
      - --smp 1
      - --memory 1G
      - --overprovisioned
      - --kafka-addr PLAINTEXT://0.0.0.0:9092
      - --advertise-kafka-addr PLAINTEXT://redpanda:9092
    ports:
      - "9092:9092"
      - "9644:9644"
    volumes:
      - redpanda-data:/var/lib/redpanda/data
    networks:
      - ravenhelm

  kafka-bridge:
    build:
      context: ./kafka-bridge
      dockerfile: Dockerfile
    container_name: kafka-bridge
    environment:
      KAFKA_BROKER: redpanda:9092
    ports:
      - "8080:8080"
    depends_on:
      - redpanda
    networks:
      - ravenhelm

  sre-agent:
    build:
      context: ./sre-agent
      dockerfile: Dockerfile
    container_name: sre-agent
    environment:
      KAFKA_BROKER: redpanda:9092
      GRAFANA_URL: http://grafana:3000
      GRAFANA_TOKEN: ${GRAFANA_API_TOKEN}
      GITLAB_URL: https://gitlab.ravenhelm.dev
      GITLAB_TOKEN: ${GITLAB_API_TOKEN}
      GITLAB_PROJECT_ID: aiops-incidents
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - ./runbooks:/app/runbooks:ro
    depends_on:
      - redpanda
      - kafka-bridge
    networks:
      - ravenhelm

  chaos-api:
    build:
      context: ./chaos-api
      dockerfile: Dockerfile
    container_name: chaos-api
    ports:
      - "8081:8080"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    networks:
      - ravenhelm

volumes:
  redpanda-data:

networks:
  ravenhelm:
    external: true

Environment Variables

# Grafana
GRAFANA_API_TOKEN=your_grafana_token_here

# GitLab
GITLAB_API_TOKEN=your_gitlab_token_here
GITLAB_PROJECT_ID=aiops-incidents

# Anthropic
ANTHROPIC_API_KEY=your_anthropic_key_here

# Slack (optional)
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL

# PagerDuty (optional)
PAGERDUTY_INTEGRATION_KEY=your_pagerduty_key_here

Return to: [[AIOps-Platform]] - Main documentation index