# Demo Choreography & Implementation Roadmap
[[TOC]]
## Demo Script (2-Minute Full Cycle)
Objective: Show autonomous incident detection -> investigation -> remediation -> GitLab audit trail
### Prerequisites
- Grafana dashboard open on screen 1
- GitLab issues board open on screen 2
- Terminal with chaos injection API ready
### Act 1: Baseline (30 seconds)
[Screen 1: Platform Health Dashboard]

Presenter: "This is Ravenhelm, our AI-powered platform. Everything's green: 26 services running, zero incidents in the last hour."
[Show KPI panel]
- Services: 26/26 (100%)
- Active Incidents: 0
- MTTR: 43s avg
### Act 2: Chaos Injection (15 seconds)
[Terminal]
Presenter: "Let me introduce a realistic failure: Redis memory pressure."
Command:

```bash
curl -X POST http://localhost:8081/chaos/memory-fill/redis
```

Output:

```json
{
  "status": "injecting",
  "service": "redis",
  "type": "memory_fill",
  "target_usage": "95%"
}
```
[Screen 1: Dashboard updates]
- Redis memory usage spikes to 94%
- Error rate panel shows uptick
- Alert fires: "HighMemoryUsage"
### Act 3: Agent Detection & Investigation (45 seconds)
[Screen 1: Agent Activity Panel]
Agent reasoning stream appears:

```
[11:43:21] Alert received: High memory usage - Redis
[11:43:24] Querying Loki for Redis logs (last 5 minutes)...
[11:43:27] Querying Prometheus for memory metrics...
[11:43:32] Analysis complete
           Root cause: Memory limit insufficient for current load
           Confidence: 92%
           Recommended action: Increase memory limit
[11:43:35] Matched runbook: oom-prevention-v1
[11:43:37] Executing: Increase Redis memory limit (512MB -> 768MB)
```
[Screen 2: GitLab]

New issue appears:

```
#849 - Incident: HighMemoryUsage - Redis
Status: Investigating
Labels: incident, severity::warning, service::redis, aiops-managed
```
Timeline updates in real-time with agent actions.
### Act 4: Remediation & Resolution (30 seconds)
[Screen 1: Agent Activity Panel]

```
[11:43:42] Waiting for container to stabilize...
[11:43:52] Health check passed
[11:43:54] Verifying success: Memory usage now at 68%
[11:43:56] Incident resolved autonomously
```
[Screen 1: Dashboard]
- Services: 26/26 (still 100%)
- Active Incidents: 0
- Recent Incidents: +1 (auto-resolved in 35s)
[Screen 2: GitLab]

Issue #849 updated:
- Status: Resolved

Comment added:

```
Incident Resolved
Total Duration: 35 seconds
Recovery Method: Automated (Runbook oom-prevention-v1)
Root Cause: Memory limit insufficient
Preventive Measure: Increased limit 512MB -> 768MB
```

Issue closed automatically.
### Act 5: Proof & Differentiation (15 seconds)
Presenter: "That's it. 35 seconds from failure to fix, with no human intervention. Let's look at the audit trail."
[Screen 2: Click into GitLab issue #849]
Show:
- Complete timeline of agent actions
- Root cause analysis
- Remediation steps with outputs
- Success verification
Presenter: "Every action logged, every decision explained. This is what
governance-ready AIOps looks like."
## Chaos Injection API

File: `chaos-api/main.py`
```python
from fastapi import FastAPI
import subprocess

app = FastAPI(title="Chaos Engineering API")


@app.post("/chaos/memory-fill/{service}")
async def chaos_memory_fill(service: str, target_usage: float = 0.95):
    """Fill container memory to trigger an OOM alert."""
    # MemAvailable in /proc/meminfo is reported in kB, so the awk program
    # multiplies by 1024 and scales by target_usage to get a byte count.
    # The \\" and \\$ escapes survive the Python string so the inner shell
    # (not the outer single-quoted one) resolves them for awk.
    cmd = f"""
    docker exec {service} sh -c '
    apt-get update && apt-get install -y stress
    stress --vm 1 --vm-bytes $(awk "/MemAvailable/ {{printf \\"%d\\", \\$2 * 1024 * {target_usage}}}" /proc/meminfo) --timeout 60s &
    '
    """
    subprocess.Popen(cmd, shell=True)
    return {
        "status": "injecting",
        "service": service,
        "type": "memory_fill",
        "target_usage": f"{target_usage * 100:.0f}%",
    }


@app.post("/chaos/kill/{service}")
async def chaos_kill(service: str):
    """Stop container to trigger a restart alert."""
    subprocess.run(["docker", "stop", service])
    return {"status": "killed", "service": service}


@app.post("/chaos/latency/{service}")
async def chaos_latency(service: str, delay_ms: int = 1000):
    """Inject network latency via tc/netem."""
    cmd = f"docker exec {service} tc qdisc add dev eth0 root netem delay {delay_ms}ms"
    subprocess.run(cmd, shell=True)
    return {"status": "latency_injected", "delay_ms": delay_ms}


@app.post("/chaos/clear/{service}")
async def chaos_clear(service: str):
    """Clear all chaos injections."""
    subprocess.run(f"docker exec {service} pkill stress", shell=True)
    subprocess.run(f"docker exec {service} tc qdisc del dev eth0 root", shell=True)
    return {"status": "cleared", "service": service}
```
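The awk one-liner in the memory-fill endpoint packs the whole byte calculation into shell quoting, which makes it easy to get wrong. As a sanity check, here is the same arithmetic as a standalone Python sketch (the `target_bytes` helper is illustrative, not part of the API):

```python
def target_bytes(meminfo_line: str, target_usage: float = 0.95) -> int:
    """Compute the stress --vm-bytes value from a /proc/meminfo MemAvailable line."""
    # Line format: "MemAvailable:    1048576 kB" -> field 2 is a kB count,
    # so bytes = kB * 1024, scaled by the requested usage fraction.
    kb = int(meminfo_line.split()[1])
    return int(kb * 1024 * target_usage)

print(target_bytes("MemAvailable:    1000 kB", 0.5))  # -> 512000
```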
## Implementation Roadmap

### Phase 1: Foundation (Week 1)
Objective: Core observability + event pipeline working
Tasks:
- Deploy Redpanda (Kafka)
- Build Kafka bridge (FastAPI webhook -> Redpanda)
- Configure Grafana contact points (webhook to bridge)
- Create 5 baseline alert rules (tier 0)
- Test end-to-end: Alert fires -> Kafka topic populated
Deliverable: Alert can trigger Kafka message
Validation:

```bash
# Fire test alert
curl -X POST http://grafana:3000/api/v1/alerts/test

# Verify Kafka message
kafka-console-consumer --bootstrap-server redpanda:9092 --topic sre-critical --from-beginning
```
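The bridge's core job is mapping a Grafana webhook alert onto a Kafka topic. A minimal sketch of that routing, assuming a severity-label scheme (the `TIER_TOPICS` mapping and the `topic_for` name are illustrative; only `sre-critical` appears in the validation above):

```python
# Assumed tier-to-topic mapping; adjust to match the real alert rules.
TIER_TOPICS = {
    "critical": "sre-critical",  # tier 0: agent may auto-remediate
    "warning": "sre-warning",
    "info": "sre-info",
}

def topic_for(alert: dict) -> str:
    """Pick the Kafka topic from a Grafana webhook alert's severity label."""
    severity = alert.get("labels", {}).get("severity", "info")
    return TIER_TOPICS.get(severity, "sre-info")

alert = {"labels": {"alertname": "HighMemoryUsage", "severity": "critical", "service": "redis"}}
print(topic_for(alert))  # -> sre-critical
```

The actual producer call (e.g. via confluent-kafka or aiokafka) is omitted; the point is that routing stays a pure, testable function.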
### Phase 2: Agent Core (Week 2)
Objective: Minimal agent can consume events and log to GitLab
Tasks:
- Set up GitLab project: `aiops-incidents`
- Implement GitLab MCP client
- Build LangGraph agent skeleton
- Kafka consumer loop
- Deploy agent container
Deliverable: Alert -> GitLab issue created
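The Phase 2 deliverable boils down to translating an alert into a GitLab issue. A hedged sketch of a payload builder (the `title`/`labels`/`description` field names follow GitLab's create-issue REST API; the label scheme mirrors issue #849 from the demo script; `issue_payload` itself is a hypothetical helper):

```python
def issue_payload(alert: dict) -> dict:
    """Build a GitLab create-issue payload from a Grafana-style alert."""
    labels = alert["labels"]
    return {
        "title": f"Incident: {labels['alertname']} - {labels['service'].capitalize()}",
        # GitLab's API accepts labels as a comma-separated string.
        "labels": ",".join([
            "incident",
            f"severity::{labels['severity']}",
            f"service::{labels['service']}",
            "aiops-managed",
        ]),
        "description": "Status: Investigating\n\nAuto-created by the SRE agent.",
    }

payload = issue_payload({"labels": {"alertname": "HighMemoryUsage", "service": "redis", "severity": "warning"}})
print(payload["title"])  # -> Incident: HighMemoryUsage - Redis
```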
### Phase 3: Context Enrichment (Weeks 2-3)
Objective: Agent queries Grafana for logs/metrics
Tasks:
- Implement Grafana MCP client (Loki + Prometheus)
- Add `enrich_context` node to agent
- Add `analyze_root_cause` node (LLM integration)
- Update GitLab issues with analysis
Deliverable: Agent posts analysis to GitLab
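Between analysis and action the agent needs a guardrail: act autonomously only when the model is confident. A sketch of that gate (the 0.8 threshold and the `next_step` name are assumptions for illustration; in the demo, the 92% confidence clears it):

```python
# Assumed cutoff below which the agent escalates instead of acting.
CONFIDENCE_THRESHOLD = 0.8

def next_step(analysis: dict) -> str:
    """Route to autonomous remediation or human escalation."""
    if analysis["confidence"] >= CONFIDENCE_THRESHOLD and analysis.get("recommended_action"):
        return "execute_remediation"
    return "escalate_to_human"

print(next_step({"confidence": 0.92, "recommended_action": "increase_memory_limit"}))
# -> execute_remediation
```

In a LangGraph agent this would be the conditional edge out of the analysis node; keeping it a pure function makes the escalation policy easy to unit-test and audit.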
### Phase 4: Runbook System (Week 3)
Objective: Agent can execute runbooks
Tasks:
- Define runbook schema (DSL)
- Create 3 runbooks
- Implement RunbookRegistry
- Add `match_runbook` node
- Add `execute_remediation` node (Docker MCP)
- Implement Docker MCP client
Deliverable: Agent executes runbook, remediates issue
### Phase 5: Demo Polish (Week 4)
Objective: Production-ready demo
Tasks:
- Build chaos injection API
- Create demo dashboards
- Add SSE stream for agent thoughts
- Build simple UI for agent activity feed
- Implement escalation flow
- Add Slack notifications (tier 2 alerts)
- Record demo video (2-minute loop)
- Create sales deck with screenshots
Deliverable: Repeatable 2-minute demo
Validation Checklist:
- Can inject chaos via API
- Alert fires within 10 seconds
- GitLab issue created within 15 seconds
- Remediation completes within 60 seconds
- Issue auto-closed with RCA
- Dashboard shows incident in history
- Can repeat demo 3x without manual cleanup
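Among the Phase 5 tasks, the SSE stream for agent thoughts is worth pinning down early since the UI depends on it. A minimal sketch of the frame format (the `thought` event name and JSON payload shape are assumptions):

```python
import json

def sse_event(timestamp: str, message: str) -> str:
    """Encode one agent reasoning line as a Server-Sent Events frame."""
    payload = json.dumps({"ts": timestamp, "message": message})
    # SSE frames are "event: <name>\ndata: <payload>\n\n".
    return f"event: thought\ndata: {payload}\n\n"

frame = sse_event("11:43:21", "Alert received: High memory usage - Redis")
print(frame, end="")
```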
### Phase 6: Voice Integration (Week 5)
Objective: Add voice-triggered incidents (bonus)
Tasks:
- Implement Voice Gateway service
- Integrate with LiveKit
- Add Whisper STT for voice commands
- Voice demo: "Hey Ravenhelm, kill Redis"
Deliverable: Voice-controlled chaos demo
## Success Metrics

### Technical Metrics
| Metric | Target |
|---|---|
| MTTR | <60 seconds for tier 0 alerts |
| Detection Latency | <10 seconds |
| Investigation Duration | <30 seconds |
| Remediation Success Rate | >85% for known patterns |
| Escalation Rate | <20% of incidents |
### Demo Metrics
| Metric | Target |
|---|---|
| Time to Wow | <30 seconds |
| Demo Success Rate | 95% (no failures in front of prospects) |
| Repeatability | 5x in a row without intervention |
## Docker Compose (AIOps Stack)

File: `docker-compose.aiops.yml`
```yaml
version: '3.8'

services:
  redpanda:
    image: vectorized/redpanda:latest
    container_name: redpanda
    command:
      - redpanda start
      - --smp 1
      - --memory 1G
      - --overprovisioned
      - --kafka-addr PLAINTEXT://0.0.0.0:9092
      - --advertise-kafka-addr PLAINTEXT://redpanda:9092
    ports:
      - "9092:9092"
      - "9644:9644"
    volumes:
      - redpanda-data:/var/lib/redpanda/data
    networks:
      - ravenhelm

  kafka-bridge:
    build:
      context: ./kafka-bridge
      dockerfile: Dockerfile
    container_name: kafka-bridge
    environment:
      KAFKA_BROKER: redpanda:9092
    ports:
      - "8080:8080"
    depends_on:
      - redpanda
    networks:
      - ravenhelm

  sre-agent:
    build:
      context: ./sre-agent
      dockerfile: Dockerfile
    container_name: sre-agent
    environment:
      KAFKA_BROKER: redpanda:9092
      GRAFANA_URL: http://grafana:3000
      GRAFANA_TOKEN: ${GRAFANA_API_TOKEN}
      GITLAB_URL: https://gitlab.ravenhelm.dev
      GITLAB_TOKEN: ${GITLAB_API_TOKEN}
      GITLAB_PROJECT_ID: aiops-incidents
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - ./runbooks:/app/runbooks:ro
    depends_on:
      - redpanda
      - kafka-bridge
    networks:
      - ravenhelm

  chaos-api:
    build:
      context: ./chaos-api
      dockerfile: Dockerfile
    container_name: chaos-api
    ports:
      - "8081:8080"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    networks:
      - ravenhelm

volumes:
  redpanda-data:

networks:
  ravenhelm:
    external: true
```
## Environment Variables

```bash
# Grafana
GRAFANA_API_TOKEN=your_grafana_token_here

# GitLab
GITLAB_API_TOKEN=your_gitlab_token_here
GITLAB_PROJECT_ID=aiops-incidents

# Anthropic
ANTHROPIC_API_KEY=your_anthropic_key_here

# Slack (optional)
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL

# PagerDuty (optional)
PAGERDUTY_INTEGRATION_KEY=your_pagerduty_key_here
```
Return to: [[AIOps-Platform]] - Main documentation index