AIOps Architecture Overview
[[TOC]]
System Topology
┌─────────────────────────────────────────────────────────────────────┐
│                         Ravenhelm Platform                          │
│   38 Routers | 26 Services | 5 Middleware | Multi-tenant Identity   │
└────────────────────────────┬────────────────────────────────────────┘
                             │
        ┌────────────────────┼────────────────────┐
        │                    │                    │
┌───────▼────────┐  ┌────────▼────────┐   ┌───────▼────────┐
│  Application   │  │  Observability  │   │  AIOps Agent   │
│     Layer      │  │      Stack      │   │     Layer      │
│                │  │                 │   │                │
│ • Services (26)│  │ • Alloy/Promtail│   │ • LangGraph    │
│ • APIs         │  │ • Loki          │   │ • Kafka Bridge │
│ • Voice (3)    │  │ • Prometheus    │   │ • GitLab MCP   │
│ • Home (3)     │  │ • Tempo         │   │ • Docker MCP   │
│                │  │ • Grafana       │   │ • Runbook Exec │
└────────┬───────┘  └────────┬────────┘   └───────┬────────┘
         │                   │                    │
         └───────────────────┼────────────────────┘
                             │
                    ┌────────▼─────────┐
                    │    Event Bus     │
                    │    (Redpanda)    │
                    │                  │
                    │ Topics:          │
                    │ • sre-events     │
                    │ • agent-thoughts │
                    │ • agent-actions  │
                    └──────────────────┘
Data Flow: Alert to Resolution
1. Service Failure
   ↓
2. Logs → Loki (via Alloy)
   Metrics → Prometheus
   ↓
3. Grafana Alert Fires
   ↓
4. Webhook → Kafka Bridge → Redpanda (sre-events topic)
   ↓
5. Agent Consumes Event
   ├─→ Create GitLab Issue (incident #XXX)
   ├─→ Query Loki for logs (context enrichment)
   ├─→ Query Prometheus for metrics
   ↓
6. LLM Analyzes Context
   ├─→ Update GitLab Issue (root cause analysis)
   ↓
7. Match Runbook from Registry
   ↓
8. Execute Remediation Steps
   ├─→ Update GitLab Issue (each step logged)
   ├─→ Publish agent-thoughts to SSE stream
   ↓
9. Verify Success
   ├─→ Success: Close GitLab Issue with RCA
   └─→ Failure: Escalate GitLab Issue, assign oncall
Technology Stack
| Layer | Technologies | Purpose |
|---|---|---|
| Infrastructure | Docker, Traefik, PostgreSQL, Redis, Redpanda | Core platform services |
| Identity/AuthZ | Zitadel, SPIRE, OpenFGA, OAuth2 Proxy | Authentication & authorization |
| Observability | Grafana, Loki, Prometheus, Tempo, Alloy, Uptime Kuma | Metrics, logs, traces |
| AI/Agents | LangGraph, Ollama (local), Claude API, Langfuse | Agent framework & LLM |
| Automation | n8n, Agent Foundry | Workflow automation |
| Voice | LiveKit, Whisper, Piper, Voice Gateway | Real-time voice processing |
| DevOps | GitLab, MCP Servers | Source control, CI/CD |
| Event Bus | Redpanda (Kafka-compatible) | Event streaming |
| Languages | Python 3.11+, TypeScript (MCP servers) | Implementation |
Component Interactions
Observability → Agent Flow
┌────────────────────────────────────────────────────────────┐
│                    Application Services                    │
│    (Traefik, GitLab, LangGraph, Ollama, LiveKit, etc.)     │
└──────────┬─────────────────────────────────────┬───────────┘
           │                                     │
           │ Logs                                │ Metrics
           ▼                                     ▼
┌──────────────────────┐              ┌─────────────────────┐
│  Alloy (Pipeline)    │              │     Prometheus      │
│                      │              │                     │
│ • Parse logs         │              │ • Scrape exporters  │
│ • Enrich metadata    │              │ • Evaluate rules    │
│ • Route to Loki      │              │ • Store metrics     │
└──────────┬───────────┘              └──────────┬──────────┘
           │                                     │
           ▼                                     │
┌──────────────────────┐                         │
│         Loki         │                         │
│                      │                         │
│ • Index labels       │                         │
│ • Store log chunks   │                         │
│ • Query engine       │                         │
└──────────┬───────────┘                         │
           │                                     │
           └──────────────┬──────────────────────┘
                          │
                          ▼
                ┌─────────────────────┐
                │       Grafana       │
                │                     │
                │ • Dashboards        │
                │ • Alert rules       │
                │ • Contact points    │
                └─────────┬───────────┘
                          │
                          │ Webhook (on alert)
                          ▼
                ┌─────────────────────┐
                │    Kafka Bridge     │
                │      (FastAPI)      │
                └─────────┬───────────┘
                          │
                          ▼
                ┌─────────────────────┐
                │      Redpanda       │
                │     (Event Bus)     │
                │                     │
                │ Topics:             │
                │ • sre-critical      │
                │ • sre-warning       │
                │ • sre-info          │
                └─────────────────────┘
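
The last hop in the diagram, where the Kafka Bridge turns a Grafana webhook into Redpanda records, could look roughly like the sketch below. It assumes the standard Grafana webhook payload (an `alerts` list whose items carry `labels`, `annotations`, `status`, `startsAt`, and `fingerprint`) and assumes that severity arrives as a `severity` label, which is a common convention rather than something this page states. `route_alerts`, the event field names, and the fallback topic are hypothetical, and the FastAPI endpoint and the producer client are deliberately left out.

```python
import json
from datetime import datetime, timezone

# Topic names come from the diagram; the label values are an assumed convention.
TOPIC_BY_SEVERITY = {
    "critical": "sre-critical",
    "warning": "sre-warning",
    "info": "sre-info",
}
# Assumption: alerts without a recognised severity are treated as warnings
# so they are not lost among informational events.
FALLBACK_TOPIC = "sre-warning"


def route_alerts(webhook: dict) -> list[tuple[str, bytes]]:
    """Turn one Grafana webhook payload into (topic, message) pairs."""
    records = []
    for alert in webhook.get("alerts", []):
        labels = alert.get("labels", {})
        severity = str(labels.get("severity", "")).lower()
        topic = TOPIC_BY_SEVERITY.get(severity, FALLBACK_TOPIC)
        event = {
            "alertname": labels.get("alertname"),
            "status": alert.get("status"),
            "severity": severity or "unknown",
            "labels": labels,
            "annotations": alert.get("annotations", {}),
            "starts_at": alert.get("startsAt"),
            "fingerprint": alert.get("fingerprint"),
            "received_at": datetime.now(timezone.utc).isoformat(),
        }
        # JSON-encode once here so any Kafka-compatible producer can send it.
        records.append((topic, json.dumps(event, sort_keys=True).encode("utf-8")))
    return records
```

Each returned pair can be handed to whichever Kafka-compatible producer the bridge uses. Keeping the routing in a pure function means the severity mapping can be unit-tested without a broker.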
Deployment Architecture
Docker Compose Structure
ravenhelm/
├── docker-compose.core.yml            # Existing services
├── docker-compose.observability.yml   # Grafana/Loki/Prometheus
├── docker-compose.aiops.yml           # New AIOps stack
│   ├── redpanda
│   ├── kafka-bridge
│   ├── sre-agent
│   └── chaos-api
└── docker-compose.override.yml        # Local dev overrides
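
If these files are meant to be layered into one project (an assumption; this page does not say how they are launched), the stack can be started by passing each file to `docker compose` with `-f`, in order. The sketch below only builds the argument list; `compose_command` and its parameters are hypothetical names.

```python
from pathlib import PurePosixPath

# File names are taken from the tree above; the ordering is an assumption.
BASE_FILES = (
    "docker-compose.core.yml",
    "docker-compose.observability.yml",
    "docker-compose.aiops.yml",
)
OVERRIDE_FILE = "docker-compose.override.yml"


def compose_command(root="ravenhelm", *, local_dev=False, action=("up", "-d")):
    """Build a `docker compose` argument list that layers the stack files."""
    files = list(BASE_FILES)
    if local_dev:
        # Later -f files override earlier ones, so the override goes last.
        files.append(OVERRIDE_FILE)
    cmd = ["docker", "compose"]
    for name in files:
        cmd += ["-f", str(PurePosixPath(root) / name)]
    return cmd + list(action)
```

The result can be passed to `subprocess.run(..., check=True)`. Compose picks up `docker-compose.override.yml` automatically only when no `-f` flag is given, so with explicit `-f` flags the override file has to be listed as well.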
Next: [[AIOps-Service-Inventory]] - Detailed service catalog