AIOps Architecture Overview

[[TOC]]

System Topology

┌─────────────────────────────────────────────────────────────────────┐
│                         Ravenhelm Platform                          │
│   38 Routers | 26 Services | 5 Middleware | Multi-tenant Identity   │
└────────────────────────────┬────────────────────────────────────────┘
                             │
        ┌────────────────────┼────────────────────┐
        │                    │                    │
┌───────▼────────┐  ┌────────▼────────┐   ┌───────▼────────┐
│  Application   │  │  Observability  │   │  AIOps Agent   │
│     Layer      │  │      Stack      │   │     Layer      │
│                │  │                 │   │                │
│ • Services (26)│  │ • Alloy/Promtail│   │ • LangGraph    │
│ • APIs         │  │ • Loki          │   │ • Kafka Bridge │
│ • Voice (3)    │  │ • Prometheus    │   │ • GitLab MCP   │
│ • Home (3)     │  │ • Tempo         │   │ • Docker MCP   │
│                │  │ • Grafana       │   │ • Runbook Exec │
└────────┬───────┘  └────────┬────────┘   └───────┬────────┘
         │                   │                    │
         └───────────────────┼────────────────────┘
                             │
                    ┌────────▼─────────┐
                    │    Event Bus     │
                    │    (Redpanda)    │
                    │                  │
                    │ Topics:          │
                    │ • sre-events     │
                    │ • agent-thoughts │
                    │ • agent-actions  │
                    └──────────────────┘

Data Flow: Alert to Resolution

1. Service Failure

2. Logs → Loki (via Alloy)
   Metrics → Prometheus

3. Grafana Alert Fires

4. Webhook → Kafka Bridge → Redpanda (sre-events topic)

5. Agent Consumes Event
   ├─→ Create GitLab Issue (incident #XXX)
   ├─→ Query Loki for logs (context enrichment)
   └─→ Query Prometheus for metrics

6. LLM Analyzes Context
   └─→ Update GitLab Issue (root cause analysis)

7. Match Runbook from Registry

8. Execute Remediation Steps
   ├─→ Update GitLab Issue (each step logged)
   └─→ Publish agent-thoughts to SSE stream

9. Verify Success
   ├─→ Success: Close GitLab Issue with RCA
   └─→ Failure: Escalate GitLab Issue, assign oncall
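
Steps 6 through 9 can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the agent's actual implementation: `Runbook`, `Incident`, `match_runbook`, and `handle` are hypothetical names, the keyword-based matching rule is invented for the example, and the in-memory `log` list stands in for GitLab issue updates. Escalating when no runbook matches is also an assumption; the flow above does not say what happens in that case.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Runbook:
    name: str
    keywords: tuple            # hypothetical rule: match if a keyword appears in the root cause
    steps: list                # callables taking no arguments, returning True on success


@dataclass
class Incident:
    alertname: str
    log: list = field(default_factory=list)   # stands in for GitLab issue comments
    state: str = "open"


def match_runbook(root_cause: str, registry: list) -> Optional[Runbook]:
    """Step 7: return the first runbook whose keywords appear in the root cause."""
    text = root_cause.lower()
    for runbook in registry:
        if any(keyword in text for keyword in runbook.keywords):
            return runbook
    return None


def handle(incident: Incident, root_cause: str, registry: list,
           verify: Callable[[], bool]) -> Incident:
    """Steps 6-9: record the analysis, run the runbook, then close or escalate."""
    incident.log.append(f"root cause: {root_cause}")           # step 6
    runbook = match_runbook(root_cause, registry)               # step 7
    if runbook is None:
        incident.log.append("no runbook matched")
        incident.state = "escalated"                            # assumed behaviour
        return incident
    ok = True
    for number, step in enumerate(runbook.steps, start=1):      # step 8
        ok = bool(step())
        result = "ok" if ok else "failed"
        incident.log.append(f"{runbook.name} step {number}: {result}")
        if not ok:
            break                                               # stop at the first failed step
    incident.state = "closed" if ok and verify() else "escalated"   # step 9
    return incident
```

For example, `handle(Incident("HighMemory"), "container OOM killed", registry, verify=check_health)` would close the incident only if every step and the final `check_health` call succeed; any other outcome leaves it escalated for a human.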

Technology Stack

| Layer | Technologies | Purpose |
|---|---|---|
| Infrastructure | Docker, Traefik, PostgreSQL, Redis, Redpanda | Core platform services |
| Identity/AuthZ | Zitadel, SPIRE, OpenFGA, OAuth2 Proxy | Authentication & authorization |
| Observability | Grafana, Loki, Prometheus, Tempo, Alloy, Uptime Kuma | Metrics, logs, traces |
| AI/Agents | LangGraph, Ollama (local), Claude API, Langfuse | Agent framework & LLM |
| Automation | n8n, Agent Foundry | Workflow automation |
| Voice | LiveKit, Whisper, Piper, Voice Gateway | Real-time voice processing |
| DevOps | GitLab, MCP Servers | Source control, CI/CD |
| Event Bus | Redpanda (Kafka-compatible) | Event streaming |
| Languages | Python 3.11+, TypeScript (MCP servers) | Implementation |

Component Interactions

Observability → Agent Flow

┌────────────────────────────────────────────────────────────┐
│                    Application Services                    │
│    (Traefik, GitLab, LangGraph, Ollama, LiveKit, etc.)     │
└──────────┬─────────────────────────────────────┬───────────┘
           │                                     │
           │ Logs                                │ Metrics
           ▼                                     ▼
┌──────────────────────┐              ┌─────────────────────┐
│   Alloy (Pipeline)   │              │     Prometheus      │
│                      │              │                     │
│ • Parse logs         │              │ • Scrape exporters  │
│ • Enrich metadata    │              │ • Evaluate rules    │
│ • Route to Loki      │              │ • Store metrics     │
└──────────┬───────────┘              └──────────┬──────────┘
           │                                     │
           ▼                                     │
┌──────────────────────┐                         │
│         Loki         │                         │
│                      │                         │
│ • Index labels       │                         │
│ • Store log chunks   │                         │
│ • Query engine       │                         │
└──────────┬───────────┘                         │
           │                                     │
           └──────────────┬──────────────────────┘
                          │
                          ▼
                ┌─────────────────────┐
                │       Grafana       │
                │                     │
                │ • Dashboards        │
                │ • Alert rules       │
                │ • Contact points    │
                └─────────┬───────────┘
                          │
                          │ Webhook (on alert)
                          ▼
                ┌─────────────────────┐
                │    Kafka Bridge     │
                │      (FastAPI)      │
                └─────────┬───────────┘
                          │
                          ▼
                ┌─────────────────────┐
                │      Redpanda       │
                │     (Event Bus)     │
                │                     │
                │ Topics:             │
                │ • sre-critical      │
                │ • sre-warning       │
                │ • sre-info          │
                └─────────────────────┘
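
The last hop of this flow, where the Kafka Bridge turns a Grafana webhook into Redpanda messages, might look roughly like the sketch below. It covers only the routing decision and leaves out the FastAPI endpoint and the Kafka producer. The three topic names come from the diagram. The reliance on a `severity` label, the fallback to `sre-info`, and the shape of the outgoing message are assumptions made for illustration, and `route_alerts` is a hypothetical helper name.

```python
import json
from typing import Any

# Topic names are taken from the diagram; keying on a "severity" label is an assumed convention.
TOPIC_BY_SEVERITY = {
    "critical": "sre-critical",
    "warning": "sre-warning",
    "info": "sre-info",
}
DEFAULT_TOPIC = "sre-info"  # assumption: unlabeled or unknown severities are informational


def route_alerts(payload: dict[str, Any]) -> list[tuple[str, bytes]]:
    """Turn one webhook payload into (topic, message) pairs, one per alert."""
    routed = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        severity = str(labels.get("severity", "")).lower()
        topic = TOPIC_BY_SEVERITY.get(severity, DEFAULT_TOPIC)
        # Hypothetical message shape; the real bridge schema is not shown in this document.
        message = json.dumps(
            {
                "alertname": labels.get("alertname", "unknown"),
                "status": alert.get("status", "firing"),
                "severity": severity or "unknown",
                "labels": labels,
                "annotations": alert.get("annotations", {}),
            },
            sort_keys=True,
        ).encode("utf-8")
        routed.append((topic, message))
    return routed
```

Keeping the routing as a pure function makes it testable without a broker: the webhook handler would call `route_alerts(payload)` and hand each pair to whichever Kafka client the bridge uses.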

Deployment Architecture

Docker Compose Structure

ravenhelm/
├── docker-compose.core.yml            # Existing services
├── docker-compose.observability.yml   # Grafana/Loki/Prometheus
├── docker-compose.aiops.yml           # New AIOps stack
│   ├── redpanda
│   ├── kafka-bridge
│   ├── sre-agent
│   └── chaos-api
└── docker-compose.override.yml        # Local dev overrides
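
Because the stack is split across several Compose files, each one has to be passed to `docker compose` with its own `-f` flag. Compose picks up `docker-compose.override.yml` automatically only when no `-f` is given, so with this layout the override file must be listed explicitly. The helper below is a hypothetical sketch that builds the argument list. The file names come from the tree above; `compose_command`, the stack keys, and the default `up -d` action are assumptions.

```python
from pathlib import PurePosixPath

# File names from the tree above; the short stack keys are illustrative.
STACK_FILES = {
    "core": "docker-compose.core.yml",
    "observability": "docker-compose.observability.yml",
    "aiops": "docker-compose.aiops.yml",
}
OVERRIDE_FILE = "docker-compose.override.yml"


def compose_command(stacks, root="ravenhelm", dev=False, action=("up", "-d")):
    """Build a `docker compose` argv that layers the selected stack files in order."""
    base = PurePosixPath(root)
    cmd = ["docker", "compose"]
    for stack in stacks:
        cmd += ["-f", str(base / STACK_FILES[stack])]
    if dev:
        # Listed last so local overrides win over the stack definitions.
        cmd += ["-f", str(base / OVERRIDE_FILE)]
    cmd += list(action)
    return cmd
```

The resulting list can be passed directly to `subprocess.run(cmd, check=True)`; for instance, `compose_command(["core", "aiops"], dev=True)` layers the core and AIOps stacks with the local overrides applied on top.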

Next: [[AIOps-Service-Inventory]] - Detailed service catalog