Ravenhelm AIOps Platform
Version: 1.0.0
Status: Production (Phase 1 Complete)
Owner: Nate Walker
Executive Summary
This documentation describes the architecture for turning Ravenhelm from a personal homelab into a demonstrable, AI-powered self-healing platform. The target system autonomously detects, investigates, and remediates incidents while keeping a full audit trail in GitLab. Detection, discovery, and incident tracking are live today; automated remediation is still planned (see Implementation Status).
Implementation Status
| Component | Status | Description |
|---|---|---|
| Database Schema | ✅ Complete | 18 AIOps tables in PostgreSQL |
| Alert Engine | ✅ Complete | Grafana/Alertmanager webhook ingestion |
| CMDB Discovery | ✅ Complete | Docker, Prometheus, Traefik auto-discovery |
| Recommendations | ✅ Complete | Auto-suggested monitoring per entity type |
| GitLab Integration | ✅ Complete | Incident tracking with issue creation |
| Admin UI | ✅ Complete | Full dashboard in Bifrost |
| Workflow Automation | 🔄 In Progress | n8n workflow orchestration |
| Self-Healing | 📋 Planned | Automated remediation runbooks |
Core Value Proposition
- Autonomous incident response in <60 seconds
- Full GitLab integration for compliance/audit trails
- Transparent AI reasoning visible in real-time
- Extensible runbook system via Domain Intelligence Schema
- Production-ready telephony integration (AudioHook/Genesys)
Live System Overview
Current Inventory
| Metric | Count |
|---|---|
| Docker Containers Discovered | 52 |
| Prometheus Targets | 6 |
| Traefik Services | 29 |
| Registered Agents | 3 |
| Monitoring Recommendations | 4 |
| Entity Types | 6 |
Access Points
| Service | URL |
|---|---|
| AIOps Dashboard | https://bifrost.ravenhelm.dev/aiops |
| CMDB Browser | https://bifrost.ravenhelm.dev/cmdb |
| Recommendations | https://bifrost.ravenhelm.dev/recommendations |
| Agents | https://bifrost.ravenhelm.dev/agents |
| API | https://bifrost-api.ravenhelm.dev |
Architecture
┌─────────────────────────────────────────────────────────────────────────────────┐
│                              Bifrost AIOps Module                               │
├─────────────────────────────────────────────────────────────────────────────────┤
│                                                                                 │
│   ┌──────────────────┐   ┌──────────────────┐   ┌──────────────────┐            │
│   │   Alert Engine   │   │       CMDB       │   │      GitLab      │            │
│   │                  │   │                  │   │   Integration    │            │
│   │ • Grafana hooks  │   │ • Docker disc.   │   │                  │            │
│   │ • Alertmanager   │   │ • Prometheus     │   │ • Issue creation │            │
│   │ • Routing rules  │   │ • Traefik        │   │ • Timeline audit │            │
│   │ • State mgmt     │   │ • Recommendations│   │ • Auto-resolve   │            │
│   └────────┬─────────┘   └────────┬─────────┘   └────────┬─────────┘            │
│            │                      │                      │                      │
│            └──────────────────────┼──────────────────────┘                      │
│                                   │                                             │
│                           ┌───────▼───────┐                                     │
│                           │  PostgreSQL   │                                     │
│                           │  ravenmaskos  │                                     │
│                           └───────────────┘                                     │
└─────────────────────────────────────────────────────────────────────────────────┘
             │                      │                      │
             ▼                      ▼                      ▼
   Grafana/Alertmanager  Docker/Prometheus/Traefik     GitLab CE
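To make the Alert Engine's ingestion step more concrete, here is a minimal sketch of how an Alertmanager webhook body might be flattened into one record per alert before it is stored. The payload fields (`alerts`, `labels`, `status`, `startsAt`, `fingerprint`) follow Alertmanager's standard webhook format; the `NormalizedAlert` class, its field names, and the `normalize_alertmanager_payload` function are assumptions made for illustration and are not taken from the Bifrost codebase.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class NormalizedAlert:
    """Hypothetical internal record; the real AIOps schema is not shown in this document."""
    fingerprint: str
    name: str
    severity: str
    state: str  # "firing" or "resolved"
    started_at: str


def normalize_alertmanager_payload(raw: str) -> List[NormalizedAlert]:
    """Flatten an Alertmanager webhook body into one record per alert."""
    payload = json.loads(raw)
    records = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        records.append(
            NormalizedAlert(
                fingerprint=alert.get("fingerprint", ""),
                name=labels.get("alertname", "unknown"),
                severity=labels.get("severity", "unknown"),
                # Fall back to the group-level status if the alert has none.
                state=alert.get("status", payload.get("status", "firing")),
                started_at=alert.get("startsAt", ""),
            )
        )
    return records


# Illustrative payload; the alert name and values are made up.
sample = json.dumps({
    "version": "4",
    "status": "firing",
    "alerts": [{
        "status": "firing",
        "labels": {"alertname": "ContainerDown", "severity": "critical"},
        "annotations": {"summary": "container exited"},
        "startsAt": "2024-01-01T00:00:00Z",
        "fingerprint": "abc123",
    }],
})
alerts = normalize_alertmanager_payload(sample)
```

Keying records on the fingerprint is one plausible basis for the state management shown in the diagram, since Alertmanager sends the same fingerprint when an alert later resolves.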
Documentation Index
| Section | Description |
|---|---|
| [[Infrastructure/Bifrost]] | Bifrost service documentation (API, Admin UI, Config) |
| [[AIOps-Architecture]] | System topology, data flow, technology stack |
| [[AIOps-Service-Inventory]] | 26 services across 7 categories with log taxonomy |
| [[AIOps-Observability-Stack]] | Alloy, Loki, Prometheus, Grafana configuration |
| [[AIOps-Dashboard-Strategy]] | Dashboard hierarchy, panels, and queries |
| [[AIOps-Alert-Rules]] | Tier 0/1/2 alerts with routing policies |
| [[AIOps-Agent-Architecture]] | LangGraph state machine, Kafka consumer |
| [[AIOps-GitLab-Integration]] | MCP client, incident tracking, audit trails |
| [[AIOps-Runbook-Registry]] | DIS schema, runbook examples |
| [[AIOps-Demo-Roadmap]] | Demo choreography, implementation phases |
API Endpoints
Alert Management
# List alerts
GET /api/v1/aiops/alerts
# Acknowledge alert
POST /api/v1/aiops/alerts/{id}/acknowledge
# Resolve alert
POST /api/v1/aiops/alerts/{id}/resolve
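As a rough illustration of calling these endpoints from a script, the sketch below builds the three requests with Python's standard library. The base URL comes from the Access Points table; the `build_alert_request` helper is hypothetical, and authentication is omitted because this page does not describe it.

```python
from typing import Optional
from urllib.parse import quote
from urllib.request import Request

BASE_URL = "https://bifrost-api.ravenhelm.dev"  # "API" row of the Access Points table


def build_alert_request(action: str, alert_id: Optional[str] = None) -> Request:
    """Build (but do not send) a request for one of the alert endpoints.

    Hypothetical helper: action is "list", "acknowledge", or "resolve".
    """
    if action == "list":
        path, method = "/api/v1/aiops/alerts", "GET"
    elif action in ("acknowledge", "resolve"):
        if alert_id is None:
            raise ValueError(f"{action} requires an alert id")
        # Percent-encode the id so it cannot alter the path structure.
        path = f"/api/v1/aiops/alerts/{quote(alert_id, safe='')}/{action}"
        method = "POST"
    else:
        raise ValueError(f"unknown action: {action}")
    return Request(BASE_URL + path, method=method,
                   headers={"Accept": "application/json"})


ack = build_alert_request("acknowledge", "42")
listing = build_alert_request("list")
```

Passing either object to `urllib.request.urlopen` would send it; the response format is not documented here, so parsing is left out.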
CMDB Operations
# List entities
GET /api/v1/cmdb/entities?type=container
# Get entity recommendations
GET /api/v1/cmdb/entities/{id}/recommendations
# Apply recommendation
POST /api/v1/cmdb/entities/{id}/recommendations/{rec_id}/apply
# Trigger discovery
POST /api/v1/cmdb/discovery/trigger
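One way these CMDB calls might be sequenced is: trigger discovery, read each entity's recommendations, then apply the chosen ones. The sketch below only plans that sequence as (method, path) pairs using the paths listed above. The helper names and the shape of the `pending` mapping (entity id to recommendation ids) are assumptions for illustration, not part of the Bifrost API.

```python
from typing import Dict, List, Tuple
from urllib.parse import quote, urlencode

CMDB_PREFIX = "/api/v1/cmdb"


def entities_path(entity_type: str = "") -> str:
    """Path for listing entities, optionally filtered by type."""
    base = f"{CMDB_PREFIX}/entities"
    return f"{base}?{urlencode({'type': entity_type})}" if entity_type else base


def apply_path(entity_id: str, rec_id: str) -> str:
    """Path for applying one recommendation to one entity."""
    return (f"{CMDB_PREFIX}/entities/{quote(entity_id, safe='')}"
            f"/recommendations/{quote(rec_id, safe='')}/apply")


def rollout_plan(pending: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Order the calls: discovery first, then review and apply per entity."""
    plan = [("POST", f"{CMDB_PREFIX}/discovery/trigger")]
    for entity_id, rec_ids in pending.items():
        plan.append(("GET", f"{CMDB_PREFIX}/entities/"
                            f"{quote(entity_id, safe='')}/recommendations"))
        plan.extend(("POST", apply_path(entity_id, rec_id)) for rec_id in rec_ids)
    return plan


# "e1" and "r1" are placeholder ids, not real CMDB identifiers.
plan = rollout_plan({"e1": ["r1"]})
```

Separating planning from sending keeps the sequence easy to review or log before any change is made to the CMDB.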
AIOps Overview
# Combined stats, alerts, incidents, executions
GET /api/v1/aiops/overview
# GitLab integration config
GET /api/v1/aiops/gitlab/config
Quick Reference
Technology Stack
| Layer | Technologies |
|---|---|
| Infrastructure | Docker, Traefik, PostgreSQL, Redis |
| Identity/AuthZ | Zitadel, SPIRE, OpenFGA |
| Observability | Grafana, Loki, Prometheus, Tempo, Alloy |
| AI/Agents | LangGraph, Ollama, Claude API, Langfuse |
| Automation | n8n, Bifrost API |
| DevOps | GitLab, MCP Servers |
Implementation Timeline
| Phase | Focus | Status |
|---|---|---|
| 1 | Database Schema & Core Services | ✅ Complete |
| 2 | Discovery Engine (Docker/Prom/Traefik) | ✅ Complete |
| 3 | Admin UI Dashboard | ✅ Complete |
| 4 | Workflow Automation (n8n) | 🔄 In Progress |
| 5 | Self-Healing Runbooks | 📋 Planned |
| 6 | Voice Integration | 📋 Planned |
Success Metrics
- MTTR: <60 seconds for tier 0 alerts (target)
- Detection Latency: <10 seconds (target)
- Discovery Coverage: 100% of Docker/Prometheus/Traefik
- Recommendations Applied: 4 built-in templates
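The two latency targets can be checked with simple arithmetic over incident timestamps. The sketch below assumes MTTR is measured from detection to resolution and detection latency from occurrence to detection; this page does not define either interval, so treat both as assumptions. The field names and sample timestamps are invented for illustration and are not real measurements.

```python
from datetime import datetime, timedelta
from statistics import mean

MTTR_TARGET = timedelta(seconds=60)       # tier 0 target from the list above
DETECTION_TARGET = timedelta(seconds=10)  # detection latency target


def mean_interval(incidents, start_key, end_key):
    """Average the elapsed time between two timestamps across incidents."""
    seconds = mean((i[end_key] - i[start_key]).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


t0 = datetime(2024, 1, 1, 12, 0, 0)
# Invented sample data: two incidents with hypothetical field names.
incidents = [
    {"occurred": t0,
     "detected": t0 + timedelta(seconds=4),
     "resolved": t0 + timedelta(seconds=49)},
    {"occurred": t0 + timedelta(minutes=5),
     "detected": t0 + timedelta(minutes=5, seconds=8),
     "resolved": t0 + timedelta(minutes=5, seconds=71)},
]

mttr = mean_interval(incidents, "detected", "resolved")
detection_latency = mean_interval(incidents, "occurred", "detected")
meets_targets = mttr < MTTR_TARGET and detection_latency < DETECTION_TARGET
```

Note that a mean can hide individual breaches: the second sample incident takes 63 seconds to resolve, yet the average stays under the target, so a percentile may be a more honest measure in practice.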