Ravenhelm AIOps Platform

Version: 1.0.0
Status: Production (Phase 1 Complete)
Owner: Nate Walker


Executive Summary

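This documentation describes the architecture for turning Ravenhelm from a personal homelab into a demonstrable, AI-powered self-healing platform. The system is designed to detect, investigate, and remediate incidents autonomously while keeping a full audit trail in GitLab. Alert ingestion, CMDB discovery, and GitLab incident tracking are in place today; workflow automation is in progress and automated remediation is planned.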

Implementation Status

| Phase | Status | Description |
| --- | --- | --- |
| Database Schema | ✅ Complete | 18 AIOps tables in PostgreSQL |
| Alert Engine | ✅ Complete | Grafana/Alertmanager webhook ingestion |
| CMDB Discovery | ✅ Complete | Docker, Prometheus, Traefik auto-discovery |
| Recommendations | ✅ Complete | Auto-suggested monitoring per entity type |
| GitLab Integration | ✅ Complete | Incident tracking with issue creation |
| Admin UI | ✅ Complete | Full dashboard in Bifrost |
| Workflow Automation | 🔄 In Progress | n8n workflow orchestration |
| Self-Healing | 📋 Planned | Automated remediation runbooks |

Core Value Proposition

  • Autonomous incident response in under 60 seconds (target for tier 0 alerts)
  • Full GitLab integration for compliance/audit trails
  • Transparent AI reasoning visible in real-time
  • Extensible runbook system via Domain Intelligence Schema
  • Production-ready telephony integration (AudioHook/Genesys)

Live System Overview

Current Inventory

| Metric | Count |
| --- | --- |
| Docker Containers Discovered | 52 |
| Prometheus Targets | 6 |
| Traefik Services | 29 |
| Registered Agents | 3 |
| Monitoring Recommendations | 4 |
| Entity Types | 6 |

Access Points

| Service | URL |
| --- | --- |
| AIOps Dashboard | https://bifrost.ravenhelm.dev/aiops |
| CMDB Browser | https://bifrost.ravenhelm.dev/cmdb |
| Recommendations | https://bifrost.ravenhelm.dev/recommendations |
| Agents | https://bifrost.ravenhelm.dev/agents |
| API | https://bifrost-api.ravenhelm.dev |

Architecture

┌──────────────────────────────────────────────────────────────────────┐
│                         Bifrost AIOps Module                         │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌──────────────────┐   ┌──────────────────┐   ┌──────────────────┐  │
│  │ Alert Engine     │   │ CMDB             │   │ GitLab           │  │
│  │                  │   │                  │   │ Integration      │  │
│  │ • Grafana hooks  │   │ • Docker disc.   │   │                  │  │
│  │ • Alertmanager   │   │ • Prometheus     │   │ • Issue creation │  │
│  │ • Routing rules  │   │ • Traefik        │   │ • Timeline audit │  │
│  │ • State mgmt     │   │ • Recommendations│   │ • Auto-resolve   │  │
│  └────────┬─────────┘   └────────┬─────────┘   └────────┬─────────┘  │
│           │                      │                      │            │
│           └──────────────────────┼──────────────────────┘            │
│                                  │                                   │
│                          ┌───────▼───────┐                           │
│                          │  PostgreSQL   │                           │
│                          │  ravenmaskos  │                           │
│                          └───────────────┘                           │
└──────────────────────────────────────────────────────────────────────┘
            │                      │                      │
            ▼                      ▼                      ▼
 Grafana/Alertmanager  Docker/Prometheus/Traefik      GitLab CE
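
To make the Alert Engine's ingestion step concrete, here is a minimal sketch of flattening an Alertmanager webhook body into one record per alert. The payload fields (`alerts`, `labels`, `status`, `startsAt`, `fingerprint`) follow Alertmanager's standard webhook format. The `NormalizedAlert` record and the function name are hypothetical, since the actual Bifrost schema and handler are not shown on this page.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class NormalizedAlert:
    """Hypothetical internal record; the real AIOps table layout is not shown here."""
    fingerprint: str
    name: str
    status: str      # "firing" or "resolved", as sent by Alertmanager
    severity: str    # taken from a "severity" label, which is a convention only
    starts_at: str


def normalize_alertmanager_payload(payload: dict[str, Any]) -> list[NormalizedAlert]:
    """Flatten an Alertmanager webhook body into one record per alert."""
    records = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        records.append(
            NormalizedAlert(
                fingerprint=alert.get("fingerprint", ""),
                name=labels.get("alertname", "unknown"),
                # Fall back to the group-level status if the alert omits its own.
                status=alert.get("status", payload.get("status", "firing")),
                severity=labels.get("severity", "unknown"),
                starts_at=alert.get("startsAt", ""),
            )
        )
    return records


# Illustrative payload; the alert name and label values are made up.
sample = {
    "version": "4",
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "ContainerDown", "severity": "critical"},
            "annotations": {"summary": "container exited"},
            "startsAt": "2024-01-01T00:00:00Z",
            "fingerprint": "abc123",
        }
    ],
}

for record in normalize_alertmanager_payload(sample):
    print(record.name, record.status, record.severity)
```

Keying records on `fingerprint` is one plausible way to back the "State mgmt" box, because Alertmanager re-sends the same fingerprint when an alert moves from firing to resolved.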

Documentation Index

| Section | Description |
| --- | --- |
| [[Infrastructure/Bifrost]] | Bifrost service documentation (API, Admin UI, Config) |
| [[AIOps-Architecture]] | System topology, data flow, technology stack |
| [[AIOps-Service-Inventory]] | 26 services across 7 categories with log taxonomy |
| [[AIOps-Observability-Stack]] | Alloy, Loki, Prometheus, Grafana configuration |
| [[AIOps-Dashboard-Strategy]] | Dashboard hierarchy, panels, and queries |
| [[AIOps-Alert-Rules]] | Tier 0/1/2 alerts with routing policies |
| [[AIOps-Agent-Architecture]] | LangGraph state machine, Kafka consumer |
| [[AIOps-GitLab-Integration]] | MCP client, incident tracking, audit trails |
| [[AIOps-Runbook-Registry]] | DIS schema, runbook examples |
| [[AIOps-Demo-Roadmap]] | Demo choreography, implementation phases |

API Endpoints

Alert Management

# List alerts
GET /api/v1/aiops/alerts

# Acknowledge alert
POST /api/v1/aiops/alerts/{id}/acknowledge

# Resolve alert
POST /api/v1/aiops/alerts/{id}/resolve
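
The alert endpoints above could be wrapped in a small client along these lines. This is a sketch using only the Python standard library. It assumes the API is served from the host listed under Access Points, and it omits authentication because this page does not describe the auth scheme. The function names are illustrative, not part of Bifrost.

```python
from urllib.parse import quote
from urllib.request import Request

BASE_URL = "https://bifrost-api.ravenhelm.dev"  # API host from the Access Points table
ALERTS_PATH = "/api/v1/aiops/alerts"
ALERT_ACTIONS = ("acknowledge", "resolve")


def list_alerts_request() -> Request:
    """Build the GET request that lists alerts."""
    return Request(f"{BASE_URL}{ALERTS_PATH}", method="GET")


def alert_action_request(alert_id: str, action: str) -> Request:
    """Build the POST request that acknowledges or resolves one alert."""
    if action not in ALERT_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    safe_id = quote(str(alert_id), safe="")  # keep odd IDs from breaking the path
    return Request(f"{BASE_URL}{ALERTS_PATH}/{safe_id}/{action}", method="POST")


request = alert_action_request("42", "acknowledge")
print(request.get_method(), request.full_url)
# prints: POST https://bifrost-api.ravenhelm.dev/api/v1/aiops/alerts/42/acknowledge
```

The functions only build requests. Passing one to `urllib.request.urlopen` would send it, once whatever credentials the deployment requires have been added as headers.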

CMDB Operations

# List entities
GET /api/v1/cmdb/entities?type=container

# Get entity recommendations
GET /api/v1/cmdb/entities/{id}/recommendations

# Apply recommendation
POST /api/v1/cmdb/entities/{id}/recommendations/{rec_id}/apply

# Trigger discovery
POST /api/v1/cmdb/discovery/trigger
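
One way these CMDB endpoints might be chained is sketched below as an ordered call plan: trigger discovery, list entities of a type, fetch recommendations for one entity, then apply one recommendation. The ordering is an assumption made for illustration, and `cmdb_call_plan` is a hypothetical helper. Only the paths themselves come from the list above.

```python
from urllib.parse import quote, urlencode

CMDB_ROOT = "/api/v1/cmdb"


def cmdb_call_plan(entity_type: str, entity_id: str, rec_id: str) -> list[tuple[str, str]]:
    """Return (method, path) pairs for a discover -> inspect -> apply pass."""
    entity = quote(entity_id, safe="")
    rec = quote(rec_id, safe="")
    return [
        ("POST", f"{CMDB_ROOT}/discovery/trigger"),
        ("GET", f"{CMDB_ROOT}/entities?{urlencode({'type': entity_type})}"),
        ("GET", f"{CMDB_ROOT}/entities/{entity}/recommendations"),
        ("POST", f"{CMDB_ROOT}/entities/{entity}/recommendations/{rec}/apply"),
    ]


# "e1" and "r9" are placeholder IDs, not real CMDB records.
for method, path in cmdb_call_plan("container", "e1", "r9"):
    print(method, path)
```

Keeping the plan as plain data makes it easy to log or dry-run before any request is actually sent.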

AIOps Overview

# Combined stats, alerts, incidents, executions
GET /api/v1/aiops/overview

# GitLab integration config
GET /api/v1/aiops/gitlab/config

Quick Reference

Technology Stack

| Layer | Technologies |
| --- | --- |
| Infrastructure | Docker, Traefik, PostgreSQL, Redis |
| Identity/AuthZ | Zitadel, SPIRE, OpenFGA |
| Observability | Grafana, Loki, Prometheus, Tempo, Alloy |
| AI/Agents | LangGraph, Ollama, Claude API, Langfuse |
| Automation | n8n, Bifrost API |
| DevOps | GitLab, MCP Servers |

Implementation Timeline

| Phase | Focus | Status |
| --- | --- | --- |
| 1 | Database Schema & Core Services | ✅ Complete |
| 2 | Discovery Engine (Docker/Prom/Traefik) | ✅ Complete |
| 3 | Admin UI Dashboard | ✅ Complete |
| 4 | Workflow Automation (n8n) | 🔄 In Progress |
| 5 | Self-Healing Runbooks | 📋 Planned |
| 6 | Voice Integration | 📋 Planned |

Success Metrics

  • MTTR: <60 seconds for tier 0 alerts (target)
  • Detection Latency: <10 seconds (target)
  • Discovery Coverage: 100% of Docker/Prometheus/Traefik
  • Recommendations Applied: 4 built-in templates
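
The MTTR target can be checked with simple arithmetic over incident timestamps. The sketch below assumes MTTR is measured from detection to resolution and averaged across incidents, which this page does not define. The timestamps are made-up sample data, not measurements from Ravenhelm.

```python
from datetime import datetime, timedelta
from statistics import mean

MTTR_TARGET = timedelta(seconds=60)  # tier 0 target from the list above


def mttr(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Mean time to resolve, given (detected_at, resolved_at) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    durations = [(resolved - detected).total_seconds() for detected, resolved in incidents]
    return timedelta(seconds=mean(durations))


start = datetime(2024, 1, 1, 12, 0, 0)
sample_incidents = [
    (start, start + timedelta(seconds=30)),
    (start, start + timedelta(seconds=50)),
    (start, start + timedelta(seconds=70)),
]

result = mttr(sample_incidents)
print(result.total_seconds(), result <= MTTR_TARGET)
# prints: 50.0 True
```

A mean can hide outliers: in this sample one incident took 70 seconds even though the average meets the target, so tracking a percentile alongside MTTR may be worth considering.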