Observability

Complete monitoring, logging, and tracing stack for RavenmaskOS.

Overview

The observability stack collects metrics, logs, and traces from every service and surfaces them in Grafana, so system health, performance, and behavior can be inspected in one place.

┌──────────────────────────────────────────────────────────────────┐
│                        OBSERVABILITY                              │
├──────────────────────────────────────────────────────────────────┤
│                                                                   │
│   ┌─────────────┐    ┌─────────────┐    ┌─────────────────────┐  │
│   │   Grafana   │◀───│ Prometheus  │    │       Loki          │  │
│   │ (Dashboards)│    │  (Metrics)  │    │      (Logs)         │  │
│   └─────────────┘    └─────────────┘    └─────────────────────┘  │
│          ▲                  ▲                     ▲               │
│          │                  │                     │               │
│          │           ┌──────┴──────┐              │               │
│          │           │    Alloy    │──────────────┘               │
│          │           │ (Collector) │                              │
│          │           └──────┬──────┘                              │
│          │                  │                                     │
│          ▼                  ▼                                     │
│   ┌─────────────┐    ┌─────────────┐    ┌─────────────────────┐  │
│   │    Tempo    │    │  Exporters  │    │    Uptime Kuma      │  │
│   │  (Traces)   │    │ (node/pg/   │    │   (Status Page)     │  │
│   └─────────────┘    │  redis/cA)  │    └─────────────────────┘  │
│                      └─────────────┘                              │
└──────────────────────────────────────────────────────────────────┘

Services

Service	Purpose	URL
Grafana	Dashboards & visualization	grafana.ravenhelm.dev
Grafana Dashboard Catalog	Complete dashboard reference	-
Prometheus	Metrics collection	Internal (9090)
Loki	Log aggregation	Internal (3100)
Tempo	Distributed tracing	Internal (3200)
Alloy	Telemetry collector	Internal
Uptime Kuma	Status monitoring	status.ravenhelm.dev
Exporters	Metrics exporters (node, PostgreSQL, Redis, cAdvisor)	Various

Data Flow

Services → Alloy → Loki (logs)
                 → Prometheus (metrics)
                 → Tempo (traces)
                         ↓
                     Grafana
                         ↓
                  Alerts → n8n → Slack

Quick Access

Grafana Dashboards

Dashboard	Purpose
Infrastructure Overview	System health at a glance
Docker Containers	Container metrics
PostgreSQL	Database performance
Redis	Cache statistics
Traefik	Request metrics
Node Metrics	Host system stats

Common Queries

Loki (LogQL):

# Lines containing "error" (set the time range to the last hour in Grafana)
{job="containers"} |= "error" | json

# Norns agent logs
{container="norns-agent"}

# Traefik access logs
{container="traefik"} | json | status >= 400
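The LogQL queries above can also be run outside Grafana through Loki's HTTP API. The following is a minimal sketch, assuming Loki is reachable on localhost:3100 as in the Quick Commands section. The function name `loki_query` and the `LOKI_URL` and `CURL` variables are hypothetical helpers, not part of the stack. When `start` and `end` are omitted, `query_range` should default to the last hour.

```shell
# Hypothetical helper: run a LogQL query against Loki's query_range endpoint.
# Assumption: Loki listens on localhost:3100 (override with LOKI_URL).
LOKI_URL="${LOKI_URL:-http://localhost:3100}"

loki_query() {
  # $1 = LogQL expression, $2 = maximum number of log lines (default 100)
  # CURL is a hypothetical override, handy for dry runs.
  ${CURL:-curl} -sG "${LOKI_URL}/loki/api/v1/query_range" \
    --data-urlencode "query=$1" \
    --data-urlencode "limit=${2:-100}"
}
```

For example, `loki_query '{container="traefik"} | json | status >= 400' 20 | jq -r '.data.result[].values[][1]'` prints only the matching log lines. Using `--data-urlencode` avoids hand-escaping the braces, quotes, and pipes in the query.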

Prometheus (PromQL):

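# Container CPU usage (CPU cores consumed, averaged over 5m)
rate(container_cpu_usage_seconds_total[5m])

# Memory usage as a fraction of the container limit (+Inf when no limit is set)
container_memory_usage_bytes / container_spec_memory_limit_bytes

# HTTP request rate (requests per second at the Traefik entrypoints)
rate(traefik_entrypoint_requests_total[5m])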
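PromQL expressions containing brackets or braces need URL encoding when sent to the Prometheus HTTP API. The sketch below wraps the instant-query endpoint, assuming Prometheus is reachable on localhost:9090. `prom_query`, `PROM_URL`, and `CURL` are hypothetical names introduced for illustration.

```shell
# Hypothetical helper: run an instant PromQL query via the Prometheus HTTP API.
# Assumption: Prometheus listens on localhost:9090 (override with PROM_URL).
PROM_URL="${PROM_URL:-http://localhost:9090}"

prom_query() {
  # $1 = PromQL expression; curl URL-encodes it for us.
  # CURL is a hypothetical override, handy for dry runs.
  ${CURL:-curl} -sG "${PROM_URL}/api/v1/query" \
    --data-urlencode "query=$1"
}
```

For example, `prom_query 'rate(traefik_entrypoint_requests_total[5m])' | jq '.data.result[] | {metric, value: .value[1]}'` lists each series with its current value.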

Quick Commands

# Check all observability services
for svc in grafana prometheus loki tempo alloy uptime-kuma; do
  echo "=== $svc ==="
  docker ps --filter "name=$svc" --format "{{.Names}}: {{.Status}}"
done

# View Grafana logs
docker logs -f grafana

# Query Prometheus
curl -s 'http://localhost:9090/api/v1/query?query=up' | jq

# Query Loki
curl -s 'http://localhost:3100/loki/api/v1/labels' | jq
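A container that shows as "Up" in `docker ps` may still be starting. As a complement, the sketch below probes the readiness endpoints that Prometheus (`/-/ready`), Loki (`/ready`), and Tempo (`/ready`) expose. It assumes the ports from the Services table are reachable on localhost, which may not hold for Tempo in this deployment. `ready_url`, `check_ready`, and `CURL` are hypothetical names.

```shell
# Hypothetical helper: map a service name to its readiness URL.
# Assumption: ports 9090/3100/3200 are reachable on localhost.
ready_url() {
  case "$1" in
    prometheus) echo "http://localhost:9090/-/ready" ;;
    loki)       echo "http://localhost:3100/ready" ;;
    tempo)      echo "http://localhost:3200/ready" ;;
    *)          return 1 ;;
  esac
}

# Report whether a service answers its readiness probe with a 2xx status.
check_ready() {
  url=$(ready_url "$1") || { echo "$1: no readiness URL known"; return 1; }
  # -f makes curl fail on HTTP errors; CURL is a hypothetical override.
  if ${CURL:-curl} -fsS -o /dev/null --max-time 5 "$url"; then
    echo "$1: ready"
  else
    echo "$1: NOT ready"
  fi
}
```

Running `for svc in prometheus loki tempo; do check_ready "$svc"; done` prints one status line per service.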

Alerting

Alerts are configured in Grafana and routed to:

Slack (#alerts channel)
n8n workflows for escalation

Active Alert Rules

Rule	Severity	Condition
Service Down	Critical	Instance unreachable > 1m
High Memory	Warning	Container memory > 80%
High CPU	Warning	Container CPU > 90% for 5m
Disk Space	Warning	Disk usage > 85%
Certificate Expiry	Warning	Cert expires < 7 days
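Several of the conditions in the table can be approximated as PromQL expressions. The sketch below is illustrative only: the actual rules live in Grafana and may be defined differently. The rule keys and the `alert_expr` function are hypothetical, the disk rule assumes the standard node exporter filesystem metrics, and the CPU rule assumes that 90% means 0.9 of one core. The "> 1m" and "for 5m" durations belong in the rule's pending period rather than in the expression. Certificate expiry is omitted because the source does not say which metric backs it.

```shell
# Hypothetical helper: print a PromQL expression approximating an alert rule.
# These expressions are assumptions, not the rules configured in Grafana.
alert_expr() {
  case "$1" in
    service-down)
      echo 'up == 0' ;;
    high-memory)
      # "> 0" drops containers without a memory limit (avoids +Inf ratios).
      echo 'container_memory_usage_bytes / (container_spec_memory_limit_bytes > 0) > 0.80' ;;
    high-cpu)
      # Assumes 90% means 0.9 of a single CPU core.
      echo 'rate(container_cpu_usage_seconds_total[5m]) > 0.90' ;;
    disk-space)
      # Assumes node exporter filesystem metrics.
      echo '1 - node_filesystem_avail_bytes / node_filesystem_size_bytes > 0.85' ;;
    *)
      echo "unknown rule: $1" >&2
      return 1 ;;
  esac
}
```

An expression can be tried out by pasting the output of, say, `alert_expr disk-space` into Grafana Explore or the Prometheus query UI; any series returned is currently violating the condition.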
