
Observability

Complete monitoring, logging, and tracing stack for RavenmaskOS.


Overview

The observability stack collects metrics, logs, and traces from every service and surfaces them in Grafana, giving visibility into system health, performance, and behavior.

┌──────────────────────────────────────────────────────────────────┐
│                          OBSERVABILITY                           │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────┐   │
│  │   Grafana   │◀───│ Prometheus  │    │        Loki         │   │
│  │ (Dashboards)│    │  (Metrics)  │    │       (Logs)        │   │
│  └─────────────┘    └─────────────┘    └─────────────────────┘   │
│         ▲                  ▲                      ▲              │
│         │                  │                      │              │
│         │           ┌──────┴──────┐               │              │
│         │           │    Alloy    │───────────────┘              │
│         │           │ (Collector) │                              │
│         │           └──────┬──────┘                              │
│         │                  │                                     │
│         ▼                  ▼                                     │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────┐   │
│  │    Tempo    │    │  Exporters  │    │     Uptime Kuma     │   │
│  │  (Traces)   │    │  (node/pg/  │    │    (Status Page)    │   │
│  └─────────────┘    │  redis/cA)  │    └─────────────────────┘   │
│                     └─────────────┘                              │
└──────────────────────────────────────────────────────────────────┘

Services

| Service                   | Purpose                      | URL                   |
|---------------------------|------------------------------|-----------------------|
| Grafana                   | Dashboards & visualization   | grafana.ravenhelm.dev |
| Grafana Dashboard Catalog | Complete dashboard reference | -                     |
| Prometheus                | Metrics collection           | Internal (9090)       |
| Loki                      | Log aggregation              | Internal (3100)       |
| Tempo                     | Distributed tracing          | Internal (3200)       |
| Alloy                     | Telemetry collector          | Internal              |
| Uptime Kuma               | Status monitoring            | status.ravenhelm.dev  |
| Exporters                 | Metrics exporters            | Various               |

Data Flow

Services → Alloy → Loki (logs)
                 → Prometheus (metrics)
                 → Tempo (traces)

Grafana

Alerts → n8n → Slack

Quick Access

Grafana Dashboards

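| Dashboard               | Purpose                   |
|-------------------------|---------------------------|
| Infrastructure Overview | System health at a glance |
| Docker Containers       | Container metrics         |
| PostgreSQL              | Database performance      |
| Redis                   | Cache statistics          |
| Traefik                 | Request metrics           |
| Node Metrics            | Host system stats         |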

Common Queries

Loki (LogQL):

# All log lines containing "error" (set the query time range to the last hour)
{job="containers"} |= "error" | json

# Norns agent logs
{container="norns-agent"}

# Traefik access logs
{container="traefik"} | json | status >= 400
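
The LogQL selectors above carry no time range of their own. In Grafana the range comes from the time picker; over Loki's HTTP API it is passed as `start` and `end` parameters in nanoseconds since the Unix epoch. The following is a minimal sketch, assuming Loki is reachable on the internal port 3100 listed in the Services table. The function name `loki_query_range_url` and its defaults are illustrative, not part of the stack.

```python
from urllib.parse import urlencode

NANOS_PER_SECOND = 1_000_000_000


def loki_query_range_url(base_url, logql, end_s, window_s=3600, limit=100):
    """Build a /loki/api/v1/query_range URL covering window_s seconds before end_s.

    end_s is a Unix timestamp in whole seconds; Loki expects nanoseconds.
    """
    params = {
        "query": logql,
        "start": str((end_s - window_s) * NANOS_PER_SECOND),
        "end": str(end_s * NANOS_PER_SECOND),
        "limit": str(limit),
    }
    # urlencode escapes the braces, quotes, and pipes that LogQL relies on.
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


# Hypothetical usage: errors in the hour before a fixed timestamp.
url = loki_query_range_url(
    "http://localhost:3100",
    '{job="containers"} |= "error" | json',
    end_s=1_700_000_000,
)
```

Passing `int(time.time())` as `end_s` gives the "last hour" window; a fixed timestamp is used here only to keep the example deterministic.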

Prometheus (PromQL):

# Container CPU usage
rate(container_cpu_usage_seconds_total[5m])

# Memory usage as a fraction of the container's memory limit (+Inf when no limit is set)
container_memory_usage_bytes / container_spec_memory_limit_bytes

# HTTP request rate
rate(traefik_entrypoint_requests_total[5m])
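
PromQL expressions like the ones above contain parentheses and brackets, so they need URL encoding before being sent to the Prometheus HTTP API, and the sample values come back as strings. The sketch below assumes the internal Prometheus port 9090 from the Services table; the helper names and the sample response are hypothetical and only illustrate the shape of an instant-query result.

```python
from urllib.parse import urlencode


def prom_query_url(base_url, promql):
    """Build an instant-query URL for /api/v1/query with the expression escaped."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(response):
    """Return (labels, value) pairs from an instant-query vector response."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    data = response["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample is {"metric": {...labels}, "value": [timestamp, "value-as-string"]}.
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


url = prom_query_url("http://localhost:9090", "rate(container_cpu_usage_seconds_total[5m])")

# Made-up response for illustration; real label sets depend on the exporter.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"name": "grafana"}, "value": [1700000000, "0.25"]},
            {"metric": {"name": "loki"}, "value": [1700000000, "+Inf"]},
        ],
    },
}
pairs = parse_vector(sample)
```

`float()` accepts the `"+Inf"` and `"NaN"` strings Prometheus can return, so ratio queries against an unset limit do not break parsing.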

Quick Commands

# Check all observability services
for svc in grafana prometheus loki tempo alloy uptime-kuma; do
  echo "=== $svc ==="
  docker ps --filter "name=$svc" --format "{{.Names}}: {{.Status}}"
done

# View Grafana logs
docker logs -f grafana

# Query Prometheus
curl -s 'http://localhost:9090/api/v1/query?query=up' | jq

# Query Loki
curl -s 'http://localhost:3100/loki/api/v1/labels' | jq
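
The `up` query above returns one sample per scrape target, with value `1` for reachable targets and `0` for unreachable ones. As a rough sketch of how that JSON could be reduced to a list of failing targets, assuming the standard `job` and `instance` labels are present (the function name and the sample data are hypothetical):

```python
def down_targets(up_response):
    """List "job/instance" names whose `up` sample is 0 in an instant-query response."""
    down = []
    for sample in up_response["data"]["result"]:
        # Values arrive as strings, e.g. [1700000000, "0"].
        if float(sample["value"][1]) == 0.0:
            labels = sample["metric"]
            down.append(f"{labels.get('job', '?')}/{labels.get('instance', '?')}")
    return sorted(down)


# Made-up response for illustration only.
example = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"job": "node", "instance": "host:9100"}, "value": [1700000000, "1"]},
            {"metric": {"job": "redis", "instance": "redis:9121"}, "value": [1700000000, "0"]},
        ],
    },
}
failing = down_targets(example)
```

The parsed output of the `curl ... | jq` command can be fed to this function after loading it with `json.loads`.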

Alerting

Alerts are configured in Grafana and routed to:

  • Slack (#alerts channel)
  • n8n workflows for escalation
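
How the n8n workflow formats messages is not described here. As an illustration only, the sketch below assumes Grafana delivers alerts through a webhook contact point whose payload carries an `alerts` list with `status`, `labels`, and `annotations` fields, and that Slack receives a simple `{"text": ...}` body as used by incoming webhooks. The function name and the `severity` label are assumptions.

```python
def slack_message(payload):
    """Turn a webhook-style alert payload into a Slack message body."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = alert.get("status", "unknown").upper()
        name = labels.get("alertname", "unnamed alert")
        severity = labels.get("severity", "unknown")
        line = f"[{status}] {name} (severity: {severity})"
        summary = annotations.get("summary", "")
        if summary:
            line += f": {summary}"
        lines.append(line)
    return {"text": "\n".join(lines) or "No alerts in payload"}


# Hypothetical payload for illustration.
message = slack_message(
    {
        "alerts": [
            {
                "status": "firing",
                "labels": {"alertname": "High Memory", "severity": "warning"},
                "annotations": {"summary": "Container memory > 80%"},
            }
        ]
    }
)
```

In an n8n workflow the same mapping would typically live in a code or set node between the webhook trigger and the Slack node.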

Active Alert Rules

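| Rule               | Severity | Condition                 |
|--------------------|----------|---------------------------|
| Service Down       | Critical | Instance unreachable > 1m |
| High Memory        | Warning  | Container memory > 80%    |
| High CPU           | Warning  | Container CPU > 90% for 5m |
| Disk Space         | Warning  | Disk usage > 85%          |
| Certificate Expiry | Warning  | Cert expires < 7 days     |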
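
Grafana evaluates these rules itself; the sketch below only illustrates what the thresholds and the "for" durations mean, under the assumption that a duration condition fires once the threshold has been breached continuously for that long. All function and parameter names are hypothetical, and ratios are expressed as fractions (0.80 for 80%).

```python
def breached_for(samples, predicate, duration_s):
    """True when predicate has held on every sample for at least duration_s.

    samples is a list of (timestamp_s, value) pairs, oldest first. The check
    walks back from the newest sample until the predicate first fails.
    """
    if not samples:
        return False
    newest_ts = samples[-1][0]
    since = None
    for ts, value in reversed(samples):
        if not predicate(value):
            break
        since = ts
    return since is not None and newest_ts - since >= duration_s


def active_alerts(up_samples, cpu_samples, memory_ratio, disk_ratio, cert_days_left):
    """Return (rule, severity) pairs for the thresholds in the table above."""
    alerts = []
    if breached_for(up_samples, lambda v: v == 0, 60):
        alerts.append(("Service Down", "Critical"))
    if memory_ratio > 0.80:
        alerts.append(("High Memory", "Warning"))
    if breached_for(cpu_samples, lambda v: v > 0.90, 300):
        alerts.append(("High CPU", "Warning"))
    if disk_ratio > 0.85:
        alerts.append(("Disk Space", "Warning"))
    if cert_days_left < 7:
        alerts.append(("Certificate Expiry", "Warning"))
    return alerts


# Illustrative data: CPU above 90% for a full five minutes, disk at 90%.
cpu_hot = [(t, 0.95) for t in range(0, 301, 60)]
up_ok = [(0, 1), (60, 1)]
result = active_alerts(up_ok, cpu_hot, memory_ratio=0.5, disk_ratio=0.9, cert_days_left=30)
```

A single dip below the threshold resets the clock, which is why a brief CPU spike does not trigger the High CPU rule.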