Observability
Complete monitoring, logging, and tracing stack for RavenmaskOS.
Overview
The observability stack collects metrics, logs, and traces from the platform's services and hosts, and surfaces them in Grafana to give visibility into system health, performance, and behavior.
┌──────────────────────────────────────────────────────────────────┐
│                          OBSERVABILITY                           │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────┐   │
│  │   Grafana   │◀───│ Prometheus  │    │        Loki         │   │
│  │ (Dashboards)│    │  (Metrics)  │    │       (Logs)        │   │
│  └─────────────┘    └─────────────┘    └─────────────────────┘   │
│         ▲                  ▲                      ▲              │
│         │                  │                      │              │
│         │           ┌──────┴──────┐               │              │
│         │           │    Alloy    │───────────────┘              │
│         │           │ (Collector) │                              │
│         │           └──────┬──────┘                              │
│         │                  │                                     │
│         ▼                  ▼                                     │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────┐   │
│  │    Tempo    │    │  Exporters  │    │     Uptime Kuma     │   │
│  │  (Traces)   │    │  (node/pg/  │    │    (Status Page)    │   │
│  └─────────────┘    │  redis/cA)  │    └─────────────────────┘   │
│                     └─────────────┘                              │
└──────────────────────────────────────────────────────────────────┘
Services
| Service | Purpose | URL |
|---|---|---|
| Grafana | Dashboards & visualization | grafana.ravenhelm.dev |
| Grafana Dashboard Catalog | Complete dashboard reference | - |
| Prometheus | Metrics collection | Internal (9090) |
| Loki | Log aggregation | Internal (3100) |
| Tempo | Distributed tracing | Internal (3200) |
| Alloy | Telemetry collector | Internal |
| Uptime Kuma | Status monitoring | status.ravenhelm.dev |
| Exporters | Metrics exporters | Various |
Data Flow
Services → Alloy → Loki (logs)
                 → Prometheus (metrics)
                 → Tempo (traces)
                        ↓
                     Grafana
                        ↓
               Alerts → n8n → Slack
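
To exercise the last hop of the flow above (Alerts → n8n → Slack) without waiting for a real incident, one option is to post a hand-built alert to the n8n webhook. The sketch below is illustrative only: the `N8N_WEBHOOK_URL` variable and both function names are hypothetical, and the payload merely imitates the general shape of a Grafana webhook notification (a `status` plus an `alerts` array carrying `labels` and `annotations`). It is not a copy of the deployed contact point.

```shell
#!/usr/bin/env bash
# Build a minimal JSON body shaped like a Grafana webhook notification.
# Arguments: alert name, severity, summary. Values must not contain
# double quotes or backslashes; this sketch does no JSON escaping.
build_test_alert() {
  printf '{"status":"firing","alerts":[{"status":"firing","labels":{"alertname":"%s","severity":"%s"},"annotations":{"summary":"%s"}}]}\n' \
    "$1" "$2" "$3"
}

# POST the test alert to n8n. N8N_WEBHOOK_URL is a hypothetical variable
# that must point at the webhook node of the escalation workflow.
send_test_alert() {
  build_test_alert "$@" |
    curl -fsS -X POST -H 'Content-Type: application/json' \
      --data-binary @- "${N8N_WEBHOOK_URL:?set N8N_WEBHOOK_URL first}"
}
```

With the variable set, `send_test_alert "Service Down" critical "test from the docs"` should surface as a Slack message if the workflow is wired as the flow describes.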
Quick Access
Grafana Dashboards
| Dashboard | Purpose |
|---|---|
| Infrastructure Overview | System health at a glance |
| Docker Containers | Container metrics |
| PostgreSQL | Database performance |
| Redis | Cache statistics |
| Traefik | Request metrics |
| Node Metrics | Host system stats |
Common Queries
Loki (LogQL):
# All lines containing "error" (select the last hour with the time range; the query itself has none)
{job="containers"} |= "error" | json
# Norns agent logs
{container="norns-agent"}
# Traefik access logs
{container="traefik"} | json | status >= 400
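
LogQL selectors like the ones above carry no time range of their own; outside Grafana, the range is supplied through the `start` and `end` parameters of Loki's `query_range` endpoint. A minimal sketch of running such a query over the last hour follows. It assumes Loki is reachable on port 3100 from where the command runs; `LOKI_URL` and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Print "<start> <end>" as nanosecond Unix timestamps for a window that
# ends at $1 (epoch seconds) and spans $2 seconds.
loki_window() {
  echo "$(( $1 - $2 ))000000000 ${1}000000000"
}

# Run a LogQL query over the last N seconds (default: 3600, one hour).
# LOKI_URL is a hypothetical override; the default uses port 3100.
loki_query_range() {
  local query="$1" seconds="${2:-3600}" window
  window=$(loki_window "$(date +%s)" "$seconds")
  curl -sG "${LOKI_URL:-http://localhost:3100}/loki/api/v1/query_range" \
    --data-urlencode "query=$query" \
    --data-urlencode "start=${window% *}" \
    --data-urlencode "end=${window#* }" \
    --data-urlencode "limit=100"
}
```

For example, `loki_query_range '{job="containers"} |= "error" | json' | jq '.data.result | length'` would report how many streams matched in the last hour.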
Prometheus (PromQL):
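# Container CPU usage (CPU cores consumed, averaged over 5m)
rate(container_cpu_usage_seconds_total[5m])
# Memory usage as a fraction of the container's limit (+Inf when the limit is 0, i.e. no limit set)
container_memory_usage_bytes / container_spec_memory_limit_bytes
# HTTP request rate (requests per second, averaged over 5m)
rate(traefik_entrypoint_requests_total[5m])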
Quick Commands
# Check all observability services
for svc in grafana prometheus loki tempo alloy uptime-kuma; do
echo "=== $svc ==="
docker ps --filter "name=$svc" --format "{{.Names}}: {{.Status}}"
done
# View Grafana logs
docker logs -f grafana
# Query Prometheus
curl -s 'http://localhost:9090/api/v1/query?query=up' | jq
# Query Loki
curl -s 'http://localhost:3100/loki/api/v1/labels' | jq
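
`docker ps` only shows that a container is running, not that the service inside it is ready. As a complement, the sketch below probes the readiness endpoints that Prometheus, Loki, and Tempo expose upstream. The ports are taken from the Services table; that they are reachable on `localhost`, and the function names, are assumptions.

```shell
#!/usr/bin/env bash
# Map a service name to its readiness URL (ports from the Services table).
health_url() {
  case "$1" in
    prometheus) echo "http://localhost:9090/-/ready" ;;
    loki)       echo "http://localhost:3100/ready" ;;
    tempo)      echo "http://localhost:3200/ready" ;;
    *)          return 1 ;;
  esac
}

# Probe each named service and print one status line per service.
check_health() {
  local svc url
  for svc in "$@"; do
    if ! url=$(health_url "$svc"); then
      echo "$svc: no readiness endpoint known"
      continue
    fi
    # -f makes curl fail on HTTP errors such as 503 (not ready yet).
    if curl -fsS --max-time 3 -o /dev/null "$url"; then
      echo "$svc: ready"
    else
      echo "$svc: NOT ready ($url)"
    fi
  done
}
```

Running `check_health prometheus loki tempo` prints one line per service.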
Alerting
Alerts are configured in Grafana and routed to:
- Slack (#alerts channel)
- n8n workflows for escalation
Active Alert Rules
| Rule | Severity | Condition |
|---|---|---|
| Service Down | Critical | Instance unreachable > 1m |
| High Memory | Warning | Container memory > 80% |
| High CPU | Warning | Container CPU > 90% for 5m |
| Disk Space | Warning | Disk usage > 85% |
| Certificate Expiry | Warning | Certificate expires in < 7 days |
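
The conditions in the table can be spot-checked by hand against Prometheus. The sketch below maps four of the rules to candidate PromQL expressions and evaluates one on demand. The expressions are assumptions built from the metrics used elsewhere on this page plus standard node exporter filesystem metrics; they are not the deployed Grafana rules, and the rule keys and function names are hypothetical. Certificate expiry is left out because the page does not say which exporter supplies certificate data.

```shell
#!/usr/bin/env bash
# Print a candidate PromQL expression for a rule from the table above.
# The "for" durations (1m, 5m) belong in the rule definition, not here.
rule_expr() {
  case "$1" in
    service-down) echo 'up == 0' ;;
    # Containers without a memory limit report 0 and are filtered out.
    high-memory)  echo 'container_memory_usage_bytes / (container_spec_memory_limit_bytes > 0) > 0.8' ;;
    # Assumption: 0.9 means 90% of one CPU core, not of a configured limit.
    high-cpu)     echo 'rate(container_cpu_usage_seconds_total[5m]) > 0.9' ;;
    disk-space)   echo '1 - node_filesystem_avail_bytes / node_filesystem_size_bytes > 0.85' ;;
    *)            return 1 ;;
  esac
}

# Evaluate one rule expression now; a non-empty result means it would match.
check_rule() {
  local expr
  expr=$(rule_expr "$1") || { echo "unknown rule: $1" >&2; return 1; }
  curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode "query=$expr"
}
```

For instance, `check_rule disk-space | jq '.data.result'` lists the filesystems currently above the threshold, or an empty array if none are.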