Runbook: Vidar Troubleshooting
Overview
- What: Troubleshoot Vidar AIOps platform issues
- When: The vidar-api-down or vidar-admin-unavailable alert fires, or alerts are not being processed
- Duration: 5-30 minutes
Prerequisites
- SSH access to odin (ssh ravenhelm@100.115.101.81)
- Access to Grafana dashboards
Alert Types
| Alert | Severity | Description |
|---|---|---|
| vidar-api-down | Critical | Vidar API not receiving traffic or >90% errors |
| vidar-admin-unavailable | Warning | Vidar Admin dashboard not accessible |
Quick Checks
Step 1: Check Container Status
# On odin as ravenhelm
docker ps -a | grep vidar
Expected output shows both containers running:
vidar-api
vidar-admin
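To avoid reading the `docker ps` table by eye, the check above can be sketched as a small filter. The function name `vidar_not_running` and the tab-separated input format are assumptions made for this example, not part of the Vidar tooling.

```shell
# Hypothetical helper: reads "NAME<TAB>STATUS" lines on stdin and prints
# each expected Vidar container that is missing or not in an "Up" state.
vidar_not_running() {
  awk -F'\t' '
    BEGIN { want["vidar-api"]; want["vidar-admin"] }
    ($1 in want) && $2 ~ /^Up/ { up[$1] = 1 }
    END { for (n in want) if (!(n in up)) print n }
  ' | sort
}
```

One way to feed it is `docker ps -a --filter name=vidar --format '{{.Names}}\t{{.Status}}' | vidar_not_running`. Empty output means both containers report an Up status.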
Step 2: Check Service Health
# External health check
curl -s https://vidar-api.ravenhelm.dev/health | jq .
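# Admin dashboard: print only the HTTP status code instead of the page HTML
curl -s -o /dev/null -w '%{http_code}\n' https://vidar-admin.ravenhelm.dev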
# Internal health check (if Traefik is down)
docker exec vidar-api curl -s localhost:8000/health
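When scripting these health checks, it can help to reduce the response to a single word. The sketch below maps an HTTP status code, as printed by curl's `-w '%{http_code}'`, to a rough verdict. The function name and the verdict labels are assumptions for illustration. The 502/503/504 grouping reflects how a reverse proxy such as Traefik typically reports a backend it cannot reach.

```shell
# Hypothetical helper: maps an HTTP status code to a coarse health verdict.
# curl prints 000 for %{http_code} when it received no HTTP response at all.
classify_http() {
  case "$1" in
    000)         echo "unreachable" ;;
    2??)         echo "healthy" ;;
    3??)         echo "redirect" ;;
    502|503|504) echo "upstream-unavailable" ;;
    *)           echo "unhealthy" ;;
  esac
}
```

For example: `classify_http "$(curl -s -o /dev/null -w '%{http_code}' https://vidar-api.ravenhelm.dev/health)"`.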
Step 3: Check Recent Logs
# API logs
docker logs --tail 100 vidar-api
# Admin logs
docker logs --tail 100 vidar-admin
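A quick way to triage the log tail is to count the lines that look like errors before reading them in full. This is a minimal sketch. The keyword list is an assumption, since the runbook does not describe Vidar's log format, and `count_error_lines` is a hypothetical name.

```shell
# Hypothetical helper: reads log text on stdin and prints how many lines
# contain an error-like keyword (case-insensitive). grep -c prints 0 and
# exits non-zero when nothing matches, so "|| true" keeps the status clean.
count_error_lines() {
  grep -ciE 'error|exception|traceback|critical' || true
}
```

Usage: `docker logs --tail 100 vidar-api 2>&1 | count_error_lines`. The `2>&1` matters because `docker logs` replays the container's stderr stream on stderr.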
Common Issues
1. Container Not Running
Symptoms: Container in Exited state or not present
Solution:
cd ~/ravenhelm/services/vidar
docker compose up -d
docker compose logs -f
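Before restarting, the exit code of the stopped container often hints at why it stopped. The sketch below interprets the common cases. The function name is hypothetical, and the explanations are general conventions (128 plus the signal number for signal deaths), not anything specific to Vidar.

```shell
# Hypothetical helper: explains a container exit code using the usual
# shell/Docker conventions. Codes above 128 mean "killed by signal N-128".
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly" ;;
    126) echo "container command could not be invoked" ;;
    127) echo "container command not found" ;;
    137) echo "killed by SIGKILL (often the OOM killer or docker kill)" ;;
    143) echo "terminated by SIGTERM (for example, docker stop)" ;;
    *)   echo "application error (exit $1)" ;;
  esac
}
```

Usage: `explain_exit_code "$(docker inspect --format '{{.State.ExitCode}}' vidar-api)"`. For a 137, `docker inspect --format '{{.State.OOMKilled}}' vidar-api` shows whether the kernel OOM killer was involved.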
2. Database Connection Issues
Symptoms: Logs show database connection errors
Solution:
# Check PostgreSQL is running
docker exec postgres pg_isready
# Verify Vidar tables exist
docker exec postgres psql -U ravenhelm -d ravenmaskos -c "\dt vidar.*"
# Restart if connection pool exhausted
docker restart vidar-api
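`pg_isready` reports its result through its exit status as well as its output, which is useful when the check runs from a script. The sketch below translates the documented statuses. The function name `explain_pg_isready` is an assumption for this example.

```shell
# Hypothetical helper: translates a pg_isready exit status into words.
# Statuses 0-3 are the ones documented for pg_isready.
explain_pg_isready() {
  case "$1" in
    0) echo "accepting connections" ;;
    1) echo "rejecting connections (for example, still starting up)" ;;
    2) echo "no response to the connection attempt" ;;
    3) echo "no attempt made (for example, invalid parameters)" ;;
    *) echo "unexpected exit status $1" ;;
  esac
}
```

Usage: `docker exec postgres pg_isready; explain_pg_isready "$?"`. A non-zero status can also come from `docker exec` itself (for instance if the postgres container is stopped), so check the container state if the result looks odd.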
3. No Traffic Alert (False Positive)
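Context: The vidar-api-down alert monitors traffic via Traefik metrics. It fires when Traefik records no requests to the Vidar API for 10 minutes or when the error rate exceeds 90%, so a quiet period with no callers can trigger it even though the service is healthy.
Important: The alert uses noDataState: OK, so it does not fire when Prometheus returns no data for the query.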
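The firing condition described above can be sketched as a small decision function, which makes it easier to reason about whether a given window would trip the alert. This is an illustration only: the real rule definition is not shown in this runbook, the function name is hypothetical, and treating "errors" as a plain count of failed requests is an assumption.

```shell
# Hypothetical model of the vidar-api-down condition.
# $1 = total requests in the window, $2 = failed requests in the window.
# Fires on zero traffic, or when more than 90% of requests failed.
# Integer math (errors*10 > total*9) avoids floating-point comparison.
alert_would_fire() {
  total=$1 errors=$2
  if [ "$total" -eq 0 ] || [ $((errors * 10)) -gt $((total * 9)) ]; then
    echo "fire"
  else
    echo "ok"
  fi
}
```

Note that this models only windows where Prometheus returned a value. A query that returns no data at all is handled separately by noDataState: OK.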
Check traffic metrics:
# Query Prometheus for Vidar API traffic
docker exec prometheus wget -qO- \
'http://localhost:9090/api/v1/query?query=sum(increase(traefik_service_requests_total{service="vidar-api@file"}[10m]))'
Note: The Vidar API uses Traefik file-based routing (@file), not Docker labels (@docker), so the service name in metrics is vidar-api@file.
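The Prometheus query returns a JSON document, and the number of interest is buried inside it. The sketch below pulls out the first sample value, or prints `no-data` when the result vector is empty (the case that noDataState: OK covers). It assumes the compact single-line JSON that the Prometheus HTTP API normally returns and deliberately avoids a jq dependency. `prom_value` is a hypothetical name.

```shell
# Hypothetical helper: reads a Prometheus instant-query response on stdin
# and prints the first sample value, or "no-data" if the result is empty.
# Assumes compact JSON such as: ..."value":[1736400000.1,"42"]...
prom_value() {
  v=$(sed -n 's/.*"value":\[[^,]*,"\([^"]*\)"].*/\1/p' | head -n 1)
  printf '%s\n' "${v:-no-data}"
}
```

Piping the output of the `wget` query into `prom_value` gives a single number to compare against zero. An output of `no-data` usually points at a wrong service label (for example @docker instead of @file) rather than at missing traffic.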
4. Webhook Processing Failures
Symptoms: Alerts from Grafana/Alertmanager are not being processed
Solution:
# Check recent webhook logs
docker logs vidar-api 2>&1 | grep -i webhook | tail -20
# Verify webhook endpoint is accepting requests
curl -X POST https://vidar-api.ravenhelm.dev/webhooks/grafana \
-H "Content-Type: application/json" \
-d '{"test": true}'
5. LangFuse/OpenAI Integration Issues
Symptoms: Logs show LangFuse or OpenAI errors
Check:
# Verify environment variables are set
docker exec vidar-api env | grep -E "(LANGFUSE|OPENAI)"
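The command above prints the variable values, which may include API keys, into the terminal and its scrollback. As a hedged alternative, the sketch below reports only whether each matching variable is set or empty. The function name `redact_env` is an assumption, and it matches on the same LANGFUSE/OPENAI prefixes used above rather than on specific variable names.

```shell
# Hypothetical helper: reads `env` output on stdin and, for variables whose
# names start with LANGFUSE or OPENAI, prints "NAME set" or "NAME EMPTY"
# without ever printing the value.
redact_env() {
  grep -E '^(LANGFUSE|OPENAI)[A-Za-z0-9_]*=' |
    awk -F= '{ print $1, (length($0) > length($1) + 1 ? "set" : "EMPTY") }' |
    sort
}
```

Usage: `docker exec vidar-api env | redact_env`. No output at all means none of the expected variables are defined in the container.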
Monitoring
Grafana Dashboards
- URL: https://grafana.ravenhelm.dev
- Dashboard: Vidar Overview
Key Metrics
# Request rate
sum(rate(traefik_service_requests_total{service="vidar-api@file"}[5m]))
# Error rate
sum(rate(traefik_service_requests_total{service="vidar-api@file",code=~"5.."}[5m]))
/ sum(rate(traefik_service_requests_total{service="vidar-api@file"}[5m]))
# Latency (if available)
histogram_quantile(0.95, sum(rate(traefik_service_request_duration_seconds_bucket{service="vidar-api@file"}[5m])) by (le))
Verification
After remediation, verify:
# Service responding
curl -s https://vidar-api.ravenhelm.dev/health | jq .
# Traffic flowing (wait about a minute first so the 1m window contains new requests)
docker exec prometheus wget -qO- \
'http://localhost:9090/api/v1/query?query=sum(increase(traefik_service_requests_total{service="vidar-api@file"}[1m]))'
# List active alerts and confirm none of them are for Vidar
docker exec alertmanager wget -qO- http://localhost:9093/api/v2/alerts
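The alerts endpoint returns every active alert as JSON, so the Vidar ones have to be picked out. The sketch below lists the alert names that mention "vidar". It assumes the alerts carry an `alertname` label containing that string, which this runbook does not confirm, and `vidar_alert_names` is a hypothetical name.

```shell
# Hypothetical helper: reads the Alertmanager /api/v2/alerts JSON on stdin
# and prints the distinct alertname values that contain "vidar".
vidar_alert_names() {
  grep -o '"alertname":"[^"]*"' | cut -d'"' -f4 | grep -i 'vidar' | sort -u
}
```

Usage: `docker exec alertmanager wget -qO- http://localhost:9093/api/v2/alerts | vidar_alert_names`. Empty output suggests no Vidar alert is currently active there. If the alerts are evaluated by Grafana's built-in alerting rather than routed through this Alertmanager, check the Grafana alert list as well.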
Related Runbooks
- Service Down - Generic service recovery
- Vidar Database Column Mismatch
- Vidar Discovery Duplicate Key
- Vidar OpenAI Rate Limiting
Change History
| Date | Change | Ticket |
|---|---|---|
| 2026-01-09 | Created runbook after RAV-259 investigation | RAV-259 |
| 2026-01-09 | Note: vidar-api uses @file not @docker service name | RAV-259 |