Runbook: Vidar Troubleshooting

Overview

  • What: Troubleshoot Vidar AIOps platform issues
  • When: Vidar API Down, Vidar Admin Unavailable, or alert processing issues
  • Duration: 5-30 minutes

Prerequisites

  • SSH access to odin (ssh ravenhelm@100.115.101.81)
  • Access to Grafana dashboards

Alert Types

| Alert | Severity | Description |
|---|---|---|
| vidar-api-down | Critical | Vidar API not receiving traffic or >90% errors |
| vidar-admin-unavailable | Warning | Vidar Admin dashboard not accessible |

Quick Checks

Step 1: Check Container Status

# On odin as ravenhelm
docker ps -a | grep vidar

Expected output lists both containers with a STATUS of Up (docker ps -a also lists exited containers, so check the status column):

  • vidar-api
  • vidar-admin
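
The check in Step 1 can be scripted so that it prints only what is missing. The following is a minimal sketch, not part of the original runbook: the function name `missing_vidar_containers` is hypothetical, and it assumes the container names are exactly vidar-api and vidar-admin as listed above.

```shell
#!/bin/sh
# Sketch: list expected Vidar containers that are NOT in the running set.
# Reads running container names on stdin, one per line.
# (missing_vidar_containers is a hypothetical helper name.)
missing_vidar_containers() {
  running="$(cat)"
  for name in vidar-api vidar-admin; do
    # -x matches the whole line, -F treats the name as a literal string
    if ! printf '%s\n' "$running" | grep -qxF "$name"; then
      echo "$name"
    fi
  done
}

# Usage on odin (docker ps without -a lists only running containers):
#   docker ps --format '{{.Names}}' | missing_vidar_containers
```

Empty output means both containers are running; any printed name points to the "Container Not Running" issue below.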

Step 2: Check Service Health

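# External health checks
curl -s https://vidar-api.ravenhelm.dev/health | jq .
# The admin endpoint returns HTML, so print only the HTTP status code
curl -s -o /dev/null -w "%{http_code}\n" https://vidar-admin.ravenhelm.dev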

# Internal health check (if Traefik is down)
docker exec vidar-api curl -s localhost:8000/health
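
When probing several endpoints, it can help to reduce each response to a status code and a verdict. The sketch below is one way to do that; `classify_http_status` and `probe` are hypothetical helper names, and the 10-second timeout is an arbitrary assumption. It relies on curl printing 000 for `%{http_code}` when no HTTP response was received.

```shell
#!/bin/sh
# Sketch: interpret the HTTP status code of a health probe.
# curl prints 000 for %{http_code} when no HTTP response was received.
classify_http_status() {
  case "$1" in
    2??) echo "healthy" ;;
    000) echo "unreachable" ;;
    5??) echo "server-error" ;;
    *)   echo "unexpected" ;;
  esac
}

# Probe a URL and print "<url> <code> <verdict>".
# The 10 second timeout (-m 10) is an assumption; adjust as needed.
probe() {
  code="$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$1")" || true
  echo "$1 $code $(classify_http_status "$code")"
}

# Usage:
#   probe https://vidar-api.ravenhelm.dev/health
#   probe https://vidar-admin.ravenhelm.dev
```

An "unreachable" verdict on the external URL combined with a healthy internal check suggests the problem sits in front of the container (for example Traefik or DNS) rather than in Vidar itself.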

Step 3: Check Recent Logs

# API logs
docker logs --tail 100 vidar-api

# Admin logs
docker logs --tail 100 vidar-admin
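
The last 100 lines can be noisy. A small filter, sketched below, keeps only lines that look like errors. The keyword pattern is an assumption about how Vidar logs failures, and `error_lines` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: keep only log lines that look like errors.
# The keyword list is an assumption; adjust it to Vidar's actual log format.
error_lines() {
  # "|| true" keeps the exit status 0 when nothing matches
  grep -iE 'error|exception|traceback|critical' || true
}

# Usage: error-like lines from the last 15 minutes (stdout and stderr):
#   docker logs --since 15m vidar-api 2>&1 | error_lines | tail -20
```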

Common Issues

1. Container Not Running

Symptoms: Container in Exited state or not present

Solution:

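cd ~/ravenhelm/services/vidar
# Start (or recreate) the containers in the background
docker compose up -d
# Follow the startup logs; press Ctrl+C to stop following
docker compose logs -f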

2. Database Connection Issues

Symptoms: Logs show database connection errors

Solution:

# Check PostgreSQL is running
docker exec postgres pg_isready

# Verify Vidar tables exist
docker exec postgres psql -U ravenhelm -d ravenmaskos -c "\dt vidar.*"

# Restart if connection pool exhausted
docker restart vidar-api
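
The exit status of pg_isready says more than its one-line output. The sketch below maps the documented exit codes to a message; `describe_pg_isready` is a hypothetical helper name. Any other code (for example one returned by docker exec itself when the container is missing) is reported as unexpected.

```shell
#!/bin/sh
# Sketch: translate a pg_isready exit status into a message.
# Codes 0-3 are the statuses documented for pg_isready.
describe_pg_isready() {
  case "$1" in
    0) echo "accepting connections" ;;
    1) echo "rejecting connections (for example, still starting up)" ;;
    2) echo "no response" ;;
    3) echo "no attempt made (for example, invalid parameters)" ;;
    *) echo "unexpected exit code $1" ;;
  esac
}

# Usage:
#   rc=0; docker exec postgres pg_isready || rc=$?
#   describe_pg_isready "$rc"
```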

3. No Traffic Alert (False Positive)

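Context: The vidar-api-down alert is evaluated on Traefik request metrics. It fires when the API receives no traffic for 10 minutes or when more than 90% of requests return errors. A quiet period can therefore trigger it even though the service is healthy.

Important: The alert sets noDataState: OK, so it does not fire when the query returns no data from Prometheus.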

Check traffic metrics:

# Query Prometheus for Vidar API traffic
docker exec prometheus wget -qO- \
'http://localhost:9090/api/v1/query?query=sum(increase(traefik_service_requests_total{service="vidar-api@file"}[10m]))'

Note: Vidar API uses Traefik file-based routing (@file) not Docker labels (@docker). The service name in metrics is vidar-api@file.
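
To reason about whether the alert would fire for a given pair of numbers, the condition described above can be written out as a small function. This is a sketch of the rule as stated in this runbook, not the actual alert definition: `alert_verdict` is a hypothetical name, and treating "errors" as the 5xx count is an assumption.

```shell
#!/bin/sh
# Sketch of the vidar-api-down condition as described in this runbook.
# $1 = requests in the last 10 minutes, $2 = error responses in the same window
# (assumed here to be 5xx responses). An empty $1 stands for "no data".
alert_verdict() {
  if [ -z "$1" ]; then
    echo "ok: no data (noDataState is OK)"
    return 0
  fi
  # awk handles the fractional values that increase() can return
  awk -v total="$1" -v errors="${2:-0}" 'BEGIN {
    if (total + 0 == 0)            print "fire: no traffic"
    else if (errors / total > 0.9) print "fire: error rate above 90%"
    else                           print "ok"
  }'
}

# Usage: feed in the values read from the Prometheus queries, e.g.
#   alert_verdict 0 0
#   alert_verdict 120 3
```

A "fire: no traffic" verdict while the health check passes points to a false positive caused by a quiet period rather than an outage.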

4. Webhook Processing Failures

Symptoms: Alerts not being processed from Grafana/AlertManager

Solution:

# Check recent webhook logs
docker logs vidar-api 2>&1 | grep -i webhook | tail -20

# Verify webhook endpoint is accepting requests
curl -X POST https://vidar-api.ravenhelm.dev/webhooks/grafana \
-H "Content-Type: application/json" \
-d '{"test": true}'

5. LangFuse/OpenAI Integration Issues

Symptoms: Logs show LangFuse or OpenAI errors

Check:

# Verify environment variables are set (prints names only, so secret values stay off the terminal)
docker exec vidar-api env | grep -E "(LANGFUSE|OPENAI)" | cut -d= -f1
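
To check for specific variables rather than scanning the list by eye, a helper along these lines can report which ones are missing or empty. It is a sketch: `missing_env_vars` is a hypothetical name, and the variable names in the usage line are assumptions, so substitute the ones your Vidar configuration actually uses.

```shell
#!/bin/sh
# Sketch: report required variables that are missing or empty in `env` output.
# Reads NAME=value lines on stdin; required names are passed as arguments.
# Only names are printed, so secret values never reach the terminal.
missing_env_vars() {
  # Keep names that have at least one character after the "="
  present="$(grep -E '^[A-Za-z_][A-Za-z0-9_]*=.' | cut -d= -f1)"
  for var in "$@"; do
    if ! printf '%s\n' "$present" | grep -qxF "$var"; then
      echo "$var"
    fi
  done
}

# Usage (variable names below are assumptions):
#   docker exec vidar-api env | missing_env_vars OPENAI_API_KEY LANGFUSE_PUBLIC_KEY LANGFUSE_SECRET_KEY
```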

Monitoring

Grafana Dashboards

Key Metrics

# Request rate
sum(rate(traefik_service_requests_total{service="vidar-api@file"}[5m]))

# Error rate
sum(rate(traefik_service_requests_total{service="vidar-api@file",code=~"5.."}[5m]))
/ sum(rate(traefik_service_requests_total{service="vidar-api@file"}[5m]))

# p95 latency in seconds (if the duration histogram is available)
histogram_quantile(0.95, sum(rate(traefik_service_request_duration_seconds_bucket{service="vidar-api@file"}[5m])) by (le))

Verification

After remediation, verify:

# Service responding
curl -s https://vidar-api.ravenhelm.dev/health | jq .

# Traffic flowing (wait a minute)
docker exec prometheus wget -qO- \
'http://localhost:9090/api/v1/query?query=sum(increase(traefik_service_requests_total{service="vidar-api@file"}[1m]))'

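# No firing alerts for vidar (expect 0; jq runs on odin, as in the health check)
docker exec alertmanager wget -qO- http://localhost:9093/api/v2/alerts \
| jq '[.[] | select((.labels.alertname // "") | test("vidar"))] | length'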
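
Services often need a few seconds after a restart before the health check passes. Instead of re-running the check by hand, a small retry helper can poll it. This is a sketch: `retry` is a hypothetical helper, and the attempt count and delay in the usage line are arbitrary assumptions.

```shell
#!/bin/sh
# Sketch: run a command up to $1 times, sleeping $2 seconds between attempts.
# Returns 0 on the first success, 1 if every attempt fails.
retry() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final attempt
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Usage: poll the health endpoint for up to about a minute.
# curl -f makes HTTP errors (status >= 400) count as failures.
#   retry 6 10 curl -sf -o /dev/null https://vidar-api.ravenhelm.dev/health
```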

Change History

| Date | Change | Ticket |
|---|---|---|
| 2026-01-09 | Created runbook after RAV-259 investigation | RAV-259 |
| 2026-01-09 | Note: vidar-api uses @file not @docker service name | RAV-259 |