Runbook: Service Down
Overview
- What: Investigate and restore a down service
- When: Uptime Kuma alert or user report
- Duration: 5-30 minutes
Prerequisites
- SSH access to odin
- Access to Grafana/logs
Procedure
Step 1: Identify the Issue
# Check container status
docker ps -a | grep <service-name>
# Possible results:
# - Exited: container crashed or was stopped
# - Restarting: crash loop
# - No output: container removed or never created
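The grep above matches any column of the docker ps output, including image names. As a minimal sketch, the state can instead be read directly with docker inspect and mapped to the next runbook action. It assumes the container name equals the service name; classify_state and container_diagnosis are hypothetical helper names, not part of any existing tooling.

```shell
# Map a Docker container state (the value of .State.Status) to a next action.
classify_state() {
  case "$1" in
    running)        echo "running: check logs and dependencies" ;;
    restarting)     echo "crash loop: read logs before restarting" ;;
    exited|dead)    echo "crashed or stopped: read logs, then restart" ;;
    created|paused) echo "not started: start or unpause the container" ;;
    "")             echo "not found: recreate with docker compose up -d" ;;
    *)              echo "unknown state: $1" ;;
  esac
}

# Hypothetical wrapper: print the diagnosis for one container name.
# An empty state means docker inspect found no such container.
container_diagnosis() {
  state=$(docker inspect --format '{{.State.Status}}' "$1" 2>/dev/null) || state=""
  classify_state "$state"
}

# Usage:
#   container_diagnosis <service-name>
```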
Step 2: Check Logs
docker logs --tail 100 <service-name>
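When the last 100 lines are mostly noise, it can help to narrow the output to a time window and to lines that look like failures. The sketch below uses a hypothetical filter_errors helper; the keyword list is an assumption and should be adjusted to the service's log format.

```shell
# Keep only lines that look like failures (case-insensitive match).
# "|| true" keeps the exit status at 0 when nothing matches.
filter_errors() {
  grep -iE 'error|fatal|panic|exception' || true
}

# Usage (docker logs writes the container's stderr to stderr, hence 2>&1):
#   docker logs --since 15m --timestamps <service-name> 2>&1 | filter_errors
```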
Step 3: Check Resources
# Disk space
df -h ~/ravenhelm/data/<service-name>
# Memory
docker stats --no-stream
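To turn the disk check into a yes/no answer, the usage percentage can be pulled out of df and compared with a threshold. This is a sketch only: df_pct and over_threshold are hypothetical helpers, and the 90% default is an assumed value, not a documented limit for this environment.

```shell
# Extract the used-space percentage (integer, no % sign) from `df -P` output on stdin.
# df -P prints one line per filesystem, so field 5 of line 2 is the capacity column.
df_pct() {
  awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when the percentage in $1 is at or above the threshold in $2 (default 90).
over_threshold() {
  [ "$1" -ge "${2:-90}" ]
}

# Usage:
#   pct=$(df -P ~/ravenhelm/data/<service-name> | df_pct)
#   over_threshold "$pct" && echo "disk nearly full: ${pct}%"
```

For memory, docker inspect --format '{{.State.OOMKilled}}' <service-name> reports whether the kernel killed the container for exceeding its memory limit.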
Step 4: Restart Service
cd ~/ravenhelm/services/<service-name>
docker compose restart
Step 5: If Restart Fails
# Full recreate
docker compose down
docker compose up -d
# Follow logs from the last 100 lines (Ctrl+C to stop)
docker compose logs -f --tail 100
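After a restart or recreate, the container may take a few seconds to settle. Rather than re-running docker ps by hand, a small polling loop can wait for it. The sketch below assumes the container name equals the service name; wait_until and container_running are hypothetical helpers.

```shell
# Retry a command once per second until it succeeds or the timeout (seconds) expires.
wait_until() {
  timeout_s=$1; shift
  elapsed=0
  until "$@"; do
    [ "$elapsed" -ge "$timeout_s" ] && return 1
    sleep 1
    elapsed=$((elapsed + 1))
  done
}

# Succeed when the named container reports the state "running".
container_running() {
  [ "$(docker inspect --format '{{.State.Status}}' "$1" 2>/dev/null)" = "running" ]
}

# Usage:
#   wait_until 60 container_running <service-name> || echo "still not running"
```

A container stuck in a crash loop alternates between "restarting" and "running", so a single success here is a hint, not proof of recovery; the Verification section remains the real check.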
Step 6: Check Dependencies
# Verify PostgreSQL
docker exec postgres pg_isready
# Verify Redis
docker exec redis redis-cli -a "$REDIS_PASSWORD" ping
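pg_isready reports its result through its exit code, which is easy to miss when running it by hand. The sketch below translates the documented codes into text; pg_status is a hypothetical helper. The Redis line in the usage comment passes the password through the REDISCLI_AUTH environment variable, which redis-cli reads, so it does not appear in the process arguments.

```shell
# Translate a pg_isready exit code into a readable status.
pg_status() {
  case "$1" in
    0) echo "accepting connections" ;;
    1) echo "rejecting connections (for example, still starting up)" ;;
    2) echo "no response" ;;
    3) echo "no attempt made (invalid parameters)" ;;
    *) echo "unexpected exit code $1 (docker exec itself may have failed)" ;;
  esac
}

# Usage (container names postgres and redis are taken from the runbook):
#   docker exec postgres pg_isready -q; pg_status $?
#   docker exec -e REDISCLI_AUTH="$REDIS_PASSWORD" redis redis-cli ping   # a healthy server replies PONG
```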
Verification
# Container running
docker ps | grep <service-name>
# Service responding
curl -I https://<service-name>.ravenhelm.dev
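For a scriptable version of the response check, curl can print only the HTTP status code, which is then classified. This is a sketch under assumptions: http_ok and probe are hypothetical helpers, treating 2xx and 3xx as healthy is a judgment call, and the request is a GET because some services do not answer HEAD requests.

```shell
# Classify an HTTP status code: 2xx and 3xx count as "responding".
http_ok() {
  case "$1" in
    2??|3??) return 0 ;;
    *)       return 1 ;;
  esac
}

# Print only the status code; curl prints 000 when no HTTP response was received.
probe() {
  curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$1"
}

# Usage:
#   http_ok "$(probe https://<service-name>.ravenhelm.dev)" && echo "service responding"
```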
Escalation
If the service won't recover:
- Check Grafana for system-wide issues
- Review recent changes
- Consider rollback
- Check disaster recovery options