Runbook: Service Down

Overview

  • What: Investigate and restore a down service
  • When: Uptime Kuma alert or user report
  • Duration: 5-30 minutes

Prerequisites

  • SSH access to odin
  • Access to Grafana/logs

Procedure

Step 1: Identify the Issue

# Check container status
docker ps -a | grep <service-name>

# Possible statuses:
# - Exited (<code>): Container stopped; a non-zero exit code indicates a crash
# - Restarting: Crash loop
# - No output: Container does not exist (removed or never created)
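
The status check above can also be scripted. The following is a minimal sketch, assuming the container name equals the service name as in the commands above; `classify_state` is a hypothetical helper name, not part of any existing tooling on odin.

```shell
# Hypothetical helper: map a container's state to the categories listed above.
classify_state() {
  local name="$1" status
  # `docker container inspect` exits non-zero when the container does not exist.
  if ! status="$(docker container inspect --format '{{.State.Status}}' "$name" 2>/dev/null)"; then
    echo "not-found"
    return 0
  fi
  case "$status" in
    running)     echo "running" ;;
    restarting)  echo "crash-loop" ;;
    exited|dead) echo "crashed" ;;
    *)           echo "other:$status" ;;  # e.g. created, paused, removing
  esac
}
```

Usage: `classify_state <service-name>` prints one word, which makes it easy to branch on in a longer script.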

Step 2: Check Logs

docker logs --tail 100 <service-name>
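
When the last 100 lines are noisy, it can help to filter for likely failure messages. This is a sketch only: `recent_errors` is a hypothetical helper, and the keyword list is an assumption that should be adjusted to what the service actually logs. `docker logs` relays the container's stderr on its own stderr, so the sketch merges both streams before filtering.

```shell
# Hypothetical helper: show only suspicious lines from the recent log tail.
recent_errors() {
  local name="$1" lines="${2:-100}"
  # Merge stderr into stdout so messages from both streams are searched.
  # `|| true` keeps the exit status 0 when grep finds no matches.
  docker logs --timestamps --tail "$lines" "$name" 2>&1 \
    | grep -iE 'error|fatal|panic|exception' || true
}
```

Usage: `recent_errors <service-name> 500` searches the last 500 lines instead of the default 100.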

Step 3: Check Resources

# Disk space
df -h ~/ravenhelm/data/<service-name>

# Memory
docker stats --no-stream
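
The disk check can be turned into a pass/fail test. The sketch below assumes a 90% usage threshold, which is an illustrative default and not a documented limit for odin; `used_pct` and `disk_ok` are hypothetical helper names. It uses `df -P` so that each filesystem is reported on a single line.

```shell
# Hypothetical helper: read `df -P` output on stdin, print Use% as a bare number.
used_pct() {
  awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: succeed when usage of the filesystem holding PATH
# is below LIMIT percent (default 90, an assumed threshold).
disk_ok() {
  local path="$1" limit="${2:-90}" pct
  pct="$(df -P "$path" | used_pct)"
  [ "$pct" -lt "$limit" ]
}
```

Usage: `disk_ok ~/ravenhelm/data/<service-name> || echo "disk nearly full"`.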

Step 4: Restart Service

cd ~/ravenhelm/services/<service-name>
docker compose restart
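
A successful `docker compose restart` does not by itself show that the container stays up, and it does not pick up changes made to the compose file. The sketch below polls the container state for a short period after the restart. It assumes the container name equals the service name; `wait_running` is a hypothetical helper and the 30-second default is arbitrary. A crash-looping container can report as running for a moment, so Step 1 is still worth repeating afterwards.

```shell
# Hypothetical helper: wait up to TIMEOUT seconds for the container to report
# State.Running=true. Returns 0 on success, 1 on timeout.
wait_running() {
  local name="$1" timeout="${2:-30}" elapsed=0 state
  while [ "$elapsed" -lt "$timeout" ]; do
    # Treat an inspect failure (container missing) as "not running yet".
    state="$(docker container inspect --format '{{.State.Running}}' "$name" 2>/dev/null)" || state=""
    if [ "$state" = "true" ]; then
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 1
}
```

Usage: `docker compose restart && wait_running <service-name> 60 || echo "still down, continue with Step 5"`.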

Step 5: If Restart Fails

# Full recreate
docker compose down
docker compose up -d

# Follow logs (Ctrl+C to stop)
docker compose logs -f

Step 6: Check Dependencies

# Verify PostgreSQL
docker exec postgres pg_isready

# Verify Redis (expects PONG; REDIS_PASSWORD must be set in your shell)
docker exec redis redis-cli -a "$REDIS_PASSWORD" ping
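
Both dependency checks can be combined into one summary. This is a sketch under the same assumptions as the commands above: the containers are named `postgres` and `redis`, and `REDIS_PASSWORD` is set in the current shell. `check_deps` is a hypothetical helper name.

```shell
# Hypothetical helper: report the state of each dependency and return
# non-zero if any check failed.
check_deps() {
  local failed=0 reply
  # pg_isready exits 0 only when the server is accepting connections.
  if docker exec postgres pg_isready >/dev/null 2>&1; then
    echo "postgres: ok"
  else
    echo "postgres: FAILED"
    failed=1
  fi
  # redis-cli prints PONG on success; stderr is dropped to hide the
  # password-on-command-line warning.
  reply="$(docker exec redis redis-cli -a "${REDIS_PASSWORD:-}" ping 2>/dev/null)" || reply=""
  if [ "$reply" = "PONG" ]; then
    echo "redis: ok"
  else
    echo "redis: FAILED"
    failed=1
  fi
  return "$failed"
}
```

Usage: `check_deps || echo "fix dependencies before restarting the service again"`.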

Verification

# Container running
docker ps | grep <service-name>

# Service responding (expect a 2xx or 3xx status line)
curl -I https://<service-name>.ravenhelm.dev
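
The HTTP check can be reduced to a pass/fail result for use in scripts. The sketch below treats any 2xx or 3xx status as healthy, which is an assumption; a service that sits behind authentication may legitimately answer 401. `http_ok` is a hypothetical helper name and the 10-second timeout is arbitrary.

```shell
# Hypothetical helper: succeed when URL answers with a 2xx or 3xx status.
http_ok() {
  local url="$1" code
  # -w '%{http_code}' prints only the status code; curl reports 000 when
  # no HTTP response was received (DNS failure, refused connection, timeout).
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url" 2>/dev/null)" || true
  case "$code" in
    2??|3??) return 0 ;;
    *) echo "unexpected HTTP status: ${code:-none}" >&2; return 1 ;;
  esac
}
```

Usage: `http_ok https://<service-name>.ravenhelm.dev && echo "service is responding"`.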

Escalation

If the service still won't recover:

  1. Check Grafana for system-wide issues
  2. Review recent changes
  3. Consider rollback
  4. Check disaster recovery options