Debug Failing Service
Troubleshoot and fix container issues using voice, chat, or CLI.
Overview
When a service fails, you can diagnose and fix it through:
- Voice/Chat: Ask Norns to check container status and logs
- CLI: Direct Docker commands
- Grafana: Visual dashboards and alerts
Via Voice or Chat
Check Container Status
You: "Is postgres healthy?"
Norns: "Let me check... The postgres container is running and healthy. It has been up for 5 days. CPU usage is at 3% and memory at 512MB."
Check All Containers
You: "Are any containers unhealthy?"
Norns: "I found 2 containers with issues. The spire-agent container is in a restart loop with exit code 1. The uptime-kuma container is running but not responding to health checks."
Query Logs
You: "Show me the recent errors from norns-agent"
Norns: "Here are the last 5 errors from norns-agent:
- 10:23 - ConnectionError: Redis connection refused
- 10:23 - RetryError: Max retries exceeded
- 10:24 - ConnectionError: Redis connection refused ..."
Ask for Help
You: "Why is redis not working?"
Norns: "Let me investigate... Redis appears to have stopped. The last log entry shows an out-of-memory error. The container exited 10 minutes ago. Would you like me to restart it?"
You: "Yes, restart redis"
Norns: "I have restarted the redis container. It is now running and healthy. The connected services should recover automatically."
Via CLI
Quick Health Check
# SSH to odin
ssh ravenhelm@100.115.101.81
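# List containers that are not reporting healthy
# (match "(healthy)" exactly; a plain "healthy" pattern would also hide "unhealthy")
docker ps --format "table {{.Names}}\t{{.Status}}" | grep -v "(healthy)"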
# Check specific container
docker inspect postgres --format "{{.State.Health.Status}}"
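Filtering on the text of the status column works, but Docker can also select containers by health and state directly. The following is a minimal sketch that assumes a POSIX shell; the function name `list_problem_containers` is hypothetical and not part of the Ravenhelm tooling.

```shell
# Hypothetical helper: print containers that are unhealthy or stuck restarting.
# Uses the built-in "health" and "status" filters of `docker ps`.
list_problem_containers() {
  {
    docker ps -a --filter "health=unhealthy" --format '{{.Names}} unhealthy'
    docker ps -a --filter "status=restarting" --format '{{.Names}} restarting'
  } | sort -u
}

# Example call (run on the Docker host):
#   list_problem_containers
```

An empty result means no container matched either filter.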
View Logs
# Last 50 lines
docker logs norns-agent --tail=50
# Follow logs in real-time
docker logs -f norns-agent
# Filter for errors
docker logs norns-agent 2>&1 | grep -i error | tail -20
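When several services fail at once, counting error lines per container can show where the problem started. This sketch assumes a POSIX shell and plain-text logs that contain the word "error"; `count_errors` is a hypothetical name.

```shell
# Hypothetical helper: count log lines mentioning "error" within a time window.
# Usage: count_errors <since> <container>...
count_errors() {
  since="$1"
  shift
  for c in "$@"; do
    # grep -c prints 0 when nothing matches, so every container gets a row.
    n="$(docker logs --since "$since" "$c" 2>&1 | grep -ci error)"
    printf '%s\t%s\n' "$c" "$n"
  done
}

# Example call:
#   count_errors 1h norns-agent redis postgres
```

The match is a simple case-insensitive substring search, so it also counts lines such as "0 errors found"; treat the numbers as a rough signal.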
Restart Service
# Simple restart
docker restart norns-agent
# Full restart via compose
cd ~/ravenhelm/docs/AI-ML-Platform/norns-agent
docker-compose restart
# Recreate container
docker-compose up -d --force-recreate
Check Resources
# Container resource usage
docker stats --no-stream
# System disk space
df -h
# Docker disk usage
docker system df
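A full disk is easy to miss in `df -h` output. The sketch below turns the check into a pass/fail test; it assumes a POSIX shell, and both the function name `check_disk` and the threshold are illustrative choices, not values from this platform.

```shell
# Hypothetical helper: fail when a filesystem is at or above a usage threshold.
# Usage: check_disk <mount-point> <threshold-percent>
check_disk() {
  # df -P gives a stable, single-line-per-filesystem format; column 5 is "NN%".
  pct="$(df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }')"
  if [ "$pct" -ge "$2" ]; then
    echo "WARN: $1 is ${pct}% full (threshold $2%)"
    return 1
  fi
  echo "OK: $1 is ${pct}% full"
}

# Example call:
#   check_disk / 90
```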
Common Issues and Solutions
Container in Restart Loop
Symptoms: Status shows "Restarting" repeatedly
Diagnosis:
# Check exit code
docker inspect <container> --format "{{.State.ExitCode}}"
# View last logs before crash
docker logs <container> --tail=100
Common Causes:
- Missing environment variables
- Database connection failed
- Port already in use
- Out of memory
Solution:
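# Check environment (docker exec fails while the container is restarting,
# so read the configured variables from the container definition instead)
docker inspect <container> --format "{{range .Config.Env}}{{println .}}{{end}}" | grep -i password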
# Check port conflicts
lsof -i :<port>
# Increase memory (in docker-compose.yml)
deploy:
  resources:
    limits:
      memory: 4G
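The exit code from the diagnosis step usually narrows down which of the common causes applies. The sketch below maps well-known codes to a likely cause and reports the restart count; it assumes a POSIX shell, and the function names are hypothetical.

```shell
# Map a container exit code to a likely cause.
# Codes above 128 mean the process died from a signal (code minus 128).
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly" ;;
    1)   echo "application error, check logs" ;;
    125) echo "docker run itself failed" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found" ;;
    137) echo "killed by SIGKILL, often out of memory" ;;
    139) echo "segmentation fault" ;;
    143) echo "terminated by SIGTERM" ;;
    *)   echo "unrecognized exit code" ;;
  esac
}

# Hypothetical wrapper: summarize exit code and restart count for one container.
diagnose_restart_loop() {
  code="$(docker inspect "$1" --format '{{.State.ExitCode}}')"
  restarts="$(docker inspect "$1" --format '{{.RestartCount}}')"
  echo "$1: exit $code ($(explain_exit_code "$code")), restarts=$restarts"
}

# Example call:
#   diagnose_restart_loop spire-agent
```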
Database Connection Failed
Symptoms: "Connection refused" or "FATAL: password authentication failed"
Diagnosis:
# Check postgres is running
docker ps | grep postgres
# Test connection
docker exec -it postgres psql -U ravenhelm -c "SELECT 1"
Solution:
# Restart postgres
docker restart postgres
# Check health status (repeat until it reports "healthy")
docker inspect postgres --format "{{.State.Health.Status}}"
# Restart dependent services
docker restart norns-agent bifrost-api
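Restarting dependent services before postgres is ready can leave them in the same failed state. The sketch below polls the health status until it reports `healthy` or a timeout expires. It assumes a POSIX shell and a container with a health check defined; `wait_healthy` and its default timings are hypothetical.

```shell
# Hypothetical helper: poll until a container reports "healthy" or time runs out.
# Usage: wait_healthy <container> [timeout-seconds] [interval-seconds]
wait_healthy() {
  timeout="${2:-60}"
  interval="${3:-2}"
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    status="$(docker inspect "$1" --format '{{.State.Health.Status}}' 2>/dev/null)"
    if [ "$status" = "healthy" ]; then
      echo "$1 is healthy after ${elapsed}s"
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "$1 not healthy after ${timeout}s (last status: ${status:-unknown})" >&2
  return 1
}

# Example: restart dependents only once postgres is ready.
#   docker restart postgres && wait_healthy postgres 120 && docker restart norns-agent bifrost-api
```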
Out of Memory
Symptoms: Container killed, exit code 137
Diagnosis:
# Check container memory limit (in bytes; 0 means no limit)
docker inspect <container> --format "{{.HostConfig.Memory}}"
# Check host memory
free -h
# Check Docker events for OOM kills (prints the last 24h, then streams live; press Ctrl+C to stop)
docker events --since 24h --filter "event=oom"
Solution:
# Increase container limit
# In docker-compose.yml:
deploy:
  resources:
    limits:
      memory: 8G
# Or increase Colima VM
colima stop
colima start --memory 24 --cpu 8
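The memory limit from `docker inspect` is reported in bytes, which is hard to compare with the `4G` or `8G` values in docker-compose.yml. This sketch converts the value and adds Docker's own OOM flag for the container. It assumes a POSIX shell; the function names are hypothetical.

```shell
# Convert a byte count from .HostConfig.Memory into MiB.
# Docker reports 0 when no limit is set.
format_mem_limit() {
  if [ "$1" -eq 0 ]; then
    echo "unlimited"
  else
    echo "$(( $1 / 1024 / 1024 )) MiB"
  fi
}

# Hypothetical wrapper: print the limit and whether the kernel OOM-killed the container.
memory_report() {
  bytes="$(docker inspect "$1" --format '{{.HostConfig.Memory}}')"
  oom="$(docker inspect "$1" --format '{{.State.OOMKilled}}')"
  echo "$1: limit=$(format_mem_limit "$bytes"), oom_killed=$oom"
}

# Example call:
#   memory_report redis
```

If `oom_killed` is `false` but the exit code is 137, the container was killed by something else, for example a manual `docker kill` or memory pressure on the host or VM as a whole.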
Network Issues
Symptoms: "Name or service not known" or connection timeouts
Diagnosis:
# Check network exists
docker network ls | grep ravenhelm_net
# Check container is on network
docker inspect <container> --format "{{.NetworkSettings.Networks}}"
# Test DNS resolution
docker exec traefik nslookup norns-agent
Solution:
# Reconnect to network
docker network connect ravenhelm_net <container>
# Recreate the containers so they rejoin the network
docker-compose down && docker-compose up -d
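Running `docker network connect` on a container that is already attached returns an error, so it helps to check membership first. A minimal sketch, assuming a POSIX shell; `on_network` and `ensure_on_network` are hypothetical names.

```shell
# Hypothetical helper: succeed if a container is attached to the given network.
# Usage: on_network <container> <network>
on_network() {
  docker inspect "$1" \
    --format '{{range $name, $cfg := .NetworkSettings.Networks}}{{println $name}}{{end}}' \
    | grep -qxF "$2"
}

# Connect the container only when it is not already attached.
ensure_on_network() {
  if on_network "$1" "$2"; then
    echo "$1 already on $2"
  else
    docker network connect "$2" "$1" && echo "connected $1 to $2"
  fi
}

# Example call:
#   ensure_on_network norns-agent ravenhelm_net
```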
Tool Reference
container_status
{
"name": "container_status",
"domain": "observability",
"input_schema": {
"container_name": "string (optional, all if omitted)",
"include_resources": "boolean (default true)"
}
}
query_logs
{
"name": "query_logs",
"domain": "observability",
"input_schema": {
"container": "string (required)",
"level": "error|warn|info|debug (optional)",
"since": "duration like 1h, 30m (optional)",
"limit": "number (default 50)"
}
}
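To make the `query_logs` schema concrete, the sketch below builds one input object from its four fields. How the object is delivered to Norns is not specified here, so the helper only prints the JSON. The function name and the fallback values for `level` and `since` are assumptions for illustration; only the `limit` default of 50 comes from the schema.

```shell
# Hypothetical helper: print a query_logs input object.
# Usage: query_logs_input <container> [level] [since] [limit]
query_logs_input() {
  printf '{"container": "%s", "level": "%s", "since": "%s", "limit": %d}\n' \
    "$1" "${2:-error}" "${3:-1h}" "${4:-50}"
}

# Example call:
#   query_logs_input norns-agent error 30m 20
```

This corresponds to the request "Show me the recent errors from norns-agent" narrowed to the last 30 minutes and 20 entries.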
Escalation Path
- Self-Service: Ask Norns or check Grafana
- CLI Debug: SSH and run Docker commands
- Restart Service: docker-compose restart
- Check Dependencies: Verify postgres, redis are healthy
- Full Restart: docker-compose down && docker-compose up -d
- Check Logs: Review for root cause
- Restore Backup: If data corruption
See Also
- [[Query-Logs]] - Detailed log searching
- [[View-LLM-Traces]] - Debug AI issues
- [[Check-System-Health]] - Overall monitoring
- [[../Observability]] - Grafana dashboards