Debug Failing Service

Troubleshoot and fix container issues using voice, chat, or CLI.


Overview

When a service fails, you can diagnose and fix it through:

  • Voice/Chat: Ask Norns to check container status and logs
  • CLI: Direct Docker commands
  • Grafana: Visual dashboards and alerts

Via Voice or Chat

Check Container Status

You: "Is postgres healthy?"

Norns: "Let me check... The postgres container is running and healthy. It has been up for 5 days. CPU usage is at 3% and memory at 512MB."


Check All Containers

You: "Are any containers unhealthy?"

Norns: "I found 2 containers with issues. The spire-agent container is in a restart loop with exit code 1. The uptime-kuma container is running but not responding to health checks."


Query Logs

You: "Show me the recent errors from norns-agent"

Norns: "Here are the last 5 errors from norns-agent:

  1. 10:23 - ConnectionError: Redis connection refused
  2. 10:23 - RetryError: Max retries exceeded
  3. 10:24 - ConnectionError: Redis connection refused ..."

Ask for Help

You: "Why is redis not working?"

Norns: "Let me investigate... Redis appears to have stopped. The last log entry shows an out-of-memory error. The container exited 10 minutes ago. Would you like me to restart it?"

You: "Yes, restart redis"

Norns: "I have restarted the redis container. It is now running and healthy. The connected services should recover automatically."


Via CLI

Quick Health Check

# SSH to odin
ssh ravenhelm@100.115.101.81

# List containers that are not healthy (matching "(healthy)" keeps "(unhealthy)" rows visible)
docker ps --format "table {{.Names}}\t{{.Status}}" | grep -v "(healthy)"

# Check specific container
docker inspect postgres --format "{{.State.Health.Status}}"
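The quick check above can be narrowed to only the containers that need attention. The following is a minimal sketch, not part of the existing tooling: `problem_containers` is a hypothetical helper name, and it assumes the status strings Docker normally prints, such as "Up 2 hours (unhealthy)", "Restarting (1) 5 seconds ago", or "Exited (137) 10 minutes ago".

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "name<TAB>status" lines on stdin and keeps only
# containers that are unhealthy, restarting, or exited.
problem_containers() {
  awk -F'\t' '$2 ~ /\(unhealthy\)|^Restarting|^Exited/ { print $1 ": " $2 }'
}

# Intended use on the host (requires a running Docker daemon):
#   docker ps -a --format '{{.Names}}\t{{.Status}}' | problem_containers
```

Passing `-a` to `docker ps` includes stopped containers, which the plain `docker ps` listing omits.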

View Logs

# Last 50 lines
docker logs norns-agent --tail=50

# Follow logs in real-time
docker logs -f norns-agent

# Filter for errors
docker logs norns-agent 2>&1 | grep -i error | tail -20
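When the error filter returns many lines, grouping them by error class can show which failure dominates. This is a sketch under one assumption: that log lines carry an `<Name>Error` token, as in the sample Norns output earlier on this page. `count_error_types` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: counts occurrences of each "<Name>Error" token on stdin,
# most frequent first. Assumes error classes appear as e.g. "ConnectionError".
count_error_types() {
  grep -oE '[A-Za-z]+Error' | sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# Intended use on the host:
#   docker logs norns-agent --since 1h 2>&1 | count_error_types
```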

Restart Service

# Simple restart
docker restart norns-agent

# Full restart via compose
cd ~/ravenhelm/docs/AI-ML-Platform/norns-agent
docker-compose restart

# Recreate container
docker-compose up -d --force-recreate

Check Resources

# Container resource usage
docker stats --no-stream

# System disk space
df -h

# Docker disk usage
docker system df

Common Issues and Solutions

Container in Restart Loop

Symptoms: Status shows "Restarting" repeatedly

Diagnosis:

# Check exit code
docker inspect <container> --format "{{.State.ExitCode}}"

# View last logs before crash
docker logs <container> --tail=100
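The exit code often narrows the cause before you read any logs. The sketch below maps common codes to a likely meaning, following the usual shell convention that a code above 128 is 128 plus the signal number. `explain_exit_code` is a hypothetical helper, and the descriptions are general hints rather than a diagnosis.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a likely meaning for a container exit code.
explain_exit_code() {
  local code="$1"
  case "$code" in
    0)   echo "exited cleanly" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found" ;;
    137) echo "killed by SIGKILL (signal 9), often the OOM killer" ;;
    143) echo "terminated by SIGTERM (signal 15)" ;;
    *)
      # Codes above 128 conventionally mean "killed by signal (code - 128)".
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "killed by signal $((code - 128))"
      else
        echo "application error, check the logs"
      fi
      ;;
  esac
}

# Intended use on the host:
#   explain_exit_code "$(docker inspect <container> --format '{{.State.ExitCode}}')"
```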

Common Causes:

  • Missing environment variables
  • Database connection failed
  • Port already in use
  • Out of memory

Solution:

# Check environment (docker inspect also works while the container is stopped or restarting)
docker inspect <container> --format "{{json .Config.Env}}"

# Check port conflicts
lsof -i :<port>

# Increase memory (in docker-compose.yml, under the service definition)
deploy:
  resources:
    limits:
      memory: 4G

Database Connection Failed

Symptoms: "Connection refused" or "FATAL: password authentication failed"

Diagnosis:

# Check postgres is running
docker ps | grep postgres

# Test connection
docker exec -it postgres psql -U ravenhelm -c "SELECT 1"

Solution:

# Restart postgres
docker restart postgres

# Check health (repeat until it reports "healthy")
docker inspect postgres --format "{{.State.Health.Status}}"

# Restart dependent services
docker restart norns-agent bifrost-api
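The health check above reports the status once, so dependent services can be restarted too early. A polling loop is one way to close that gap. This is a sketch: `wait_healthy` is a hypothetical helper, and the 60-second default timeout and 2-second interval are assumptions to adjust for your services.

```shell
#!/usr/bin/env bash
# Hypothetical helper: polls a container's health status until it is "healthy"
# or the timeout (seconds, default 60) elapses. Returns 0 on success, 1 on timeout.
wait_healthy() {
  local container="$1" timeout="${2:-60}" waited=0 status=""
  while [ "$waited" -lt "$timeout" ]; do
    status="$(docker inspect "$container" --format '{{.State.Health.Status}}' 2>/dev/null)"
    if [ "$status" = "healthy" ]; then
      return 0
    fi
    sleep 2
    waited=$((waited + 2))
  done
  echo "$container not healthy after ${timeout}s (last status: ${status:-unknown})" >&2
  return 1
}

# Intended use on the host:
#   docker restart postgres && wait_healthy postgres 120 \
#     && docker restart norns-agent bifrost-api
```

This only works for containers that define a health check. Without one, the status field is absent and the loop times out.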

Out of Memory

Symptoms: Container killed, exit code 137 (128 + 9, meaning SIGKILL)

Diagnosis:

# Check container memory limit (in bytes; 0 means no limit)
docker inspect <container> --format "{{.HostConfig.Memory}}"

# Check host memory
free -h

# Check Docker events for OOM kills (--until makes the command exit instead of streaming)
docker events --since 24h --until "$(date +%s)" --filter "event=oom"
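The memory limit from `docker inspect` is a raw byte count, which is awkward to compare with the `4G` or `8G` values in docker-compose.yml. A small conversion helps. `format_mem_limit` is a hypothetical helper, and the sketch rounds down to whole MiB.

```shell
#!/usr/bin/env bash
# Hypothetical helper: converts a HostConfig.Memory byte count to MiB.
# Docker reports 0 when no memory limit is set.
format_mem_limit() {
  local bytes="$1"
  if [ "$bytes" -eq 0 ]; then
    echo "unlimited"
  else
    echo "$((bytes / 1024 / 1024)) MiB"
  fi
}

# Intended use on the host:
#   format_mem_limit "$(docker inspect <container> --format '{{.HostConfig.Memory}}')"
```

A limit written as `4G` in the compose file corresponds to 4096 MiB here.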

Solution:

# Increase container limit
# In docker-compose.yml, under the service definition:
deploy:
  resources:
    limits:
      memory: 8G

# Or increase Colima VM
colima stop
colima start --memory 24 --cpu 8

Network Issues

Symptoms: "Name or service not known" or connection timeouts

Diagnosis:

# Check network exists
docker network ls | grep ravenhelm_net

# Check container is on network
docker inspect <container> --format "{{.NetworkSettings.Networks}}"

# Test DNS resolution
docker exec traefik nslookup norns-agent

Solution:

# Reconnect to network
docker network connect ravenhelm_net <container>

# Restart with network
docker-compose down && docker-compose up -d
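The raw `.NetworkSettings.Networks` output is a Go map dump that is hard to read at a glance. The sketch below turns the membership check into a yes/no answer and reconnects only when needed. `on_network` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) if the container is attached to the
# named Docker network, fails otherwise.
on_network() {
  local container="$1" network="$2"
  docker inspect "$container" \
    --format '{{range $k, $v := .NetworkSettings.Networks}}{{println $k}}{{end}}' \
    | grep -qFx "$network"
}

# Intended use on the host:
#   on_network norns-agent ravenhelm_net \
#     || docker network connect ravenhelm_net norns-agent
```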

Tool Reference

container_status

{
  "name": "container_status",
  "domain": "observability",
  "input_schema": {
    "container_name": "string (optional, all if omitted)",
    "include_resources": "boolean (default true)"
  }
}

query_logs

{
  "name": "query_logs",
  "domain": "observability",
  "input_schema": {
    "container": "string (required)",
    "level": "error|warn|info|debug (optional)",
    "since": "duration like 1h, 30m (optional)",
    "limit": "number (default 50)"
  }
}
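To relate the `query_logs` parameters to the CLI commands earlier on this page, the sketch below prints a roughly equivalent `docker logs` pipeline. The mapping is an assumption for illustration only: how Norns actually implements the tool is not documented here, and `query_logs_cmd` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints an approximate CLI equivalent of a query_logs call.
# Arguments: container (required), level, since, limit (default 50).
# Assumption: "level" is approximated by a case-insensitive grep on the log text.
query_logs_cmd() {
  local container="$1" level="${2:-}" since="${3:-}" limit="${4:-50}"
  local cmd="docker logs"
  if [ -n "$since" ]; then
    cmd="$cmd --since $since"
  fi
  cmd="$cmd $container 2>&1"
  if [ -n "$level" ]; then
    cmd="$cmd | grep -i $level"
  fi
  echo "$cmd | tail -n $limit"
}

# Example:
#   query_logs_cmd norns-agent error 1h 20
```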

Escalation Path

  1. Self-Service: Ask Norns or check Grafana
  2. CLI Debug: SSH and run Docker commands
  3. Restart Service: docker-compose restart
  4. Check Dependencies: Verify postgres, redis are healthy
  5. Full Restart: docker-compose down && docker-compose up -d
  6. Check Logs: Review for root cause
  7. Restore Backup: If data corruption

See Also

  • [[Query-Logs]] - Detailed log searching
  • [[View-LLM-Traces]] - Debug AI issues
  • [[Check-System-Health]] - Overall monitoring
  • [[../Observability]] - Grafana dashboards