Debug Failing Service

Troubleshoot and fix container issues using voice, chat, or CLI.

Overview

When a service fails, you can diagnose and fix it through:

Voice/Chat: Ask Norns to check container status and logs
CLI: Direct Docker commands
Grafana: Visual dashboards and alerts

Via Voice or Chat

Check Container Status

You: "Is postgres healthy?"

Norns: "Let me check... The postgres container is running and healthy. It has been up for 5 days. CPU usage is at 3% and memory at 512MB."

Check All Containers

You: "Are any containers unhealthy?"

Norns: "I found 2 containers with issues. The spire-agent container is in a restart loop with exit code 1. The uptime-kuma container is running but not responding to health checks."

Query Logs

You: "Show me the recent errors from norns-agent"

Norns: "Here are the last 5 errors from norns-agent:

10:23 - ConnectionError: Redis connection refused
10:23 - RetryError: Max retries exceeded
10:24 - ConnectionError: Redis connection refused ..."

Ask for Help

You: "Why is redis not working?"

Norns: "Let me investigate... Redis appears to have stopped. The last log entry shows an out-of-memory error. The container exited 10 minutes ago. Would you like me to restart it?"

You: "Yes, restart redis"

Norns: "I have restarted the redis container. It is now running and healthy. The connected services should recover automatically."

Via CLI

Quick Health Check

# SSH to odin
ssh ravenhelm@100.115.101.81

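# List containers that are not reporting healthy
# (match "(healthy)" exactly so "(unhealthy)" rows are not filtered out)
docker ps --format "table {{.Names}}\t{{.Status}}" | grep -v "(healthy)"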

# Check specific container
docker inspect postgres --format "{{.State.Health.Status}}"
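
The grep approach above can also be expressed with Docker's own filters. The following is a minimal sketch, not part of the platform tooling: the function name problem_containers is hypothetical, and it assumes the containers of interest define health checks.

```shell
# Hypothetical helper: list containers that are unhealthy or stuck restarting.
# Uses the built-in "health" and "status" filters of `docker ps`.
problem_containers() {
  {
    # Running containers whose health check is failing
    docker ps --filter "health=unhealthy" --format '{{.Names}}\t{{.Status}}'
    # Containers caught in a restart loop (-a includes non-running states)
    docker ps -a --filter "status=restarting" --format '{{.Names}}\t{{.Status}}'
  } | sort -u
}
```

Running problem_containers on odin prints one line per affected container, or nothing when everything is healthy.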

View Logs

# Last 50 lines
docker logs norns-agent --tail=50

# Follow logs in real-time
docker logs -f norns-agent

# Filter for errors
docker logs norns-agent 2>&1 | grep -i error | tail -20
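
When a container has been running for days, limiting the search to a recent time window keeps the output manageable. This sketch wraps the error filter above with the --since option of docker logs; the function name recent_errors and its defaults (1h, pattern "error") are assumptions for illustration.

```shell
# Hypothetical helper: show the last 20 matching log lines from a time window.
# Usage: recent_errors <container> [since] [pattern]
recent_errors() {
  container="$1"
  since="${2:-1h}"        # relative duration such as 30m or 1h
  pattern="${3:-error}"   # case-insensitive grep pattern
  # Merge stderr into stdout because many services log errors to stderr
  docker logs --since "$since" "$container" 2>&1 | grep -i -- "$pattern" | tail -n 20
}
```

For example, recent_errors norns-agent 30m shows errors from the last 30 minutes only.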

Restart Service

# Simple restart
docker restart norns-agent

# Full restart via compose
cd ~/ravenhelm/docs/AI-ML-Platform/norns-agent
docker-compose restart

# Recreate container
docker-compose up -d --force-recreate

Check Resources

# Container resource usage
docker stats --no-stream

# System disk space
df -h

# Docker disk usage
docker system df

Common Issues and Solutions

Container in Restart Loop

Symptoms: Status shows "Restarting" repeatedly

Diagnosis:

# Check exit code
docker inspect <container> --format "{{.State.ExitCode}}"

# View last logs before crash
docker logs <container> --tail=100

Common Causes:

Missing environment variables
Database connection failed
Port already in use
Out of memory
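
The exit code often narrows down which of these causes applies. The mapping below is a rough sketch based on common shell and Docker conventions (128 + signal number for signal exits); the function name explain_exit_code is hypothetical, and an application may use these codes differently.

```shell
# Hypothetical helper: translate a container exit code into a likely cause.
# Usage: explain_exit_code <code>
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly; check the restart policy and entrypoint" ;;
    1)   echo "application error; check logs for missing config or failed connections" ;;
    125) echo "the docker command itself failed before the container started" ;;
    126) echo "entrypoint command found but not executable" ;;
    127) echo "entrypoint command not found" ;;
    137) echo "killed by SIGKILL (128+9); often an OOM kill" ;;
    139) echo "segmentation fault (128+11)" ;;
    143) echo "terminated by SIGTERM (128+15)" ;;
    *)   echo "exit code $1; check the application's documentation" ;;
  esac
}
```

It can be combined with the inspect command above: explain_exit_code "$(docker inspect <container> --format '{{.State.ExitCode}}')".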

Solution:

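# Check environment (docker exec needs a running container, so read the
# configured variables with docker inspect, which also works mid-restart)
docker inspect <container> --format '{{range .Config.Env}}{{println .}}{{end}}' | grep -i password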

# Check port conflicts
lsof -i :<port>

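# Increase the memory limit (in docker-compose.yml, under the service definition)
# Note: docker-compose v1 applies deploy limits only when run with --compatibility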
deploy:
  resources:
    limits:
      memory: 4G

Database Connection Failed

Symptoms: "Connection refused" or "FATAL: password authentication failed"

Diagnosis:

# Check postgres is running
docker ps | grep postgres

# Test connection
docker exec -it postgres psql -U ravenhelm -c "SELECT 1"

Solution:

# Restart postgres
docker restart postgres

# Check health status; repeat until it reports "healthy"
docker inspect postgres --format "{{.State.Health.Status}}"

# Restart dependent services
docker restart norns-agent bifrost-api
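
Restarting dependent services before postgres is ready can put them straight back into a failure state. The sketch below polls the health status until it reports healthy or a timeout expires; the function name wait_healthy, the 60-second default, and the 2-second polling interval are assumptions, and the container must define a health check.

```shell
# Hypothetical helper: block until a container reports "healthy".
# Usage: wait_healthy <container> [timeout_seconds]
# Returns 0 once healthy, 1 if the timeout expires first.
wait_healthy() {
  container="$1"
  timeout="${2:-60}"
  elapsed=0
  status=""
  while [ "$elapsed" -lt "$timeout" ]; do
    status="$(docker inspect "$container" --format '{{.State.Health.Status}}' 2>/dev/null)"
    if [ "$status" = "healthy" ]; then
      return 0
    fi
    sleep 2
    elapsed=$((elapsed + 2))
  done
  echo "$container not healthy after ${timeout}s (last status: ${status:-unknown})" >&2
  return 1
}
```

A typical use is wait_healthy postgres 120 && docker restart norns-agent bifrost-api, so the dependents restart only after the database is accepting connections.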

Out of Memory

Symptoms: Container killed, exit code 137

Diagnosis:

# Check container memory limit (reported in bytes; 0 means no limit is set)
docker inspect <container> --format "{{.HostConfig.Memory}}"

# Check host memory
free -h

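# Check Docker events for OOM kills in the last 24 hours
# (--until makes the command exit instead of streaming new events)
docker events --since 24h --until "$(date +%s)" --filter "event=oom"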
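
Docker also records on the container itself whether the kernel OOM killer ended it, in the State.OOMKilled field. The two helpers below are a small sketch for reading that flag and for turning the byte value from HostConfig.Memory into something readable; both function names are hypothetical.

```shell
# Hypothetical helper: succeed if Docker recorded an OOM kill for the container.
# Usage: was_oom_killed <container>
was_oom_killed() {
  [ "$(docker inspect "$1" --format '{{.State.OOMKilled}}')" = "true" ]
}

# Hypothetical helper: format a HostConfig.Memory value (bytes) for reading.
# Usage: memory_limit_human <bytes>
memory_limit_human() {
  if [ "$1" -eq 0 ]; then
    echo "unlimited"
  else
    echo "$(( $1 / 1024 / 1024 )) MiB"
  fi
}
```

For example, was_oom_killed redis && echo "redis was OOM killed" confirms the cause before raising the limit.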

Solution:

# Increase container limit
# In docker-compose.yml:
deploy:
  resources:
    limits:
      memory: 8G

# Or increase the Colima VM's memory (GiB) and CPU allocation
colima stop
colima start --memory 24 --cpu 8

Network Issues

Symptoms: "Name or service not known" or connection timeouts

Diagnosis:

# Check network exists
docker network ls | grep ravenhelm_net

# Check container is on network
docker inspect <container> --format "{{.NetworkSettings.Networks}}"

# Test DNS resolution
docker exec traefik nslookup norns-agent
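
The raw output of the Networks inspection above is a Go map dump and can be hard to read. This sketch prints only the network names and checks for one of them; the function name on_network is hypothetical.

```shell
# Hypothetical helper: succeed if a container is attached to a given network.
# Usage: on_network <container> <network>
on_network() {
  # Print one attached network name per line, then look for an exact match
  docker inspect "$1" --format '{{range $k, $v := .NetworkSettings.Networks}}{{println $k}}{{end}}' \
    | grep -qxF -- "$2"
}
```

For example, on_network norns-agent ravenhelm_net || echo "norns-agent is not on ravenhelm_net" flags a container that needs reconnecting.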

Solution:

# Reconnect to network
docker network connect ravenhelm_net <container>

# Recreate the stack so its containers rejoin the network
docker-compose down && docker-compose up -d

Tool Reference

container_status

{
  "name": "container_status",
  "domain": "observability",
  "input_schema": {
    "container_name": "string (optional, all if omitted)",
    "include_resources": "boolean (default true)"
  }
}

query_logs

{
  "name": "query_logs",
  "domain": "observability",
  "input_schema": {
    "container": "string (required)",
    "level": "error|warn|info|debug (optional)",
    "since": "duration like 1h, 30m (optional)",
    "limit": "number (default 50)"
  }
}

Escalation Path

Self-Service: Ask Norns or check Grafana
CLI Debug: SSH and run Docker commands
Restart Service: docker-compose restart
Check Dependencies: Verify postgres, redis are healthy
Full Restart: docker-compose down && docker-compose up -d
Check Logs: Review for root cause
Restore Backup: If data corruption
