# Host High CPU Runbook

## Alert Details

- Alert Name: HostHighCpuUsage / HostCriticalCpuUsage
- Severity: Warning (>80%) / Critical (>95%)
- Metrics: `node_cpu_seconds_total`
## Initial Assessment

### 1. Verify Current CPU Usage

```sh
# Check CPU usage from Prometheus
curl -s "http://localhost:9090/api/v1/query?query=100-(avg(rate(node_cpu_seconds_total{mode='idle'}[5m]))*100)"

# On the host directly (via colima ssh)
colima ssh -- top -bn1 | head -20
```
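The raw query above returns JSON; a small sketch of turning that into a threshold check, assuming `jq` is installed and Prometheus is reachable on localhost:9090. The `classify_cpu` helper is hypothetical and simply maps the value onto this runbook's alert thresholds (>80% warning, >95% critical):

```sh
# classify_cpu: map a CPU percentage onto the runbook's alert thresholds.
classify_cpu() {
  # $1 = CPU usage percentage (may be fractional)
  awk -v v="$1" 'BEGIN {
    if (v + 0 > 95)      print "critical"
    else if (v + 0 > 80) print "warning"
    else                 print "ok"
  }'
}

# Live usage (assumes Prometheus and jq; .data.result[0].value[1] is the
# standard instant-query value field):
# cpu=$(curl -s "http://localhost:9090/api/v1/query?query=100-(avg(rate(node_cpu_seconds_total{mode='idle'}[5m]))*100)" \
#   | jq -r '.data.result[0].value[1]')
# echo "CPU: ${cpu}% -> $(classify_cpu "$cpu")"
```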
### 2. Identify Top CPU Consumers

```sh
# Check container CPU usage
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}" | sort -k2 -rh | head -15

# Check process CPU on the host
colima ssh -- ps aux --sort=-%cpu | head -15
```
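To go from "top consumers" to an actionable shortlist, the `docker stats` output can be filtered numerically. A sketch; `over_threshold` is a hypothetical helper, not a Docker feature:

```sh
# over_threshold: read "<name> <cpu%>" lines on stdin, print names whose
# CPU percentage exceeds the threshold given as $1.
over_threshold() {
  awk -v t="$1" '{ p = $2; sub(/%$/, "", p); if (p + 0 > t + 0) print $1 }'
}

# Live usage (assumes Docker is running):
# docker stats --no-stream --format '{{.Name}} {{.CPUPerc}}' | over_threshold 50
```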
## Common Causes

### Container Runaway Process

- Identify the high-CPU container from `docker stats`
- Check container logs: `docker logs <container> --tail 100`
- If safe to restart: `docker restart <container>`
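The steps above can be sketched as one helper that captures recent logs before restarting, so evidence survives the restart. Both the `safe_restart` name and the dry-run flag are illustrative, not part of any tool:

```sh
# safe_restart: grab the last 100 log lines, then restart the container.
# Pass "echo" as the second argument for a dry run that only prints commands.
safe_restart() {
  name="$1"; run="${2:-}"
  # In real use, redirect the logs to a file for the incident record.
  $run docker logs "$name" --tail 100
  $run docker restart "$name"
}

# Dry run first, then for real:
# safe_restart mycontainer echo
# safe_restart mycontainer
```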
### Build/Compilation Job

- Check if a CI/CD build is running: `docker ps | grep -i build`
- If expected, wait for completion
- If stuck, check the runner logs
### Database Query Load

- Check PostgreSQL active queries:

  ```sql
  SELECT pid, now() - pg_stat_activity.query_start AS duration, query
  FROM pg_stat_activity
  WHERE state != 'idle'
  ORDER BY duration DESC;
  ```

- Kill long-running queries if needed (substitute the pid reported by the query above):

  ```sql
  SELECT pg_cancel_backend(pid);
  ```
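Rather than cancelling pids one by one, the two queries above can be combined into a single statement that cancels everything past a duration threshold. A sketch; `cancel_sql` is a hypothetical helper and the connection parameters in the usage line are placeholders (`pg_cancel_backend` cancels the query only, while `pg_terminate_backend` would drop the whole connection):

```sh
# cancel_sql: emit a statement cancelling all non-idle queries running
# longer than $1 minutes. Pipe the output into psql to execute it.
cancel_sql() {
  cat <<EOF
SELECT pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE state != 'idle'
  AND now() - query_start > interval '$1 minutes';
EOF
}

# Live usage (placeholder connection parameters):
# cancel_sql 5 | psql -h localhost -U postgres
```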
### Memory Pressure (CPU spent in swap)

- Check if the host is swapping: `colima ssh -- free -h`
- If swap usage is high, see the Host-High-Memory runbook
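"High swap usage" can be made concrete by computing the used percentage from `free` output. A sketch; `swap_pct` is a hypothetical helper that parses the byte-mode output:

```sh
# swap_pct: read `free -b` output on stdin, print used swap as a whole-number
# percentage of total (0 when the host has no swap configured).
swap_pct() {
  awk '/^Swap:/ { if ($2 == 0) print 0; else printf "%d\n", $3 * 100 / $2 }'
}

# Live usage:
# colima ssh -- free -b | swap_pct
```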
## Remediation Steps

### Safe Actions (Automated)
- Restart non-critical containers (nginx, redis, grafana)
- Scale down non-essential workloads
- Clear caches if memory pressure is contributing
### Requires Human Approval
- Restart database containers
- Kill active database connections
- Scale down production services
## Escalation Criteria
- CPU > 95% for more than 10 minutes
- Multiple hosts affected
- Critical services degraded
- Unknown root cause after initial investigation
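The first criterion (CPU > 95% for more than 10 minutes) is exactly the shape of a Prometheus alerting rule, and the HostCriticalCpuUsage alert named at the top presumably encodes it. A sketch of such a rule, assuming node_exporter metric names and the expression used earlier in this runbook; labels and annotations are illustrative:

```yaml
groups:
  - name: host-cpu
    rules:
      - alert: HostCriticalCpuUsage
        expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Host CPU above 95% for 10 minutes"
```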
## Post-Incident
- Document root cause in incident timeline
- Add preventive alerts if new pattern identified
- Update this runbook if new resolution steps discovered