Host High Load Runbook
Alert Details
- Alert Name: HostHighLoad
- Severity: Warning
- Metrics: node_load5 (5-minute load average)
Initial Assessment
1. Check Current Load
# From Prometheus
curl -s "http://localhost:9090/api/v1/query?query=node_load5"
# On host
colima ssh -- uptime
colima ssh -- cat /proc/loadavg
2. Understand the Numbers
- Load = average number of processes running, waiting for CPU, or blocked in uninterruptible I/O
- Rule of thumb: Load per CPU > 1.0 means processes are waiting
- Our alert fires when load per CPU > 2.0 for 5 minutes
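The per-CPU load the alert evaluates can be reproduced ad hoc by dividing load by the CPU count (a sketch; the deployed alert rule may be written differently):
# Load per CPU; the alert threshold is 2.0
curl -sG "http://localhost:9090/api/v1/query" --data-urlencode "query=node_load5 / scalar(count(node_cpu_seconds_total{mode='idle'}))"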
3. Check CPU Count
colima ssh -- nproc
# Or from Prometheus:
curl -s "http://localhost:9090/api/v1/query?query=count(node_cpu_seconds_total{mode='idle'})"
Identify the Cause
CPU-Bound Load
# High %cpu in top
colima ssh -- top -bn1 | head -20
# Container CPU usage
docker stats --no-stream --format "{{.Name}}\t{{.CPUPerc}}" | sort -k2 -rh
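Once a container stands out, its processes can be listed directly (replace <container_name> with the name reported by docker stats):
# Processes inside the suspect container
docker top <container_name>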
I/O-Bound Load
# Check I/O wait
colima ssh -- iostat -x 1 5
# Check disk latency from Prometheus
curl -s "http://localhost:9090/api/v1/query?query=rate(node_disk_io_time_seconds_total[5m])"
# Find I/O heavy processes
colima ssh -- iotop -btoqqq -n 5
Memory Pressure (Swapping)
# Check swap activity
colima ssh -- vmstat 1 5
# If si/so columns show activity, memory is the issue
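A couple of quick confirmations, assuming the standard node_exporter memory collector is enabled:
# Host memory and swap summary
colima ssh -- free -m
# Available memory from Prometheus
curl -s "http://localhost:9090/api/v1/query?query=node_memory_MemAvailable_bytes"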
Common Causes
Container Build Jobs
- Check for running builds:
docker ps | grep build
- Wait for completion or cancel if stuck
Database Heavy Queries
- Check PostgreSQL:
SELECT pid, now() - query_start AS duration, state, query
FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC;
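If PostgreSQL runs as a container, the query can be executed via docker exec; the container name and user below (postgres) are assumptions, adjust to the actual deployment:
docker exec -it postgres psql -U postgres -c \
  "SELECT pid, now() - query_start AS duration, state, query
   FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC;"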
Disk I/O Bottleneck
- Usually caused by:
- Database queries
- Large file operations
- Container logging
- See Host-High-Disk runbook for cleanup
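To gauge whether container logging is contributing, the size of Docker's log directory inside the VM can be checked (path assumes the default Docker data root):
colima ssh -- sudo du -sh /var/lib/docker/containers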
Too Many Concurrent Containers
- Check container count:
docker ps | wc -l
- Consider stopping non-essential containers
Remediation Steps
Safe Actions (Automated)
- Restart containers showing high CPU in docker stats
- Clear Redis cache to reduce I/O
- Stop non-essential background jobs
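Hedged command sketches for the safe actions; container names are placeholders, and FLUSHDB discards cached data:
# Restart a container pinned on CPU (name taken from docker stats)
docker restart <container_name>
# Clear the Redis cache (assumes a container named "redis" and the default database)
docker exec redis redis-cli FLUSHDB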
Requires Human Approval
- Restart database containers
- Kill database queries
- Reduce parallelism in CI/CD jobs
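For reference once approval is granted, PostgreSQL queries are stopped by pid (taken from the pg_stat_activity query above):
-- Cancel the running query only
SELECT pg_cancel_backend(<pid>);
-- Terminate the whole backend connection
SELECT pg_terminate_backend(<pid>);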
Escalation Criteria
- Load > 4x CPU count sustained
- I/O wait > 50%
- Critical services unresponsive
- Root cause unclear
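The first two criteria can be checked straight from Prometheus (sketches reusing the metrics above):
# Load above 4x CPU count
curl -sG "http://localhost:9090/api/v1/query" --data-urlencode "query=node_load5 > 4 * scalar(count(node_cpu_seconds_total{mode='idle'}))"
# Fraction of CPU time in iowait over 5 minutes (0.5 = 50%)
curl -sG "http://localhost:9090/api/v1/query" --data-urlencode "query=avg(rate(node_cpu_seconds_total{mode='iowait'}[5m]))"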
Post-Incident
- Identify which process/container caused load spike
- Consider resource limits for containers
- Review CI/CD parallelism settings
- Check if Colima needs more CPU allocation
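Command sketches for the resource-limit and Colima items; the limit and CPU/memory values are examples, not recommendations, and depending on the Colima version the VM may need to be stopped before new values take effect:
# Cap a chronically busy container
docker update --cpus 2 <container_name>
# Inspect and raise the Colima VM allocation
colima list
colima stop
colima start --cpu 6 --memory 8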