Host High Load Runbook

Alert Details

  • Alert Name: HostHighLoad
  • Severity: Warning
  • Metrics: node_load5 (5-minute load average)

Initial Assessment

1. Check Current Load

# From Prometheus
curl -s "http://localhost:9090/api/v1/query?query=node_load5"

# On host
colima ssh -- uptime
colima ssh -- cat /proc/loadavg

2. Understand the Numbers

  • Load = average number of processes that are running, waiting for CPU, or in uninterruptible sleep (usually waiting on disk I/O)
  • Rule of thumb: load per CPU > 1.0 means processes are waiting
  • Our alert fires when load per CPU > 2.0 for 5 minutes
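
That firing condition can be reproduced ad hoc with a query along these lines (a sketch of the likely alert expression; the exact rule in the Prometheus config may use different label matchers):

# Load per CPU based on the 5-minute load average; the alert threshold is 2.0
curl -s -G "http://localhost:9090/api/v1/query" --data-urlencode "query=node_load5 / count(node_cpu_seconds_total{mode='idle'})"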

3. Check CPU Count

colima ssh -- nproc
# Or from Prometheus (use --data-urlencode so the braces in the PromQL survive curl's URL globbing):
curl -s -G "http://localhost:9090/api/v1/query" --data-urlencode "query=count(node_cpu_seconds_total{mode='idle'})"

Identify the Cause

CPU-Bound Load

# High %cpu in top
colima ssh -- top -bn1 | head -20

# Container CPU usage, highest first (plain format so no header row ends up in the sort)
docker stats --no-stream --format "{{.Name}}\t{{.CPUPerc}}" | sort -k2 -rh
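
If the top snapshot is noisy, a per-process listing sorted by CPU can also help. A minimal sketch, assuming the Colima VM ships a procps-style ps that supports --sort (BusyBox ps does not):

# Top CPU consumers inside the VM
colima ssh -- ps aux --sort=-%cpu | head -15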

I/O-Bound Load

# Check I/O wait
colima ssh -- iostat -x 1 5

# Disk busy time (fraction of each second spent on I/O) from Prometheus
# (--data-urlencode keeps the [5m] range selector out of curl's URL globbing)
curl -s -G "http://localhost:9090/api/v1/query" --data-urlencode "query=rate(node_disk_io_time_seconds_total[5m])"

# Find I/O-heavy processes (iotop needs root)
colima ssh -- sudo iotop -btoqqq -n 5
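
I/O wait can also be confirmed from Prometheus with the standard node_exporter CPU metric. A sketch (assumes the default labels; compare the result against the 50% escalation threshold below):

# Average percentage of CPU time spent in iowait over the last 5 minutes
curl -s -G "http://localhost:9090/api/v1/query" --data-urlencode "query=avg(rate(node_cpu_seconds_total{mode='iowait'}[5m])) * 100"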

Memory Pressure (Swapping)

# Check swap activity
colima ssh -- vmstat 1 5

# If si/so columns show activity, memory is the issue
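
Swap activity is also visible in Prometheus, assuming node_exporter's vmstat collector is enabled with its default field set (which includes pswpin/pswpout). A sketch:

# Pages swapped in and out per second over the last 5 minutes; anything consistently above zero means swapping
curl -s -G "http://localhost:9090/api/v1/query" --data-urlencode "query=rate(node_vmstat_pswpin[5m]) + rate(node_vmstat_pswpout[5m])"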

Common Causes

Container Build Jobs

  1. Check for running builds: docker ps | grep build
  2. Wait for completion or cancel if stuck
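
If a build is clearly stuck, it can be stopped by container ID. A minimal sketch; the name filter is an assumption about how build containers are named here:

# List containers whose name contains "build", then stop the stuck one
docker ps --filter "name=build" --format "{{.ID}}\t{{.Names}}\t{{.Status}}"
docker stop <container-id>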

Database Heavy Queries

  1. Check PostgreSQL:
    SELECT pid, now() - query_start AS duration, state, query
    FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC;
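
If a runaway query is found, it can be cancelled by PID. Note this falls under "Requires Human Approval" below. A sketch, assuming PostgreSQL runs in a container named postgres with the default superuser (adjust to the actual deployment):

# Ask the query to stop (pg_cancel_backend); escalate to pg_terminate_backend if it ignores the cancel
docker exec postgres psql -U postgres -c "SELECT pg_cancel_backend(<pid>);"
docker exec postgres psql -U postgres -c "SELECT pg_terminate_backend(<pid>);"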

Disk I/O Bottleneck

  1. Usually caused by:
    • Database queries
    • Large file operations
    • Container logging (see the log-size check below)
  2. See Host-High-Disk runbook for cleanup
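
To check whether container logging is part of the problem, look at the per-container log file sizes inside the VM. A sketch assuming the default json-file logging driver and the standard /var/lib/docker path:

# Run inside the VM (colima ssh); root is needed to read the containers directory
sudo sh -c 'du -sh /var/lib/docker/containers/*/*-json.log | sort -rh | head -10'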

Too Many Concurrent Containers

  1. Check container count: docker ps -q | wc -l (-q so the header row is not counted)
  2. Consider stopping non-essential containers

Remediation Steps

Safe Actions (Automated)

  • Restart containers showing high CPU in docker stats (see the sketch below)
  • Clear Redis cache to reduce I/O
  • Stop non-essential background jobs
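
A minimal sketch of the first two actions; the container and Redis names are assumptions and should be adapted to what docker stats actually shows:

# Restart a container that docker stats shows pegged on CPU (name is hypothetical)
docker restart <high-cpu-container>

# Clear the Redis cache (assumes a container named "redis"; FLUSHALL empties every Redis database)
docker exec redis redis-cli FLUSHALL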

Requires Human Approval

  • Restart database containers
  • Kill database queries
  • Reduce parallelism in CI/CD jobs

Escalation Criteria

  • Load > 4x CPU count sustained
  • I/O wait > 50%
  • Critical services unresponsive
  • Root cause unclear

Post-Incident

  1. Identify which process/container caused load spike
  2. Consider resource limits for containers
  3. Review CI/CD parallelism settings
  4. Check if Colima needs more CPU allocation
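
A sketch for items 2 and 4; the CPU/memory values and the container name are illustrative assumptions:

# Cap an offending container's CPU allowance going forward (container name is hypothetical)
docker update --cpus 2 <noisy-container>

# Give the Colima VM more resources if the host itself is undersized
colima stop
colima start --cpu 6 --memory 8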