Host High Load Runbook

Alert Details

  • Alert Name: HostHighLoad
  • Severity: Warning
  • Metrics: node_load5 (5-minute load average)

Initial Assessment

1. Check Current Load

# From Prometheus
curl -s "http://localhost:9090/api/v1/query?query=node_load5"

# On host
colima ssh -- uptime
colima ssh -- cat /proc/loadavg

2. Understand the Numbers

  • Load = average number of processes that are running, waiting for CPU, or in uninterruptible sleep (usually waiting on disk I/O)
  • Rule of thumb: load per CPU > 1.0 means processes are waiting
  • Our alert fires when load per CPU > 2.0 for 5 minutes
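
That firing condition can be reproduced ad hoc with a query along these lines (a sketch of the likely alert expression; the exact rule in the Prometheus config may use different label matchers):

# Load per CPU based on the 5-minute load average; the alert threshold is 2.0
curl -s -G "http://localhost:9090/api/v1/query" --data-urlencode "query=node_load5 / count(node_cpu_seconds_total{mode='idle'})"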

3. Check CPU Count

colima ssh -- nproc
# Or from Prometheus (use --data-urlencode so the braces in the PromQL survive curl's URL globbing):
curl -s -G "http://localhost:9090/api/v1/query" --data-urlencode "query=count(node_cpu_seconds_total{mode='idle'})"

Identify the Cause

CPU-Bound Load

# High %cpu in top
colima ssh -- top -bn1 | head -20

# Container CPU usage, highest first (plain format so no header row ends up in the sort)
docker stats --no-stream --format "{{.Name}}\t{{.CPUPerc}}" | sort -k2 -rh
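
If the top snapshot is noisy, a per-process listing sorted by CPU can also help. A minimal sketch, assuming the Colima VM ships a procps-style ps that supports --sort (BusyBox ps does not):

# Top CPU consumers inside the VM
colima ssh -- ps aux --sort=-%cpu | head -15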

I/O-Bound Load

# Check I/O wait
colima ssh -- iostat -x 1 5

# Disk busy time (fraction of each second spent on I/O) from Prometheus
# (--data-urlencode keeps the [5m] range selector out of curl's URL globbing)
curl -s -G "http://localhost:9090/api/v1/query" --data-urlencode "query=rate(node_disk_io_time_seconds_total[5m])"

# Find I/O-heavy processes (iotop needs root)
colima ssh -- sudo iotop -btoqqq -n 5
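
I/O wait can also be confirmed from Prometheus with the standard node_exporter CPU metric. A sketch (assumes the default labels; compare the result against the 50% escalation threshold below):

# Average percentage of CPU time spent in iowait over the last 5 minutes
curl -s -G "http://localhost:9090/api/v1/query" --data-urlencode "query=avg(rate(node_cpu_seconds_total{mode='iowait'}[5m])) * 100"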

Memory Pressure (Swapping)

# Check swap activity
colima ssh -- vmstat 1 5

# If si/so columns show activity, memory is the issue
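
Swap activity is also visible in Prometheus, assuming node_exporter's vmstat collector is enabled with its default field set (which includes pswpin/pswpout). A sketch:

# Pages swapped in and out per second over the last 5 minutes; anything consistently above zero means swapping
curl -s -G "http://localhost:9090/api/v1/query" --data-urlencode "query=rate(node_vmstat_pswpin[5m]) + rate(node_vmstat_pswpout[5m])"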

Common Causes

Container Build Jobs

  1. Check for running builds: docker ps | grep build
  2. Wait for completion or cancel if stuck
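
If a build is clearly stuck, it can be stopped by container ID. A minimal sketch; the name filter is an assumption about how build containers are named here:

# List containers whose name contains "build", then stop the stuck one
docker ps --filter "name=build" --format "{{.ID}}\t{{.Names}}\t{{.Status}}"
docker stop <container-id>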

Database Heavy Queries

  1. Check PostgreSQL:
    SELECT pid, now() - query_start AS duration, state, query
    FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC;
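
If a runaway query is found, it can be cancelled by PID. Note this falls under "Requires Human Approval" below. A sketch, assuming PostgreSQL runs in a container named postgres with the default superuser (adjust to the actual deployment):

# Ask the query to stop (pg_cancel_backend); escalate to pg_terminate_backend if it ignores the cancel
docker exec postgres psql -U postgres -c "SELECT pg_cancel_backend(<pid>);"
docker exec postgres psql -U postgres -c "SELECT pg_terminate_backend(<pid>);"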

Disk I/O Bottleneck

  1. Usually caused by:
    • Database queries
    • Large file operations
    • Container logging (see the log-size check below)
  2. See Host-High-Disk runbook for cleanup
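
To check whether container logging is part of the problem, look at the per-container log file sizes inside the VM. A sketch assuming the default json-file logging driver and the standard /var/lib/docker path:

# Run inside the VM (colima ssh); root is needed to read the containers directory
sudo sh -c 'du -sh /var/lib/docker/containers/*/*-json.log | sort -rh | head -10'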

Too Many Concurrent Containers

  1. Check container count: docker ps -q | wc -l (-q so the header row is not counted)
  2. Consider stopping non-essential containers

Remediation Steps

Safe Actions (Automated)

  • Restart containers showing high CPU in docker stats (see the sketch below)
  • Clear Redis cache to reduce I/O
  • Stop non-essential background jobs
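
A minimal sketch of the first two actions; the container and Redis names are assumptions and should be adapted to what docker stats actually shows:

# Restart a container that docker stats shows pegged on CPU (name is hypothetical)
docker restart <high-cpu-container>

# Clear the Redis cache (assumes a container named "redis"; FLUSHALL empties every Redis database)
docker exec redis redis-cli FLUSHALL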

Requires Human Approval

  • Restart database containers
  • Kill database queries
  • Reduce parallelism in CI/CD jobs

Escalation Criteria

  • Load > 4x CPU count sustained
  • I/O wait > 50%
  • Critical services unresponsive
  • Root cause unclear

Post-Incident

  1. Identify which process/container caused load spike
  2. Consider resource limits for containers
  3. Review CI/CD parallelism settings
  4. Check if Colima needs more CPU allocation
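
A sketch for items 2 and 4; the CPU/memory values and the container name are illustrative assumptions:

# Cap an offending container's CPU allowance going forward (container name is hypothetical)
docker update --cpus 2 <noisy-container>

# Give the Colima VM more resources if the host itself is undersized
colima stop
colima start --cpu 6 --memory 8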