Skip to main content

Host High CPU Runbook

Alert Details

  • Alert Name: HostHighCpuUsage / HostCriticalCpuUsage
  • Severity: Warning (>80%) / Critical (>95%)
  • Metrics: node_cpu_seconds_total

Initial Assessment

1. Verify Current CPU Usage

# Check CPU usage from Prometheus
curl -s "http://localhost:9090/api/v1/query?query=100-(avg(rate(node_cpu_seconds_total{mode='idle'}[5m]))*100)"

# On host directly (via colima ssh)
colima ssh -- top -bn1 | head -20

2. Identify Top CPU Consumers

# Check container CPU usage
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}" | sort -k2 -rh | head -15

# Check process CPU on host
colima ssh -- ps aux --sort=-%cpu | head -15

Common Causes

Container Runaway Process

  1. Identify the high-CPU container from docker stats
  2. Check container logs: docker logs <container> --tail 100
  3. If safe to restart: docker restart <container>

Build/Compilation Job

  1. Check if CI/CD build is running: docker ps | grep -i build
  2. If expected, wait for completion
  3. If stuck, check runner logs

Database Query Load

  1. Check PostgreSQL active queries:
    SELECT pid, now() - pg_stat_activity.query_start AS duration, query
    FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC;
  2. Kill long-running queries if needed: SELECT pg_cancel_backend(pid);

Memory Pressure (CPU in swap)

  1. Check if swapping: colima ssh -- free -h
  2. If high swap usage, see Host-High-Memory runbook

Remediation Steps

Safe Actions (Automated)

  • Restart non-critical containers (nginx, redis, grafana)
  • Scale down non-essential workloads
  • Clear caches if memory pressure is contributing

Requires Human Approval

  • Restart database containers
  • Kill active database connections
  • Scale down production services

Escalation Criteria

  • CPU > 95% for more than 10 minutes
  • Multiple hosts affected
  • Critical services degraded
  • Unknown root cause after initial investigation

Post-Incident

  1. Document root cause in incident timeline
  2. Add preventive alerts if new pattern identified
  3. Update this runbook if new resolution steps discovered