Host High Memory Runbook

Alert Details

  • Alert Name: HostHighMemoryUsage / HostCriticalMemoryUsage
  • Severity: Warning (>85%) / Critical (>95%)
  • Metrics: node_memory_MemAvailable_bytes, node_memory_MemTotal_bytes

Initial Assessment

1. Verify Current Memory Usage

# Check from Prometheus (--data-urlencode keeps the PromQL operators intact in the URL)
curl -s --get "http://localhost:9090/api/v1/query" --data-urlencode 'query=(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100'

# On the Docker host (the Colima VM)
colima ssh -- free -h
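
The same percentage can be computed directly from /proc/meminfo, which may help when Prometheus itself is unreachable. The following is a minimal sketch: the function names mem_used_pct and classify_mem_alert are illustrative (not part of any tool), and the thresholds mirror the alert severities listed above.

```shell
# Hypothetical helpers; the names are illustrative.

# Read /proc/meminfo-style text on stdin and print used memory as a
# percentage, using the same formula as the Prometheus query:
#   (1 - MemAvailable / MemTotal) * 100
mem_used_pct() {
  awk '
    /^MemTotal:/     { total = $2 }
    /^MemAvailable:/ { avail = $2 }
    END {
      if (total <= 0) exit 1          # MemTotal missing or malformed
      printf "%.1f\n", (1 - avail / total) * 100
    }'
}

# Map a percentage to the alert severities: >95 critical, >85 warning.
classify_mem_alert() {
  awk -v pct="$1" 'BEGIN {
    if (pct > 95)      print "critical"
    else if (pct > 85) print "warning"
    else               print "ok"
  }'
}

# Usage against the Colima VM:
#   pct=$(colima ssh -- cat /proc/meminfo | mem_used_pct)
#   echo "$pct% -> $(classify_mem_alert "$pct")"
```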

2. Identify Top Memory Consumers

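# Container memory usage, highest percentage first
# (MemPerc is printed first so sort sees a single numeric key; MemUsage contains spaces)
docker stats --no-stream --format "{{.MemPerc}}\t{{.Name}}\t{{.MemUsage}}" | sort -rn | head -15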

# Process memory on host
colima ssh -- ps aux --sort=-%mem | head -15

Common Causes

Container Memory Leak

  1. Identify high-memory containers from docker stats
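  2. Check whether memory is growing over time: docker stats <container> (watch for an upward trend)
  3. Check for OOM kills: docker inspect --format '{{.State.OOMKilled}}' <container>, and scan docker logs <container> for out-of-memory errors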
  4. Restart container if safe
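
Watching docker stats by eye makes a slow leak easy to miss. As a rough sketch, the trend check in step 2 could be scripted by sampling the memory percentage a few times and comparing the last reading with the first. The helper name mem_trend and the default growth threshold of 5 percentage points are assumptions, not established values.

```shell
# Hypothetical helper: read one memory reading per line (for example the
# MemPerc column, "12.34%") and compare the last sample with the first.
# $1 is the minimum increase, in the same unit, that counts as growth
# (default 5; an assumed threshold, tune it per service).
mem_trend() {
  awk -v min_delta="${1:-5}" '
    NF == 0 { next }                          # skip blank lines
    { sub(/%$/, "", $1); v = $1 + 0; n++ }    # strip trailing % sign
    n == 1 { first = v }
    { last = v }
    END {
      if (n < 2)                         print "insufficient-data"
      else if (last - first > min_delta) print "growing"
      else                               print "stable"
    }'
}

# Usage: six samples, ten seconds apart
#   for i in 1 2 3 4 5 6; do
#     docker stats --no-stream --format '{{.MemPerc}}' <container>
#     sleep 10
#   done | mem_trend 5
```

Comparing only the first and last samples is deliberately crude; a service with periodic garbage collection may need a longer sampling window before the result means much.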

Redis Cache Growth

  1. Check Redis memory: docker exec redis redis-cli info memory
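  2. If high, clear non-essential caches. DEL does not expand glob patterns (it would look for a key literally named "cache:temp:*"), and KEYS blocks the server while it walks the whole keyspace, so list keys with --scan and pass them to DEL explicitly:
    docker exec redis redis-cli --scan --pattern "cache:*" | head -20
    docker exec redis redis-cli --scan --pattern "cache:temp:*" | xargs -n 100 docker exec redis redis-cli DEL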
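
To judge whether Redis memory is actually "high", it helps to compare used_memory with maxmemory from the INFO output. The sketch below assumes the standard used_memory, maxmemory and maxmemory_policy fields; the helper name redis_mem_summary is illustrative.

```shell
# Hypothetical helper: summarize `redis-cli info memory` output read on
# stdin. Redis terminates INFO lines with CRLF, so \r is stripped first.
# maxmemory:0 means no limit is configured.
redis_mem_summary() {
  tr -d '\r' | awk -F: '
    $1 == "used_memory"      { used = $2 }
    $1 == "maxmemory"        { max = $2 }
    $1 == "maxmemory_policy" { policy = $2 }
    END {
      if (max > 0)
        printf "used=%s max=%s pct=%.1f policy=%s\n", used, max, used / max * 100, policy
      else
        printf "used=%s max=unlimited policy=%s\n", used, policy
    }'
}

# Usage:
#   docker exec redis redis-cli info memory | redis_mem_summary
```

If the summary reports max=unlimited, Redis can grow until the host runs out of memory, which is worth noting in the incident record.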

Database Connection Pooling

  1. Check PostgreSQL connections:
    SELECT count(*) FROM pg_stat_activity;
  2. If the count approaches max_connections (SHOW max_connections;), terminate idle connections with pg_terminate_backend(pid)
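
One possible way to handle step 2 is to generate the termination statement and review it before running it. The sketch below only prints SQL. The helper name idle_terminate_sql and the 30-minute default are assumptions, as are the container name postgres and the user postgres in the usage line.

```shell
# Hypothetical helper: print (not run) SQL that terminates connections
# idle for longer than $1 minutes (default 30, an assumed threshold).
# The session running the statement is excluded via pg_backend_pid().
idle_terminate_sql() {
  minutes="${1:-30}"
  case "$minutes" in
    ''|*[!0-9]*)
      echo "minutes must be a non-negative integer" >&2
      return 1
      ;;
  esac
  cat <<EOF
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < now() - interval '$minutes minutes'
  AND pid <> pg_backend_pid();
EOF
}

# Usage (container name and user are assumptions):
#   idle_terminate_sql 30 | docker exec -i postgres psql -U postgres
```

Printing the statement instead of executing it leaves a human in the loop, in keeping with the approval requirements for database actions later in this runbook.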

Log File Growth

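  1. Check Docker log sizes (with Colima the log files live inside the VM and are root-owned, so list them there):
    docker ps -q | xargs docker inspect --format='{{.LogPath}}' | xargs colima ssh -- sudo ls -lh
  2. Truncate if needed: colima ssh -- sudo truncate -s 0 <logfile>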

Remediation Steps

Safe Actions (Automated)

  • Clear Redis cache (cache:* keys only)
  • Restart containers with known memory leaks
  • Trigger log rotation

Requires Human Approval

  • Restart database containers
  • Clear session/auth caches
  • Reduce Colima VM memory allocation

Escalation Criteria

  • Memory > 95% and not recovering
  • OOM kills occurring
  • Database or critical services affected
  • Unable to identify memory consumer

Post-Incident

  1. Document which container/process caused the issue
  2. Consider setting container memory limits
  3. Investigate memory leak in application code
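
For item 2, a limit can be applied to a running container with docker update. The wrapper below is a sketch only: the name set_mem_limit and the DRY_RUN switch are illustrative, and the right limit depends on the workload.

```shell
# Hypothetical wrapper around `docker update`. With DRY_RUN=1 it only
# prints the command. Setting --memory-swap equal to --memory prevents
# the container from using swap on top of its memory limit.
set_mem_limit() {
  container="$1"
  limit="$2"            # e.g. 512m or 2g
  if [ -z "$container" ] || [ -z "$limit" ]; then
    echo "usage: set_mem_limit <container> <limit>" >&2
    return 1
  fi
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "docker update --memory $limit --memory-swap $limit $container"
  else
    docker update --memory "$limit" --memory-swap "$limit" "$container"
  fi
}

# Usage:
#   DRY_RUN=1 set_mem_limit <container> 512m   # preview
#   set_mem_limit <container> 512m             # apply
```

A limit set this way applies to the existing container only; for a lasting fix, set it where the container is defined (docker run --memory or the equivalent Compose setting).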