# Host High Memory Runbook

## Alert Details

- Alert Name: `HostHighMemoryUsage` / `HostCriticalMemoryUsage`
- Severity: Warning (>85%) / Critical (>95%)
- Metrics: `node_memory_MemAvailable_bytes`, `node_memory_MemTotal_bytes`
## Initial Assessment

### 1. Verify Current Memory Usage

```bash
# Check from Prometheus
curl -s "http://localhost:9090/api/v1/query?query=(1-(node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes))*100"

# On host
colima ssh -- free -h
```
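The Prometheus query returns a JSON instant-query response; a small helper (a sketch, assuming `jq` is installed) extracts just the usage percentage:

```bash
# parse_usage: pull the sample value out of a Prometheus instant-query
# response read from stdin (each result's value pair is [timestamp, "value"]).
parse_usage() {
  jq -r '.data.result[0].value[1]'
}

# Usage (same query as above):
# curl -s "http://localhost:9090/api/v1/query?query=(1-(node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes))*100" | parse_usage
```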
### 2. Identify Top Memory Consumers

```bash
# Container memory usage, sorted by percentage (tab-delimited so sort
# keys on MemPerc rather than the middle of "512MiB / 7.6GiB")
docker stats --no-stream --format '{{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}' | sort -t$'\t' -k3 -rh | head -15

# Process memory on host
colima ssh -- ps aux --sort=-%mem | head -15
```
## Common Causes

### Container Memory Leak

- Identify high-memory containers from `docker stats`
- Check whether memory is growing over time: `docker stats <container>` (watch for a trend)
- Check container logs for OOM warnings
- Restart the container if it is safe to do so
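To make the trend comparable across samples, the mixed units `docker stats` prints (`MiB`, `GiB`) can be normalized first — a sketch; `mem_mib` is a hypothetical helper, not part of the runbook's tooling:

```bash
# mem_mib: convert a docker-stats figure such as "512MiB" or "1.2GiB"
# to MiB so successive samples can be compared numerically.
mem_mib() {
  echo "$1" | awk '
    /GiB/ { sub("GiB", ""); print $0 * 1024; next }
    /MiB/ { sub("MiB", ""); print $0 + 0;    next }
    /KiB/ { sub("KiB", ""); print $0 / 1024; next }
           { print 0 }'
}

# Sample every 60 s; a leak shows up as a steadily rising series:
# while true; do
#   v=$(docker stats --no-stream --format '{{.MemUsage}}' <container> | awk '{print $1}')
#   echo "$(date -u +%FT%TZ) $(mem_mib "$v")"
#   sleep 60
# done
```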
### Redis Cache Growth

- Check Redis memory: `docker exec redis redis-cli info memory`
- If high, clear non-essential caches. Note that `DEL` takes explicit key names, not glob patterns, so scan for matching keys first (`--scan` is also non-blocking, unlike `KEYS`):

```bash
# Inspect cache keys
docker exec redis redis-cli --scan --pattern "cache:*" | head -20
# Delete by feeding matching keys to DEL
docker exec redis redis-cli --scan --pattern "cache:temp:*" | xargs -r docker exec redis redis-cli DEL
```
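Since `DEL` takes explicit key names rather than glob patterns, bulk deletes are safer done by scanning for matches and feeding them through in batches — a sketch; the batch size of 100 is an assumption:

```bash
# batched_unlink: read key names on stdin and emit one UNLINK command
# per batch (echo stands in for redis-cli here so the batching is visible).
batched_unlink() {
  xargs -r -n "${1:-100}" echo UNLINK
}

# Against the runbook's Redis container (UNLINK frees memory
# asynchronously; available in Redis 4+):
# docker exec redis redis-cli --scan --pattern 'cache:temp:*' \
#   | xargs -r -n 100 docker exec redis redis-cli UNLINK
```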
### Database Connection Pooling

- Check PostgreSQL connections: `SELECT count(*) FROM pg_stat_activity;`
- Kill idle connections if excessive
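Killing idle connections can be done from the host with `pg_terminate_backend` — a sketch; the container name `postgres`, the `-U postgres` user, and the 10-minute cutoff are assumptions:

```bash
# Terminate client backends that have been idle for more than 10 minutes,
# sparing our own session.
IDLE_KILL_SQL="SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE state = 'idle'
    AND state_change < now() - interval '10 minutes'
    AND pid <> pg_backend_pid();"

# Run it inside the database container:
# docker exec postgres psql -U postgres -Atc "$IDLE_KILL_SQL"
```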
### Log File Growth

- Check Docker log sizes: `docker ps -q | xargs -I {} docker inspect --format='{{.LogPath}}' {} | xargs ls -lh`
- Truncate if needed: `truncate -s 0 <logfile>` (the log paths live inside the Colima VM, so run this via `colima ssh`)
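Truncation can be gated on size so small logs are left alone — a sketch; `truncate_if_large` and the 1 GiB threshold are assumptions:

```bash
# truncate_if_large FILE [LIMIT_BYTES]: empty the file only if it
# exceeds the limit (default 1 GiB).
truncate_if_large() {
  f=$1
  limit=${2:-1073741824}
  size=$(wc -c < "$f")
  if [ "$size" -gt "$limit" ]; then
    truncate -s 0 "$f"
  fi
}

# Apply to every container's JSON log (run via colima ssh, where the
# LogPath values are valid):
# docker ps -q \
#   | xargs -I {} docker inspect --format='{{.LogPath}}' {} \
#   | while read -r f; do truncate_if_large "$f"; done
```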
## Remediation Steps

### Safe Actions (Automated)

- Clear Redis cache (`cache:*` keys only)
- Restart containers with known memory leaks
- Trigger log rotation

### Requires Human Approval

- Restart database containers
- Clear session/auth caches
- Reduce Colima VM memory allocation
## Escalation Criteria

- Memory > 95% and not recovering
- OOM kills occurring
- Database or critical services affected
- Unable to identify memory consumer
## Post-Incident

- Document which container/process caused the issue
- Consider setting container memory limits
- Investigate memory leak in application code