Host High Memory Runbook

Alert Details

Alert Name: HostHighMemoryUsage / HostCriticalMemoryUsage
Severity: Warning (>85%) / Critical (>95%)
Metrics: node_memory_MemAvailable_bytes, node_memory_MemTotal_bytes

Initial Assessment

1. Verify Current Memory Usage

# Check from Prometheus
curl -s -G "http://localhost:9090/api/v1/query" --data-urlencode 'query=(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100'

# On host
colima ssh -- free -h
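
The same percentage can be derived on the host without Prometheus. The following is a minimal sketch, assuming /proc/meminfo is readable inside the Colima VM; the function names mem_used_pct and mem_severity are illustrative and not part of any existing tooling.

```shell
# Compute used-memory percentage from /proc/meminfo-style input on stdin,
# mirroring the alert expression: (1 - MemAvailable / MemTotal) * 100.
mem_used_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total <= 0) { exit 1 }   # no MemTotal line found
      printf "%.1f\n", (1 - avail / total) * 100
    }
  '
}

# Map a percentage to the severities listed under Alert Details.
mem_severity() {
  awk -v pct="$1" 'BEGIN {
    if (pct > 95)      print "critical"
    else if (pct > 85) print "warning"
    else               print "ok"
  }'
}

# Usage (hypothetical invocation):
#   pct=$(colima ssh -- cat /proc/meminfo | mem_used_pct)
#   echo "$pct% -> $(mem_severity "$pct")"
```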

2. Identify Top Memory Consumers

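# Container memory usage, sorted by memory percentage
# (tab-separated output without the table header, so the sort key is the third column)
docker stats --no-stream --format "{{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}" | sort -t$'\t' -k3 -rn | head -15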

# Process memory on host
colima ssh -- ps aux --sort=-%mem | head -15

Common Causes

Container Memory Leak

Identify high-memory containers from docker stats
Check whether memory is growing over time: docker stats <container> (watch for an upward trend across refreshes)
Check container logs for OOM warnings
Restart container if safe
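
Watching docker stats by eye makes a slow leak easy to miss. One way to record the trend is sketched below. It assumes docker stats reports sizes with the B, KiB, MiB, and GiB suffixes; the helper names to_mib and sample_container_mem are hypothetical.

```shell
# Convert a docker stats size such as "512MiB" or "1.5GiB" to MiB.
to_mib() {
  awk -v v="$1" 'BEGIN {
    num = v + 0                          # leading numeric part
    unit = v; sub(/^[0-9.]+/, "", unit)  # remaining suffix
    if      (unit == "B")   num /= 1048576
    else if (unit == "KiB") num /= 1024
    else if (unit == "MiB") num *= 1
    else if (unit == "GiB") num *= 1024
    else exit 1                          # unknown suffix
    printf "%.1f\n", num
  }'
}

# Sample one container's usage and print timestamped MiB values.
# Arguments: container name, sample count (default 5), interval in seconds (default 60).
sample_container_mem() {
  name=$1; count=${2:-5}; interval=${3:-60}
  i=0
  while [ "$i" -lt "$count" ]; do
    # MemUsage looks like "123.4MiB / 1.944GiB"; keep the first field only.
    usage=$(docker stats --no-stream --format '{{.MemUsage}}' "$name" | awk '{print $1}')
    printf '%s %s MiB\n' "$(date +%H:%M:%S)" "$(to_mib "$usage")"
    i=$((i + 1))
    if [ "$i" -lt "$count" ]; then sleep "$interval"; fi
  done
}

# Usage: sample_container_mem <container> 10 30
```

Values that climb steadily across samples without dropping back suggest a leak rather than a temporary spike.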

Redis Cache Growth

Check Redis memory: docker exec redis redis-cli info memory

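If high, clear non-essential caches. DEL does not expand glob patterns, so list the matching keys with --scan (KEYS blocks the server on large keyspaces) and pass them to DEL:

docker exec redis redis-cli --scan --pattern "cache:temp:*" | head -20
docker exec redis sh -c 'redis-cli --scan --pattern "cache:temp:*" | xargs -r -n 100 redis-cli DEL'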
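
To judge whether Redis memory is "high", it helps to compare used_memory against maxmemory from the INFO output. This is a rough sketch; redis_mem_summary is a hypothetical helper name, and a maxmemory of 0 is treated as "no limit configured".

```shell
# Read `redis-cli info memory` output on stdin and print used memory,
# the configured limit, and the percentage of the limit in use.
redis_mem_summary() {
  # INFO lines end in CRLF, so strip carriage returns before parsing.
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max > 0) printf "used=%.0f max=%.0f pct=%.1f\n", used, max, used / max * 100
      else         printf "used=%.0f max=unlimited\n", used
    }
  '
}

# Usage: docker exec redis redis-cli info memory | redis_mem_summary
```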

Database Connection Pooling

Check PostgreSQL connections:
```
SELECT count(*) FROM pg_stat_activity;
```
Terminate idle connections if the count is excessive (compare it against SHOW max_connections;)
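
A possible way to terminate only long-idle connections is sketched here. The container name "postgres", the database user "postgres", and the helper name idle_kill_sql are assumptions for illustration; the 15-minute default is arbitrary.

```shell
# Print SQL that terminates connections idle for longer than N minutes.
# The current session is excluded so the command does not kill itself.
idle_kill_sql() {
  minutes=${1:-15}
  case $minutes in
    ''|*[!0-9]*) echo "minutes must be a non-negative integer" >&2; return 1 ;;
  esac
  cat <<EOF
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < now() - interval '${minutes} minutes'
  AND pid <> pg_backend_pid();
EOF
}

# Usage (container and user names are assumptions):
#   idle_kill_sql 30 | docker exec -i postgres psql -U postgres
```

Review the generated SQL before piping it to psql, since terminating connections can interrupt application requests.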

Log File Growth

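Check Docker log sizes (the log files live inside the Colima VM, so list them through colima ssh):

docker ps -q | xargs docker inspect --format='{{.LogPath}}' | xargs colima ssh -- sudo ls -lh

Truncate if needed: colima ssh -- sudo truncate -s 0 <logfile>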

Remediation Steps

Safe Actions (Automated)

Clear Redis cache (cache:* keys only)
Restart containers with known memory leaks
Trigger log rotation

Requires Human Approval

Restart database containers
Clear session/auth caches
Reduce Colima VM memory allocation

Escalation Criteria

Memory > 95% and not recovering
OOM kills occurring
Database or critical services affected
Unable to identify memory consumer
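
To check the "OOM kills occurring" criterion, the kernel log and container state can both be inspected. This is a sketch that assumes the kernel logs OOM events with the usual "Out of memory: Killed process" wording; count_oom_kills is a hypothetical helper name.

```shell
# Count kernel OOM-killer events in log text read from stdin.
# Matches both host-level and cgroup-level messages; prints 0 when none are found.
count_oom_kills() {
  grep -ciE 'out of memory: kill(ed)? process' || true
}

# Usage:
#   colima ssh -- sudo dmesg | count_oom_kills
# Containers that Docker recorded as OOM-killed:
#   docker ps -aq | xargs docker inspect --format '{{.Name}} {{.State.OOMKilled}}'
```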

Post-Incident

Document which container/process caused the issue
Consider setting container memory limits
Investigate memory leak in application code
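
A memory limit can be applied to a running container with docker update. This is a minimal sketch; the helper name set_mem_limit and the 512m value in the usage line are illustrative, and the right limit depends on the container's observed peak usage.

```shell
# Apply a hard memory limit to a running container.
# Setting --memory-swap to the same value as --memory prevents the
# container from using swap on top of its memory limit.
set_mem_limit() {
  container=$1; limit=$2
  docker update --memory "$limit" --memory-swap "$limit" "$container"
}

# Usage: set_mem_limit <container> 512m
```

A limit set this way lasts for the life of that container. For containers created from a Compose file or a run script, the limit needs to be added there as well so it survives re-creation.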
