Host High Disk Usage Runbook
Alert Details
- Alert Name: HostHighDiskUsage / HostCriticalDiskUsage
- Severity: Warning (>80%) / Critical (>90%)
- Metrics: node_filesystem_avail_bytes, node_filesystem_size_bytes
Initial Assessment
1. Check Disk Usage
# From Prometheus
curl -sG "http://localhost:9090/api/v1/query" --data-urlencode 'query=(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100'
# On host
colima ssh -- df -h
2. Identify Space Consumers
# Docker disk usage summary
docker system df
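# Large directories in Colima (run via sh -c so the glob expands inside the VM; sudo because /var/lib/docker is usually root-only)
colima ssh -- sudo sh -c 'du -sh /var/lib/docker/*'
colima ssh -- sudo sh -c 'du -sh /var/lib/containerd/*'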
# Unused Docker resources
docker system df -v
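The Warning and Critical thresholds from Alert Details can be applied directly to df output. The following is a minimal sketch, not part of the original runbook: classify_df is a hypothetical helper name, and it assumes POSIX-format output (df -P) and mount points without spaces. df computes its percentage slightly differently from the Prometheus expression, so values may differ by a point or two.

```shell
# Classify filesystems from `df -P` output against the alert thresholds
# (Warning > 80%, Critical > 90%). Reads df output on stdin.
# classify_df is a hypothetical helper, not an existing tool.
classify_df() {
  awk 'NR > 1 {                          # skip the df header line
    pct = $5; sub(/%/, "", pct)          # "87%" -> "87"
    if (pct + 0 > 90)      sev = "CRITICAL"
    else if (pct + 0 > 80) sev = "WARNING"
    else                   sev = "OK"
    printf "%s %d%% %s\n", $6, pct, sev  # mount point, usage, severity
  }'
}
```

Possible usage: colima ssh -- df -P | classify_df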
Common Causes
Docker Images Accumulation
- List unused images:
docker images -f "dangling=true"
docker images | grep -v "latest" | tail -20
- Clean up:
docker image prune -f # Remove dangling images
docker image prune -a --filter "until=168h" # Remove unused >7 days
Docker Build Cache
- Check builder cache:
docker builder du
- Clean:
docker builder prune -f
Container Logs
- Find large log files:
docker ps -q | xargs -I {} sh -c 'echo "Container: {}"; docker inspect --format="{{.LogPath}}" {} | xargs colima ssh -- sudo ls -lh'
- Truncate (log paths are inside the Colima VM):
colima ssh -- sudo truncate -s 0 <logfile>
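When many containers are running, ranking the log files by size can be quicker than inspecting them one by one. A minimal sketch, assuming the default json-file logging driver (which writes <container-id>-json.log files under /var/lib/docker/containers) and that the function is run as root inside the Colima VM. largest_logs is a hypothetical helper name.

```shell
# Print the largest Docker json-file logs under a directory, biggest first.
# $1: directory to scan (defaults to the usual Docker container directory)
# $2: number of entries to show (default 10)
# The body runs in a subshell, so its variables do not leak to the caller.
largest_logs() (
  dir="${1:-/var/lib/docker/containers}"
  count="${2:-10}"
  find "$dir" -type f -name '*-json.log' | while IFS= read -r f; do
    # wc -c gives the size in bytes; tr strips padding some platforms add
    printf '%s %s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
  done | sort -rn | head -n "$count"
)
```

Each output line is the size in bytes followed by the log path, which can then be passed to truncate.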
Unused Volumes
- List unused volumes:
docker volume ls -f "dangling=true"
- Clean:
docker volume prune -f
Database Growth
- Check PostgreSQL size:
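SELECT pg_size_pretty(pg_database_size('ravenmaskos'));
- Consider vacuum:
VACUUM FULL; -- takes an exclusive lock on each table while rewriting it (plan for downtime) and needs free space for a temporary copy of the table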
Remediation Steps
Safe Actions (Automated)
- Remove dangling Docker images
- Prune Docker build cache
- Remove unused Docker volumes
- Truncate container logs
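The safe actions above could be chained into one script. This is a sketch under stated assumptions rather than the automation actually in use: safe_cleanup and the DRY_RUN variable are hypothetical names, dry-run is the default here, and log truncation is left out because it needs a specific log path.

```shell
# Run the "safe" prune commands from this runbook in order.
# DRY_RUN=1 (the default in this sketch) only prints what would run;
# set DRY_RUN=0 to execute the commands.
safe_cleanup() {
  for cmd in \
    "docker image prune -f" \
    "docker builder prune -f" \
    "docker volume prune -f"
  do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "would run: $cmd"
    else
      echo "running: $cmd"
      # Intentionally unquoted so the string splits into command and arguments
      $cmd || echo "failed: $cmd" >&2
    fi
  done
}
```

Possible usage: run safe_cleanup first to review the list, then DRY_RUN=0 safe_cleanup to apply it.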
Requires Human Approval
- Remove specific Docker images
- Vacuum PostgreSQL database
- Resize Colima disk
- Delete application data
Disk Resize (Colima)
If cleanup is insufficient, resize the Colima VM disk:
# Stop Colima
colima stop
# Resize (example: increase to 100GB)
colima start --disk 100
# Verify
colima ssh -- df -h
Escalation Criteria
- Disk usage still above 95% after automated cleanup
- Database storage growing rapidly
- Unable to identify space consumer
- Application data deletion required
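The first criterion is the only one that is purely numeric, so it is the easiest to check from a script. A minimal sketch: needs_escalation is a hypothetical helper name, and the usage percentage is assumed to come from the Prometheus query or df output gathered during Initial Assessment.

```shell
# Exit status 0 (true) if the post-cleanup usage is above the 95% threshold.
# $1: disk usage percent after cleanup, integer or decimal (for example 96.2)
needs_escalation() {
  # awk handles the decimal comparison, which plain `[ ... -gt ... ]` cannot
  awk -v pct="$1" 'BEGIN { exit !(pct + 0 > 95) }'
}
```

Possible usage: needs_escalation 96.2 && echo "escalate"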
Post-Incident
- Set up alerting for specific containers if one of them caused the issue
- Configure log rotation if logs were the problem
- Schedule regular cleanup jobs in n8n
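For the log rotation item, Docker's json-file driver supports the max-size and max-file options. The sketch below only prints a daemon configuration fragment with those options. print_log_rotation_config is a hypothetical helper, the 10m and 3 defaults are illustrative values rather than recommendations from this runbook, and where the fragment belongs in a Colima setup is left for the operator to confirm.

```shell
# Print a Docker daemon config fragment that enables json-file log rotation.
# $1: maximum size of one log file (default 10m, an illustrative value)
# $2: number of rotated files to keep (default 3, an illustrative value)
# The body runs in a subshell, so its variables do not leak to the caller.
print_log_rotation_config() (
  max_size="${1:-10m}"
  max_file="${2:-3}"
  cat <<EOF
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "$max_size",
    "max-file": "$max_file"
  }
}
EOF
)
```

Daemon-level logging defaults apply only to containers created after the change, so existing containers need to be recreated to pick them up.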