Host High Disk Usage Runbook

Alert Details

  • Alert Name: HostHighDiskUsage / HostCriticalDiskUsage
  • Severity: Warning (>80%) / Critical (>90%)
  • Metrics: node_filesystem_avail_bytes, node_filesystem_size_bytes

Initial Assessment

1. Check Disk Usage

# From Prometheus (curl URL-encodes the PromQL expression)
curl -sG "http://localhost:9090/api/v1/query" --data-urlencode 'query=(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100'

# Inside the Colima VM (where Docker stores its data)
colima ssh -- df -h
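
The Prometheus expression above is plain arithmetic on two byte counts. A minimal sketch of that calculation and of the Warning/Critical thresholds from Alert Details follows. The function names disk_usage_pct and classify_usage are hypothetical helpers, not part of any tool, and integer division truncates the percentage.

```shell
# Hypothetical helpers; names are illustrative only.

# Print the used-space percentage (truncated to an integer)
# from available bytes ($1) and total bytes ($2).
disk_usage_pct() {
  echo $(( ($2 - $1) * 100 / $2 ))
}

# Map a percentage to the runbook's severities:
# Warning (>80%), Critical (>90%), otherwise ok.
classify_usage() {
  if [ "$1" -gt 90 ]; then
    echo critical
  elif [ "$1" -gt 80 ]; then
    echo warning
  else
    echo ok
  fi
}
```

For example, classify_usage "$(disk_usage_pct 15 100)" reports a warning, since 85% is above 80% but not above 90%.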

2. Identify Space Consumers

# Docker disk usage summary
docker system df

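# Large directories in Colima (quoted so the glob expands inside the VM; sudo because the directories are root-owned)
colima ssh -- sudo sh -c 'du -sh /var/lib/docker/*'
colima ssh -- sudo sh -c 'du -sh /var/lib/containerd/*'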

# Detailed per-image, per-container, and per-volume breakdown
docker system df -v
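
To rank the directories reported by du, the output can be sorted by size. A sketch, assuming du is run with -k so sizes are plain KiB integers (human-readable -h output does not sort correctly with sort -n). top_consumers is a hypothetical helper name.

```shell
# Hypothetical helper: read `du -sk` style lines ("<KiB><TAB><path>") on stdin
# and print the N largest entries, biggest first (default N=5).
top_consumers() {
  sort -rn | head -n "${1:-5}"
}

# Example against the Colima VM (requires colima; not run here):
#   colima ssh -- sudo sh -c 'du -sk /var/lib/docker/*' | top_consumers 3
```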

Common Causes

Docker Images Accumulation

  1. List dangling images and images not tagged latest (cleanup candidates, not necessarily unused):
    docker images -f "dangling=true"
    docker images | grep -v "latest" | tail -20
  2. Clean up:
    docker image prune -f  # Remove dangling images
    docker image prune -a --filter "until=168h"  # Remove unused images created more than 7 days ago; prompts unless -f is added
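
The 168h value is 7 days expressed in hours, because the until filter takes Go-style durations, which have no day unit. A small sketch of the conversion; until_filter is a hypothetical helper name.

```shell
# Hypothetical helper: convert a retention period in days ($1)
# to the value of Docker's `until` prune filter, expressed in hours.
until_filter() {
  echo "until=$(( $1 * 24 ))h"
}

# Example (requires docker; prompts for confirmation without -f):
#   docker image prune -a --filter "$(until_filter 7)"
```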

Docker Build Cache

  1. Check builder cache: docker buildx du (docker system df also reports a Build Cache row)
  2. Clean: docker builder prune -f

Container Logs

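  1. Find large log files (log paths are inside the Colima VM, so list them there):
    docker ps -q | xargs -I {} sh -c 'echo "Container: {}"; colima ssh -- sudo ls -lh "$(docker inspect --format="{{.LogPath}}" {})"'
  2. Truncate (inside the VM): colima ssh -- sudo truncate -s 0 <logfile>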
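
Truncation can be limited to logs above a size threshold. The sketch below is one way to do that, assuming the logs are *.log files under a single directory (with the json-file logging driver that is typically /var/lib/docker/containers inside the VM, which needs root). truncate_large_logs is a hypothetical helper name.

```shell
# Hypothetical helper: empty every *.log file under DIR ($1) that is larger
# than MAX_KB ($2) KiB. Files are truncated in place rather than deleted,
# so a process holding the file open keeps writing to the same file.
truncate_large_logs() {
  find "$1" -type f -name '*.log' -size +"$2"k -exec sh -c '
    for f in "$@"; do
      : > "$f"              # truncate to zero bytes
      echo "truncated $f"
    done
  ' sh {} +
}
```

Run it as root inside the VM, for example with a 100 MiB threshold: truncate_large_logs /var/lib/docker/containers 102400.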

Unused Volumes

  1. List unused volumes: docker volume ls -f "dangling=true"
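  2. Clean: docker volume prune -f (on Docker 23.0 and later this removes only anonymous volumes; -a also includes unused named volumes)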

Database Growth

  1. Check PostgreSQL size:
    SELECT pg_size_pretty(pg_database_size('ravenmaskos'));
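  2. Consider vacuum: VACUUM FULL; (takes an exclusive lock on each table while it is rewritten, so expect downtime, and temporarily needs free disk space for the new copy of the table)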

Remediation Steps

Safe Actions (Automated)

  • Remove dangling Docker images
  • Prune Docker build cache
  • Remove unused Docker volumes
  • Truncate container logs
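
The automated actions could be wrapped in a dry-run script so each command is visible before it runs. A minimal sketch: safe_cleanup and the APPLY variable are hypothetical names, and log truncation is left out because it needs a specific log path.

```shell
# Hypothetical wrapper: print each safe cleanup command and execute it
# only when APPLY=1 is set. Dry-run is the default.
safe_cleanup() {
  for cmd in \
    "docker image prune -f" \
    "docker builder prune -f" \
    "docker volume prune -f"
  do
    echo "+ $cmd"
    if [ "${APPLY:-0}" = "1" ]; then
      sh -c "$cmd"
    fi
  done
}
```

Calling safe_cleanup prints the three commands without running them. Setting APPLY=1 before the call executes them.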

Requires Human Approval

  • Remove specific Docker images
  • Vacuum PostgreSQL database
  • Resize Colima disk
  • Delete application data
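
If remediation is automated, the split between the two lists can be encoded as a lookup that defaults to human approval. A sketch under that assumption: action_policy and the action identifiers are hypothetical names chosen for illustration.

```shell
# Hypothetical policy lookup mirroring the two lists above.
# Prints "automated" for the safe actions and "approval" for everything else,
# so an unknown action is never run without a human.
action_policy() {
  case "$1" in
    prune-dangling-images|prune-build-cache|prune-volumes|truncate-logs)
      echo automated ;;
    *)
      echo approval ;;
  esac
}
```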

Disk Resize (Colima)

If cleanup is insufficient, resize the Colima VM disk:

# Stop Colima
colima stop

# Resize (example: increase to 100GB)
colima start --disk 100

# Verify
colima ssh -- df -h

Escalation Criteria

  • Disk > 95% after automated cleanup
  • Database storage growing rapidly
  • Unable to identify space consumer
  • Application data deletion required

Post-Incident

  1. Set up alerting for specific containers if one caused the issue
  2. Configure log rotation if logs were the problem
  3. Schedule regular cleanup jobs in n8n
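
For the log rotation step, one option is per-container limits on the json-file logging driver, which supports max-size and max-file. A sketch: log_rotation_opts is a hypothetical helper, and the 10m and 3 defaults are illustrative values, not recommendations from this runbook.

```shell
# Hypothetical helper: build per-container log rotation flags for `docker run`.
# $1 = maximum size of one log file (default 10m), $2 = number of files kept (default 3).
log_rotation_opts() {
  echo "--log-opt max-size=${1:-10m} --log-opt max-file=${2:-3}"
}

# Example in sh/bash (requires docker; <image> is a placeholder):
#   docker run -d $(log_rotation_opts 10m 3) <image>
```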