# Runbook: Vidar SRE Agent OpenAI Rate Limiting

## Purpose

Resolve OpenAI API rate limit errors (429) affecting the Vidar SRE agent's ability to investigate and remediate alerts.

## Symptoms

- SRE agent runs fail with status: `failed`
- Logs show: `openai.RateLimitError: Error code: 429 - Rate limit reached for gpt-4o`
- Multiple agents spawning and hitting TPM (tokens per minute) limits
- Alert investigation not completing
## Prerequisites
- SSH access to odin as ravenhelm
- Access to vidar-api container
- OpenAI API account access (optional - for limit increases)
## Diagnosis

### Step 1: Identify Rate Limit Errors

```bash
# Check vidar-api logs for rate limit errors
ssh ravenhelm@100.115.101.81 "docker logs vidar-api 2>&1 | grep -i '429\|rate limit' | tail -10"
```
Example output:

```
openai.RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-XXX on tokens per min (TPM): Limit 30000, Used 29376, Requested 2006.'}}
```
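The request fails because tokens already used plus tokens requested (29,376 + 2,006 = 31,382) would exceed the 30,000 TPM window. Brief spikes like this can also be absorbed at the call site with retry-and-backoff. A minimal sketch, assuming the agent uses the official `openai` Python SDK (the `RateLimitError` in the logs suggests it does); the helper below is illustrative, not Vidar's actual code:

```python
import random
import time

import openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chat_with_backoff(messages, model="gpt-4o", max_retries=5):
    """Retry on 429s with exponential backoff plus jitter (illustrative helper)."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except openai.RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Sleep 1s, 2s, 4s, ... plus jitter so concurrent agents don't retry in lockstep
            time.sleep(2 ** attempt + random.random())
```

Backoff only smooths transient spikes; sustained demand above the TPM limit still needs the concurrency and model changes in the Procedure below.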
### Step 2: Check Current Configuration

```bash
# Check agent scheduler settings
ssh ravenhelm@100.115.101.81 "docker logs vidar-api 2>&1 | grep 'Agent scheduler started'"
```
Example output:

```
Agent scheduler started: interval=60s, max_concurrent=10, cooldown=30m
```
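Vidar's scheduler internals aren't reproduced here, but conceptually `max_concurrent` caps how many investigations run at once, which in turn bounds how many LLM calls can land inside any one-minute window. A toy model of that gating (names and timings are illustrative):

```python
import asyncio
import random

MAX_CONCURRENT = 10  # corresponds to max_concurrent in the scheduler log line

sem = asyncio.Semaphore(MAX_CONCURRENT)

async def investigate(alert: str) -> None:
    # Stand-in for a real agent run, which makes 4-6 LLM calls (see Calculation).
    await asyncio.sleep(random.uniform(0.1, 0.3))

async def run_agent(alert: str) -> None:
    async with sem:  # at most MAX_CONCURRENT investigations in flight
        await investigate(alert)

async def main() -> None:
    alerts = [f"alert-{i}" for i in range(20)]
    await asyncio.gather(*(run_agent(a) for a in alerts))

asyncio.run(main())
```

Lowering this cap (Option A below) directly lowers the worst-case token demand per minute.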
### Step 3: Check Agent Model

```bash
# Check which model is being used
ssh ravenhelm@100.115.101.81 "grep SRE_AGENT_MODEL /Users/ravenhelm/ravenhelm/services/vidar/docker-compose.yml"
```
## Procedure

### Step 1: Choose a Mitigation

#### Option A: Reduce Concurrent Agents (Recommended)

Check whether the settings already exist in docker-compose.yml:

```bash
ssh ravenhelm@100.115.101.81 "grep -A5 'SRE_AGENT_ENABLED' /Users/ravenhelm/ravenhelm/services/vidar/docker-compose.yml"
```

If they don't, add or update them under the vidar-api service:
```yaml
environment:
  - SRE_AGENT_ENABLED=true
  - SRE_AGENT_MAX_CONCURRENT=3   # Reduce from default 10
  - SRE_AGENT_MODEL=gpt-4o-mini  # Use model with higher limits
```
#### Option B: Switch to gpt-4o-mini

The gpt-4o-mini model has higher rate limits and lower cost:

```yaml
# Add to docker-compose.yml under vidar-api environment
- SRE_AGENT_MODEL=gpt-4o-mini
```
#### Option C: Switch to Anthropic (Requires Code Changes)
For organizations with Anthropic API access, the SRE agent can be modified to use Claude instead.
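A minimal sketch of what the swap looks like at the call site, assuming the `anthropic` Python SDK is available; the model choice and prompt are illustrative:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # illustrative; use whichever Claude model fits
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize the failing alert context."}],
)
print(response.content[0].text)
```

Actual integration means replacing every OpenAI call path in the agent, so treat this as a scoping aid rather than a drop-in change.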
### Step 2: Apply Changes

```bash
# Restart vidar-api with new configuration
ssh ravenhelm@100.115.101.81 "cd /Users/ravenhelm/ravenhelm/services/vidar && docker compose up -d vidar-api"
```
### Step 3: Verification

```bash
# Check new configuration is applied
ssh ravenhelm@100.115.101.81 "docker logs vidar-api 2>&1 | grep 'Agent scheduler started'"

# Monitor for rate limit errors (the count should decrease or drop to zero)
ssh ravenhelm@100.115.101.81 "docker logs vidar-api --tail 50 2>&1 | grep -c '429'"
```
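To confirm the fix holds over time rather than at a single glance, a small polling helper can track the 429 count. A hypothetical sketch (host and container name are from this runbook; the five-minute window and polling interval are arbitrary):

```python
import subprocess
import time

# Count 429s in the last five minutes of vidar-api logs, repeatedly.
CMD = (
    "ssh ravenhelm@100.115.101.81 "
    "\"docker logs vidar-api --since 5m 2>&1 | grep -c '429'\""
)

while True:
    result = subprocess.run(CMD, shell=True, capture_output=True, text=True)
    count = result.stdout.strip() or "0"
    print(f"{time.strftime('%H:%M:%S')} 429s in last 5m: {count}")
    time.sleep(300)  # a steady 0 after the restart means the fix is holding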
Success criteria:

- Agent scheduler shows reduced `max_concurrent`
- No new 429 errors in logs
- SRE agent runs completing successfully
## Rate Limit Reference

Limits vary by organization and usage tier; the gpt-4o figure below matches the 30,000 TPM limit seen in the error above. Confirm your organization's actual limits in the OpenAI dashboard.

| Model | TPM Limit | RPM Limit | Recommended max_concurrent |
|---|---|---|---|
| gpt-4o | 30,000 | 500 | 2-3 |
| gpt-4o-mini | 200,000 | 500 | 5-10 |
| gpt-4-turbo | 30,000 | 500 | 2-3 |
## Calculation
Each SRE agent run typically consumes:
- 4-6 LLM calls (context, metrics, logs, analysis, decision)
- ~2,000-4,000 tokens per call
- Total: ~10,000-20,000 tokens per run
With max_concurrent=10 and 60s intervals, worst case:
- 10 agents × 20,000 tokens = 200,000 TPM demand
- gpt-4o limit: 30,000 TPM = guaranteed rate limiting
With max_concurrent=3:
- 3 agents × 20,000 tokens = 60,000 TPM demand
- With gpt-4o-mini (200,000 TPM): comfortable headroom
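The same arithmetic as a quick sanity check, easy to adapt when tuning the knobs (the function name is just for illustration):

```python
def worst_case_tpm(max_concurrent: int, tokens_per_run: int = 20_000,
                   interval_s: int = 60) -> float:
    """Worst-case token demand per minute if every slot turns over each interval."""
    runs_per_minute = max_concurrent * (60 / interval_s)
    return runs_per_minute * tokens_per_run

print(worst_case_tpm(10))  # 200000.0 -> far above gpt-4o's 30,000 TPM
print(worst_case_tpm(3))   # 60000.0  -> fits within gpt-4o-mini's 200,000 TPM
```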
## Rollback

```bash
# Remove rate limiting settings or restore previous values
ssh ravenhelm@100.115.101.81 "vim /Users/ravenhelm/ravenhelm/services/vidar/docker-compose.yml"
# Remove SRE_AGENT_MAX_CONCURRENT and SRE_AGENT_MODEL lines

# Restart
ssh ravenhelm@100.115.101.81 "cd /Users/ravenhelm/ravenhelm/services/vidar && docker compose up -d vidar-api"
```
## Long-term Solutions

- Request OpenAI limit increase: Contact OpenAI to increase TPM limits
- Implement token budgeting: Track token usage and pause spawning when near limits (see the sketch after this list)
- Use multiple API keys: Distribute load across multiple OpenAI organizations
- Implement caching: Cache common investigation results to reduce API calls
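A hypothetical sketch of the token-budgeting idea; Vidar does not currently implement this, and the class, names, and thresholds are illustrative:

```python
import threading
import time

class TokenBudget:
    """Sliding one-minute window of token usage; spawn only when headroom remains."""

    def __init__(self, tpm_limit: int = 30_000):
        self.tpm_limit = tpm_limit
        self._events: list[tuple[float, int]] = []  # (timestamp, tokens used)
        self._lock = threading.Lock()

    def record(self, tokens: int) -> None:
        """Call after each LLM response with the usage the API reports."""
        with self._lock:
            self._events.append((time.time(), tokens))

    def used_last_minute(self) -> int:
        cutoff = time.time() - 60
        with self._lock:
            self._events = [(t, n) for t, n in self._events if t > cutoff]
            return sum(n for _, n in self._events)

    def can_spawn(self, expected_tokens: int = 20_000) -> bool:
        # Pause spawning when a new run's worst case would blow the window.
        return self.used_last_minute() + expected_tokens <= self.tpm_limit

budget = TokenBudget(tpm_limit=30_000)
if budget.can_spawn():
    print("headroom available: spawn the next agent run")
    budget.record(12_000)  # example: actual usage from the API response
```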
## Escalation
If rate limiting continues after applying fixes:
- Check OpenAI API status page for outages
- Review agent investigation prompts to reduce per-run token usage
- Consider switching to Anthropic Claude
- Contact: Platform team