Runbook: Vidar SRE Agent OpenAI Rate Limiting

Purpose

Resolve OpenAI API rate limit errors (429) affecting the Vidar SRE agent's ability to investigate and remediate alerts.

Symptoms

  • SRE agent runs fail with status: failed
  • Logs show: openai.RateLimitError: Error code: 429 - Rate limit reached for gpt-4o
  • Multiple agents spawning and hitting TPM (tokens per minute) limits
  • Alert investigation not completing
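
A quick way to confirm these symptoms from the logs (the exact log wording is an assumption based on the examples in Diagnosis below):

# Show recent failed runs and rate-limit errors together
ssh ravenhelm@100.115.101.81 "docker logs vidar-api --tail 200 2>&1 | grep -Ei 'status: failed|rate limit|429' | tail -20"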

Prerequisites

  • SSH access to odin as ravenhelm
  • Access to vidar-api container
  • OpenAI API account access (optional; only needed for limit increases)

Diagnosis

Step 1: Identify Rate Limit Errors

# Check vidar-api logs for rate limit errors
ssh ravenhelm@100.115.101.81 "docker logs vidar-api 2>&1 | grep -i '429\|rate limit' | tail -10"

Example output:

openai.RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-XXX on tokens per min (TPM): Limit 30000, Used 29376, Requested 2006.'}}
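
To gauge how frequent the errors are, tally them per hour. This relies on Docker's --timestamps prefix (RFC 3339 format); cut keeps the date plus the hour:

# Count rate-limit errors per hour
ssh ravenhelm@100.115.101.81 "docker logs vidar-api --timestamps 2>&1 | grep -i 'rate limit' | cut -c1-13 | sort | uniq -c"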

Step 2: Check Current Configuration

# Check agent scheduler settings
ssh ravenhelm@100.115.101.81 "docker logs vidar-api 2>&1 | grep 'Agent scheduler started'"

Example output:

Agent scheduler started: interval=60s, max_concurrent=10, cooldown=30m
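
If the container has been up for a while, this line may appear several times; only the most recent one reflects the current configuration:

# Show only the latest scheduler startup line
ssh ravenhelm@100.115.101.81 "docker logs vidar-api 2>&1 | grep 'Agent scheduler started' | tail -1"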

Step 3: Check Agent Model

# Check which model is being used
ssh ravenhelm@100.115.101.81 "grep SRE_AGENT_MODEL /Users/ravenhelm/ravenhelm/services/vidar/docker-compose.yml"
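
If the grep returns nothing, the variable is unset and the agent falls back to its built-in default (the 429 errors above suggest that default is gpt-4o):

# Report explicitly when the model override is absent
ssh ravenhelm@100.115.101.81 "grep SRE_AGENT_MODEL /Users/ravenhelm/ravenhelm/services/vidar/docker-compose.yml || echo 'SRE_AGENT_MODEL not set; default model in use'"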

Procedure

Step 1: Choose a Mitigation

Option A: Reduce Agent Concurrency

Check whether the settings already exist:

ssh ravenhelm@100.115.101.81 "grep -A5 'SRE_AGENT_ENABLED' /Users/ravenhelm/ravenhelm/services/vidar/docker-compose.yml"

If they don't, add them to docker-compose.yml under the vidar-api environment block:

environment:
  - SRE_AGENT_ENABLED=true
  - SRE_AGENT_MAX_CONCURRENT=3  # reduce from the default of 10
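
Before restarting, validate the edited file; docker compose config parses it and fails on YAML errors:

# Sanity-check the compose file after editing (prints nothing on success)
ssh ravenhelm@100.115.101.81 "cd /Users/ravenhelm/ravenhelm/services/vidar && docker compose config --quiet && echo 'compose file OK'"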

Option B: Switch to gpt-4o-mini

The gpt-4o-mini model has higher rate limits and lower cost:

# Add to docker-compose.yml under vidar-api environment
- SRE_AGENT_MODEL=gpt-4o-mini
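
If you prefer a scripted edit, a sed one-liner works; odin appears to be macOS given the /Users path, hence BSD sed's -i '' (review the change before restarting):

# Switch the model in place; the $ anchor avoids rewriting an existing gpt-4o-mini entry
# and assumes the line has no trailing comment
ssh ravenhelm@100.115.101.81 "sed -i '' 's/SRE_AGENT_MODEL=gpt-4o$/SRE_AGENT_MODEL=gpt-4o-mini/' /Users/ravenhelm/ravenhelm/services/vidar/docker-compose.yml"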

Option C: Switch to Anthropic (Requires Code Changes)

For organizations with Anthropic API access, the SRE agent can be modified to use Claude instead. Unlike Options A and B, this is a code change in vidar-api, not a configuration change.

Step 2: Apply Changes

# Restart vidar-api with new configuration
ssh ravenhelm@100.115.101.81 "cd /Users/ravenhelm/ravenhelm/services/vidar && docker compose up -d vidar-api"
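
Confirm the container actually recreated; the Status column should show an uptime of seconds or minutes:

# Check the container restarted with the new configuration
ssh ravenhelm@100.115.101.81 "docker ps --filter name=vidar-api --format '{{.Names}} {{.Status}}'"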

Step 3: Verification

# Check new configuration is applied
ssh ravenhelm@100.115.101.81 "docker logs vidar-api 2>&1 | grep 'Agent scheduler started'"

# Monitor for rate limit errors (should decrease or disappear)
ssh ravenhelm@100.115.101.81 "docker logs vidar-api --tail 50 2>&1 | grep -c '429'"
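
Counting across the whole log also counts pre-fix errors, so a time-bounded check is a better post-fix signal (grep -c exits non-zero on a zero count, hence the || true):

# Count only 429s from the last 10 minutes; expect 0 after the fix
ssh ravenhelm@100.115.101.81 "docker logs vidar-api --since 10m 2>&1 | grep -c '429' || true"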

Success criteria:

  • Agent scheduler shows reduced max_concurrent
  • No new 429 errors in logs
  • SRE agent runs completing successfully

Rate Limit Reference

Model          TPM Limit   RPM Limit   Recommended max_concurrent
gpt-4o         30,000      500         2-3
gpt-4o-mini    200,000     500         5-10
gpt-4-turbo    30,000      500         2-3

Calculation

Each SRE agent run typically consumes:

  • 4-6 LLM calls (context, metrics, logs, analysis, decision)
  • ~2,000-4,000 tokens per call
  • Total: ~10,000-20,000 tokens per run

With max_concurrent=10 and 60s intervals, the worst case (all agents active within the same minute) is:

  • 10 agents × 20,000 tokens = 200,000 TPM demand
  • Far above the gpt-4o limit of 30,000 TPM, so rate limiting is guaranteed

With max_concurrent=3:

  • 3 agents × 20,000 tokens = 60,000 TPM demand
  • With gpt-4o-mini (200,000 TPM): comfortable headroom
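
The same arithmetic as a small shell check you can rerun with different settings (all values are the rough estimates above, not measured usage):

# Worst-case TPM demand vs. model limit
max_concurrent=3; tokens_per_run=20000; tpm_limit=200000  # gpt-4o-mini
demand=$((max_concurrent * tokens_per_run))
echo "demand=${demand} TPM, limit=${tpm_limit} TPM, headroom=$((tpm_limit - demand)) TPM"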

Rollback

# Remove rate limiting settings or restore previous values
ssh ravenhelm@100.115.101.81 "vim /Users/ravenhelm/ravenhelm/services/vidar/docker-compose.yml"
# Remove SRE_AGENT_MAX_CONCURRENT and SRE_AGENT_MODEL lines

# Restart
ssh ravenhelm@100.115.101.81 "cd /Users/ravenhelm/ravenhelm/services/vidar && docker compose up -d vidar-api"
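
If the compose file is tracked in git (the repo layout suggests so, but that's an assumption), restoring the last committed version is less error-prone than hand-editing:

# Restore the last committed compose file, then restart as above
ssh ravenhelm@100.115.101.81 "cd /Users/ravenhelm/ravenhelm/services/vidar && git checkout -- docker-compose.yml"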

Long-term Solutions

  1. Request OpenAI limit increase: Contact OpenAI to increase TPM limits
  2. Implement token budgeting: Track token usage and pause spawning when near limits
  3. Use multiple API keys: Distribute load across multiple OpenAI organizations
  4. Implement caching: Cache common investigation results to reduce API calls

Escalation

If rate limiting continues after applying fixes:

  1. Check the OpenAI status page (status.openai.com) for outages or degraded performance
  2. Review agent investigation prompts for optimization
  3. Consider switching to Anthropic Claude
  4. Contact: Platform team