Runbook: Vidar SRE Agent OpenAI Rate Limiting

Purpose

Resolve OpenAI API rate limit errors (429) affecting the Vidar SRE agent's ability to investigate and remediate alerts.

Symptoms

  • SRE agent runs fail with status: failed
  • Logs show: openai.RateLimitError: Error code: 429 - Rate limit reached for gpt-4o
  • Multiple agents spawning and hitting TPM (tokens per minute) limits
  • Alert investigation not completing
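
A quick way to confirm these symptoms from the logs (the exact log wording is an assumption based on the examples in Diagnosis below):

# Show recent failed runs and rate-limit errors together
ssh ravenhelm@100.115.101.81 "docker logs vidar-api --tail 200 2>&1 | grep -Ei 'status: failed|rate limit|429' | tail -20"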

Prerequisites

  • SSH access to odin as ravenhelm
  • Access to vidar-api container
  • OpenAI API account access (optional; only needed for limit increases)

Diagnosis

Step 1: Identify Rate Limit Errors

# Check vidar-api logs for rate limit errors
ssh ravenhelm@100.115.101.81 "docker logs vidar-api 2>&1 | grep -i '429\|rate limit' | tail -10"

Example output:

openai.RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-XXX on tokens per min (TPM): Limit 30000, Used 29376, Requested 2006.'}}
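
To gauge how frequent the errors are, tally them per hour. This relies on Docker's --timestamps prefix (RFC 3339 format); cut keeps the date plus the hour:

# Count rate-limit errors per hour
ssh ravenhelm@100.115.101.81 "docker logs vidar-api --timestamps 2>&1 | grep -i 'rate limit' | cut -c1-13 | sort | uniq -c"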

Step 2: Check Current Configuration

# Check agent scheduler settings
ssh ravenhelm@100.115.101.81 "docker logs vidar-api 2>&1 | grep 'Agent scheduler started'"

Example output:

Agent scheduler started: interval=60s, max_concurrent=10, cooldown=30m
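
If the container has been up for a while, this line may appear several times; only the most recent one reflects the current configuration:

# Show only the latest scheduler startup line
ssh ravenhelm@100.115.101.81 "docker logs vidar-api 2>&1 | grep 'Agent scheduler started' | tail -1"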

Step 3: Check Agent Model

# Check which model is being used
ssh ravenhelm@100.115.101.81 "grep SRE_AGENT_MODEL /Users/ravenhelm/ravenhelm/services/vidar/docker-compose.yml"
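
If the grep returns nothing, the variable is unset and the agent falls back to its built-in default (the 429 errors above suggest that default is gpt-4o):

# Report explicitly when the model override is absent
ssh ravenhelm@100.115.101.81 "grep SRE_AGENT_MODEL /Users/ravenhelm/ravenhelm/services/vidar/docker-compose.yml || echo 'SRE_AGENT_MODEL not set; default model in use'"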

Procedure

Step 1: Choose a Mitigation

Option A: Reduce Agent Concurrency

Check whether the settings already exist:

ssh ravenhelm@100.115.101.81 "grep -A5 'SRE_AGENT_ENABLED' /Users/ravenhelm/ravenhelm/services/vidar/docker-compose.yml"

If they don't, add them to docker-compose.yml under the vidar-api environment block:

environment:
  - SRE_AGENT_ENABLED=true
  - SRE_AGENT_MAX_CONCURRENT=3  # reduce from the default of 10
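
Before restarting, validate the edited file; docker compose config parses it and fails on YAML errors:

# Sanity-check the compose file after editing (prints nothing on success)
ssh ravenhelm@100.115.101.81 "cd /Users/ravenhelm/ravenhelm/services/vidar && docker compose config --quiet && echo 'compose file OK'"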

Option B: Switch to gpt-4o-mini

The gpt-4o-mini model has higher rate limits and lower cost:

# Add to docker-compose.yml under vidar-api environment
- SRE_AGENT_MODEL=gpt-4o-mini
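
If you prefer a scripted edit, a sed one-liner works; odin appears to be macOS given the /Users path, hence BSD sed's -i '' (review the change before restarting):

# Switch the model in place; the $ anchor avoids rewriting an existing gpt-4o-mini entry
# and assumes the line has no trailing comment
ssh ravenhelm@100.115.101.81 "sed -i '' 's/SRE_AGENT_MODEL=gpt-4o$/SRE_AGENT_MODEL=gpt-4o-mini/' /Users/ravenhelm/ravenhelm/services/vidar/docker-compose.yml"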

Option C: Switch to Anthropic (Requires Code Changes)

For organizations with Anthropic API access, the SRE agent can be modified to use Claude instead. Unlike Options A and B, this is a code change in vidar-api, not a configuration change.

Step 2: Apply Changes

# Restart vidar-api with new configuration
ssh ravenhelm@100.115.101.81 "cd /Users/ravenhelm/ravenhelm/services/vidar && docker compose up -d vidar-api"
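
Confirm the container actually recreated; the Status column should show an uptime of seconds or minutes:

# Check the container restarted with the new configuration
ssh ravenhelm@100.115.101.81 "docker ps --filter name=vidar-api --format '{{.Names}} {{.Status}}'"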

Step 3: Verification

# Check new configuration is applied
ssh ravenhelm@100.115.101.81 "docker logs vidar-api 2>&1 | grep 'Agent scheduler started'"

# Monitor for rate limit errors (should decrease or disappear)
ssh ravenhelm@100.115.101.81 "docker logs vidar-api --tail 50 2>&1 | grep -c '429'"
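
Counting across the whole log also counts pre-fix errors, so a time-bounded check is a better post-fix signal (grep -c exits non-zero on a zero count, hence the || true):

# Count only 429s from the last 10 minutes; expect 0 after the fix
ssh ravenhelm@100.115.101.81 "docker logs vidar-api --since 10m 2>&1 | grep -c '429' || true"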

Success criteria:

  • Agent scheduler shows reduced max_concurrent
  • No new 429 errors in logs
  • SRE agent runs completing successfully

Rate Limit Reference

Model          TPM Limit   RPM Limit   Recommended max_concurrent
gpt-4o         30,000      500         2-3
gpt-4o-mini    200,000     500         5-10
gpt-4-turbo    30,000      500         2-3

Calculation

Each SRE agent run typically consumes:

  • 4-6 LLM calls (context, metrics, logs, analysis, decision)
  • ~2,000-4,000 tokens per call
  • Total: ~10,000-20,000 tokens per run

With max_concurrent=10 and 60s intervals, the worst case (all agents active within the same minute) is:

  • 10 agents × 20,000 tokens = 200,000 TPM demand
  • Far above the gpt-4o limit of 30,000 TPM, so rate limiting is guaranteed

With max_concurrent=3:

  • 3 agents × 20,000 tokens = 60,000 TPM demand
  • With gpt-4o-mini (200,000 TPM): comfortable headroom
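
The same arithmetic as a small shell check you can rerun with different settings (all values are the rough estimates above, not measured usage):

# Worst-case TPM demand vs. model limit
max_concurrent=3; tokens_per_run=20000; tpm_limit=200000  # gpt-4o-mini
demand=$((max_concurrent * tokens_per_run))
echo "demand=${demand} TPM, limit=${tpm_limit} TPM, headroom=$((tpm_limit - demand)) TPM"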

Rollback

# Remove rate limiting settings or restore previous values
ssh ravenhelm@100.115.101.81 "vim /Users/ravenhelm/ravenhelm/services/vidar/docker-compose.yml"
# Remove SRE_AGENT_MAX_CONCURRENT and SRE_AGENT_MODEL lines

# Restart
ssh ravenhelm@100.115.101.81 "cd /Users/ravenhelm/ravenhelm/services/vidar && docker compose up -d vidar-api"
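
If the compose file is tracked in git (the repo layout suggests so, but that's an assumption), restoring the last committed version is less error-prone than hand-editing:

# Restore the last committed compose file, then restart as above
ssh ravenhelm@100.115.101.81 "cd /Users/ravenhelm/ravenhelm/services/vidar && git checkout -- docker-compose.yml"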

Long-term Solutions

  1. Request OpenAI limit increase: Contact OpenAI to increase TPM limits
  2. Implement token budgeting: Track token usage and pause spawning when near limits
  3. Use multiple API keys: Distribute load across multiple OpenAI organizations
  4. Implement caching: Cache common investigation results to reduce API calls

Escalation

If rate limiting continues after applying fixes:

  1. Check the OpenAI status page (status.openai.com) for outages or degraded performance
  2. Review agent investigation prompts for optimization
  3. Consider switching to Anthropic Claude
  4. Contact: Platform team