Runbook: Bifrost MCP Gateway Troubleshooting
Overview
- What: Troubleshoot Bifrost MCP gateway issues
- When: Tool calls failing, MCP server connections down, or API errors
- Duration: 5-20 minutes
- Services: bifrost-api, bifrost-admin
Prerequisites
- SSH access to odin as ravenhelm
- Access to Grafana for metrics
- Repo: https://gitlab.ravenhelm.dev/ravenmask/ravenmaskos
Quick Health Check
ssh ravenhelm@100.115.101.81
# Check Bifrost containers
docker ps | grep bifrost
# Quick API health
curl -s https://bifrost.ravenhelm.dev/health | jq
# Check admin UI
curl -s -o /dev/null -w '%{http_code}' https://bifrost-admin.ravenhelm.dev/
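The status code printed by the admin UI check above can be mapped onto the issue sections of this runbook. The following is a minimal sketch: the function name and the category labels are hypothetical, and treating 3xx as healthy is an assumption (for example, a redirect to a login page).

```shell
# Sketch only: classify_http_code is a hypothetical helper, not part of Bifrost.
# It maps an HTTP status code to a rough triage category.
classify_http_code() {
  case "$1" in
    2??|3??) echo "ok" ;;           # assumption: redirects count as healthy
    401|403) echo "auth" ;;         # see Issue 4
    502|504) echo "upstream" ;;     # see Issue 1
    000)     echo "unreachable" ;;  # curl prints 000 when no response was received
    *)       echo "other" ;;
  esac
}

# Usage (performs a real request, so it is left commented out):
# code=$(curl -s -o /dev/null -w '%{http_code}' https://bifrost-admin.ravenhelm.dev/)
# echo "bifrost-admin: $code ($(classify_http_code "$code"))"
```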
Common Issues
Issue 1: API Not Responding
Symptoms: 502/504 errors, tool calls timing out
# Check recent container logs
docker logs --tail 100 bifrost-api
# Check for errors
docker logs bifrost-api 2>&1 | grep -i 'error\|exception\|failed' | tail -20
# Restart if needed
cd ~/ravenhelm/services/bifrost
docker compose restart bifrost-api
Issue 2: MCP Server Connection Failures
Symptoms: Specific tools not working, connection refused errors
# Check MCP server status in database
docker exec -i postgres psql -U ravenhelm -d ravenmaskos -c \
"SELECT name, status, last_seen_at FROM mcp_server_connections ORDER BY last_seen_at DESC"
# Check bifrost logs for MCP errors
docker logs bifrost-api 2>&1 | grep -i 'mcp\|connection\|server' | tail -30
# Verify MCP servers are reachable
# Check Linear MCP (example)
curl -sf https://linear.ravenhelm.dev/health 2>/dev/null || echo 'Not reachable'
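When several MCP servers need checking, the reachability test above can be wrapped in a loop. This is a sketch under assumptions: check_endpoints is a hypothetical helper, each server is assumed to expose a health URL that returns a non-2xx status when unhealthy, and the 5-second timeout is an arbitrary choice.

```shell
# Sketch only: check_endpoints is a hypothetical helper.
# Reads "name url" pairs on stdin and prints one status line per server.
# Returns non-zero if any server failed its check.
check_endpoints() {
  failed=0
  while read -r name url; do
    # Skip blank lines.
    if [ -z "$name" ]; then
      continue
    fi
    # -f makes curl exit non-zero on HTTP errors (4xx/5xx).
    if curl -sf -o /dev/null --max-time 5 "$url"; then
      echo "OK   $name"
    else
      echo "FAIL $name"
      failed=1
    fi
  done
  return "$failed"
}

# Usage (performs a real request, so it is left commented out):
# printf 'linear https://linear.ravenhelm.dev/health\n' | check_endpoints
```

Only the Linear URL comes from this runbook; any further entries would need the real health URLs of the other MCP servers.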
Issue 3: Tool Execution Timeouts
Symptoms: Tools hang and eventually time out
# Check for slow operations
docker logs bifrost-api 2>&1 | grep -i 'timeout\|slow\|duration' | tail -20
# Check tool execution history
docker exec -i postgres psql -U ravenhelm -d ravenmaskos -c \
"SELECT tool_name, status, duration_ms, created_at FROM tool_executions ORDER BY created_at DESC LIMIT 20"
# Check external service health (the actual tool backends)
curl -s https://gitlab.ravenhelm.dev/api/v4/user -H 'PRIVATE-TOKEN: ...' | jq '.username'
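To see which tools are slow rather than scanning the raw execution history, the query output can be aggregated per tool. This is a minimal sketch: summarize_durations is a hypothetical helper, and it assumes psql is run in unaligned, tuples-only mode with a comma separator so each row arrives as tool_name,status,duration_ms.

```shell
# Sketch only: summarize_durations is a hypothetical helper.
# Reads "tool_name,status,duration_ms" lines on stdin and prints
# "tool_name count avg_duration_ms" per tool, slowest average first.
summarize_durations() {
  awk -F',' '
    $3 != "" { sum[$1] += $3; n[$1]++ }   # skip rows with no recorded duration
    END { for (t in sum) printf "%s %d %d\n", t, n[t], sum[t] / n[t] }
  ' | sort -k3,3nr
}

# Usage (needs the postgres container, so it is left commented out):
# docker exec -i postgres psql -U ravenhelm -d ravenmaskos -At -F',' -c \
#   "SELECT tool_name, status, duration_ms FROM tool_executions ORDER BY created_at DESC LIMIT 200" \
#   | summarize_durations
```

A tool with a high average points at its backend service. An average that is high across every tool suggests the problem is in bifrost-api itself.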
Issue 4: Authentication Errors
Symptoms: 401/403 errors, OAuth failures
# Check for auth errors
docker logs bifrost-api 2>&1 | grep -i 'auth\|401\|403\|oauth\|token' | tail -20
# Verify OAuth credentials in vault
docker exec openbao bao kv list secret/bifrost/
# Check Zitadel token endpoint
curl -s https://auth.ravenhelm.dev/.well-known/openid-configuration | jq '.token_endpoint'
Issue 5: High Memory/CPU Usage
Symptoms: Slow responses, container OOM-killed
# Check container resources
docker stats --no-stream bifrost-api bifrost-admin
# Check for memory leaks in logs
docker logs bifrost-api 2>&1 | grep -i 'memory\|oom\|heap' | tail -20
# Restart to clear memory
cd ~/ravenhelm/services/bifrost
docker compose restart
Full Service Restart
cd ~/ravenhelm/services/bifrost
# Graceful restart
docker compose restart
# Full recreate (if issues persist)
docker compose down
docker compose up -d
# Watch logs
docker compose logs -f --tail 50
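After a restart or recreate, the API may take a few seconds to come back, so a single health request can fail spuriously. The sketch below polls until a command succeeds. The retry helper is hypothetical, and the usage line assumes the /health endpoint returns a non-2xx status while the service is not ready.

```shell
# Sketch only: retry is a hypothetical helper, written for POSIX sh.
# Usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Usage (performs real requests, so it is left commented out):
# retry 10 3 curl -sf -o /dev/null https://bifrost.ravenhelm.dev/health \
#   && echo 'bifrost-api healthy' || echo 'bifrost-api still unhealthy'
```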
Verification
# Test API health
curl -s https://bifrost.ravenhelm.dev/health | jq
# Test a simple tool call (list tools)
curl -s https://bifrost.ravenhelm.dev/tools | jq '.tools | length'
# Check admin UI is accessible (prints the HTTP status code)
curl -s -o /dev/null -w '%{http_code}' https://bifrost-admin.ravenhelm.dev/
Escalation
If issues persist:
- Check if upstream MCP servers are healthy
- Review recent changes to Bifrost configuration
- Check Grafana for anomalous metrics patterns