Runbook: Bifrost MCP Gateway Troubleshooting

Overview

  • What: Troubleshoot Bifrost MCP gateway issues
  • When: Tool calls failing, MCP server connections down, or API errors
  • Duration: 5-20 minutes
  • Services: bifrost-api, bifrost-admin

Prerequisites

Quick Health Check

ssh ravenhelm@100.115.101.81

# Check Bifrost containers
docker ps | grep bifrost

# Quick API health
curl -s https://bifrost.ravenhelm.dev/health | jq

# Check admin UI
curl -s -o /dev/null -w '%{http_code}' https://bifrost-admin.ravenhelm.dev/
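
The status code printed by the admin UI check is easier to act on when it is mapped to a next step. The following is a minimal sketch, not part of Bifrost itself: the function name classify_status and its category labels are made up for illustration.

```shell
# Map an HTTP status code to a rough triage category.
# The categories are illustrative; adjust them to match your proxy's behaviour.
classify_status() {
  case "$1" in
    2??|3??)     echo "ok" ;;
    401|403)     echo "auth" ;;         # see Issue 4
    502|503|504) echo "upstream" ;;     # see Issue 1
    000)         echo "unreachable" ;;  # curl prints 000 when it could not connect
    *)           echo "other" ;;
  esac
}

# Usage (on the host):
#   code=$(curl -s -o /dev/null -w '%{http_code}' https://bifrost-admin.ravenhelm.dev/)
#   echo "bifrost-admin: $code ($(classify_status "$code"))"
```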

Common Issues

Issue 1: API Not Responding

Symptoms: 502/504 errors, tool calls timing out

# Check recent container logs
docker logs --tail 100 bifrost-api

# Check for errors
docker logs bifrost-api 2>&1 | grep -i 'error\|exception\|failed' | tail -20

# Restart if needed
cd ~/ravenhelm/services/bifrost
docker compose restart bifrost-api
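
After a restart the API may take a few seconds to come back, so a single curl straight away can give a false alarm. A small retry helper can cover that gap. This is a sketch under assumptions: wait_for is a hypothetical helper name, and the attempt count and delay in the usage line are arbitrary starting values.

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: wait_for <attempts> <delay_seconds> <command> [args...]
wait_for() {
  local attempts=$1 delay=$2 i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "still failing after $attempts attempt(s)" >&2
  return 1
}

# Usage (after the restart; -f makes curl fail on HTTP 4xx/5xx):
#   wait_for 10 3 curl -sf https://bifrost.ravenhelm.dev/health
```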

Issue 2: MCP Server Connection Failures

Symptoms: Specific tools not working, connection refused errors

# Check MCP server status in database
docker exec -i postgres psql -U ravenhelm -d ravenmaskos -c \
"SELECT name, status, last_seen_at FROM mcp_server_connections ORDER BY last_seen_at DESC"

# Check bifrost logs for MCP errors
docker logs bifrost-api 2>&1 | grep -i 'mcp\|connection\|server' | tail -30

# Verify MCP servers are reachable
# Check Linear MCP (example)
curl -sf https://linear.ravenhelm.dev/health 2>/dev/null || echo 'Not reachable or unhealthy'
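
When several MCP servers sit behind the gateway, the same probe can be run over a list. The sketch below assumes each server exposes a health URL like the Linear example. Only the Linear URL comes from this runbook; probe and check_servers are hypothetical helper names.

```shell
# Probe one URL; succeeds only on a 2xx/3xx response within 5 seconds.
probe() {
  curl -sf --max-time 5 -o /dev/null "$1"
}

# Read "name url" pairs from stdin and report reachability for each.
# Returns non-zero if any server failed the probe.
check_servers() {
  local name url failed=0
  while read -r name url; do
    if [ -z "$name" ]; then continue; fi   # skip blank lines
    if probe "$url"; then
      echo "$name: reachable"
    else
      echo "$name: NOT reachable"
      failed=1
    fi
  done
  return "$failed"
}

# Usage (add one line per MCP server you run):
#   check_servers <<'EOF'
#   linear https://linear.ravenhelm.dev/health
#   EOF
```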

Issue 3: Tool Execution Timeouts

Symptoms: Tools hang and eventually time out

# Check for slow operations
docker logs bifrost-api 2>&1 | grep -i 'timeout\|slow\|duration' | tail -20

# Check tool execution history
docker exec -i postgres psql -U ravenhelm -d ravenmaskos -c \
"SELECT tool_name, status, duration_ms, created_at FROM tool_executions ORDER BY created_at DESC LIMIT 20"

# Check external service health (the actual tool backends)
curl -s https://gitlab.ravenhelm.dev/api/v4/user -H 'PRIVATE-TOKEN: ...' | jq '.username'
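
The execution history is easier to scan when only the slow rows are shown. The sketch below filters rows from the tool_executions query by duration. It assumes duration_ms is numeric. The 5000 ms threshold and the helper name slow_tools are illustrative.

```shell
# Print executions slower than a threshold.
# Input: "tool_name|status|duration_ms" rows on stdin.
# Usage: slow_tools <threshold_ms>
slow_tools() {
  awk -F'|' -v limit="$1" '$3 + 0 > limit + 0 { printf "%s %s %dms\n", $1, $2, $3 }'
}

# Usage (psql -A -t prints unaligned, header-free rows separated by "|"):
#   docker exec -i postgres psql -U ravenhelm -d ravenmaskos -At -c \
#     "SELECT tool_name, status, duration_ms FROM tool_executions ORDER BY created_at DESC LIMIT 200" \
#     | slow_tools 5000
```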

Issue 4: Authentication Errors

Symptoms: 401/403 errors, OAuth failures

# Check for auth errors
docker logs bifrost-api 2>&1 | grep -i 'auth\|401\|403\|oauth\|token' | tail -20

# Verify OAuth credentials in vault
docker exec openbao bao kv list secret/bifrost/

# Check Zitadel token endpoint
curl -s https://auth.ravenhelm.dev/.well-known/openid-configuration | jq '.token_endpoint'
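
A quick tally of 401 versus 403 responses helps separate missing or expired credentials (401) from permission problems (403). The sketch below counts whole-word matches in log lines. It assumes status codes appear as bare numbers in the Bifrost logs, which this runbook does not confirm, and the same numbers in other fields would also be counted. count_auth_codes is a hypothetical helper name.

```shell
# Count whole-word 401 and 403 occurrences on stdin.
# Prints one "code: count" line per code that was seen.
count_auth_codes() {
  grep -owE '401|403' | sort | uniq -c | awk '{ print $2 ": " $1 }'
}

# Usage (last hour of logs):
#   docker logs --since 1h bifrost-api 2>&1 | count_auth_codes
```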

Issue 5: High Memory/CPU Usage

Symptoms: Slow responses, container OOM-killed

# Check container resources
docker stats --no-stream bifrost-api bifrost-admin

# Check for memory leaks in logs
docker logs bifrost-api 2>&1 | grep -i 'memory\|oom\|heap' | tail -20

# Restart to clear memory
cd ~/ravenhelm/services/bifrost
docker compose restart
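
Before restarting, it can help to record which container is actually near its limit. The sketch below flags containers above a memory percentage, using the --format option of docker stats. The 80 percent threshold and the helper name flag_high_mem are illustrative. If a container has no memory limit configured, the percentage is relative to host memory.

```shell
# Print containers whose memory percentage exceeds a threshold.
# Input rows look like "bifrost-api 42.10%".
# Usage: flag_high_mem <percent>
flag_high_mem() {
  awk -v limit="$1" '{
    pct = $2
    sub(/%$/, "", pct)                       # strip the trailing percent sign
    if (pct + 0 > limit + 0) print $1 ": " pct "%"
  }'
}

# Usage:
#   docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' bifrost-api bifrost-admin \
#     | flag_high_mem 80
```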

Full Service Restart

cd ~/ravenhelm/services/bifrost

# Graceful restart
docker compose restart

# Full recreate (if issues persist)
docker compose down
docker compose up -d

# Watch logs
docker compose logs -f --tail 50

Verification

# Test API health
curl -s https://bifrost.ravenhelm.dev/health | jq

# Test a simple tool call (list tools)
curl -s https://bifrost.ravenhelm.dev/tools | jq '.tools | length'

# Check admin UI is accessible
curl -s -o /dev/null -w '%{http_code}\n' https://bifrost-admin.ravenhelm.dev/
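
The verification steps can be wrapped so that each one prints a clear PASS or FAIL line. This is a sketch only: run_check and has_tools are hypothetical helpers, and has_tools assumes the /tools response contains a .tools array, as the jq filter above implies.

```shell
failures=0

# Run one named check and record whether it passed.
# Usage: run_check <name> <command> [args...]
run_check() {
  local name=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $name"
  else
    echo "FAIL  $name"
    failures=$((failures + 1))
  fi
}

# Succeeds only if the tools endpoint lists at least one tool.
has_tools() {
  [ "$(curl -sf https://bifrost.ravenhelm.dev/tools | jq '.tools | length')" -gt 0 ]
}

# Usage:
#   run_check "api health" curl -sf https://bifrost.ravenhelm.dev/health
#   run_check "tool list"  has_tools
#   run_check "admin ui"   curl -sf -o /dev/null https://bifrost-admin.ravenhelm.dev/
#   echo "$failures check(s) failed"
```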

Escalation

If issues persist:

  1. Check if upstream MCP servers are healthy
  2. Review recent changes to Bifrost configuration
  3. Check Grafana for anomalous metrics patterns