# View LLM Traces

Debug AI interactions using Langfuse tracing.

## Overview

Langfuse captures every LLM interaction, including:
- Prompts and completions
- Token usage and costs
- Latency metrics
- Tool calls and responses
- Conversation context
Access at: https://langfuse.ravenhelm.dev
## When to Use
- AI gave an unexpected response
- Tool execution failed
- Performance seems slow
- Debugging prompt engineering
- Analyzing token costs
## Finding a Trace

### By Time
- Go to langfuse.ravenhelm.dev
- Click Traces in sidebar
- Filter by time range (last hour, today, etc.)
- Click on a trace to expand
### By User

- Go to Traces
- Filter by User ID
- Search for the user's email or ID
### By Session
Each conversation has a session ID:
- Filter by Session ID
- See all messages in that conversation
## Understanding a Trace

### Trace View
Trace: "What is on my todo list?"
├── Generation: GPT-4 (1.2s, 450 tokens)
│ ├── Input: System prompt + user message
│ ├── Output: Tool call - query_tasks
│ └── Metadata: temperature=0.7, model=gpt-4
│
├── Span: Tool Execution (0.8s)
│ ├── Tool: query_tasks
│ ├── Arguments: {"status": "open", "limit": 10}
│ └── Result: [3 tasks returned]
│
└── Generation: GPT-4 (0.6s, 200 tokens)
├── Input: Tool result + conversation
└── Output: "You have 3 tasks..."
### Key Metrics
| Metric | What It Means |
|---|---|
| Latency | Total time for the interaction |
| Tokens | Input + output token count |
| Cost | Estimated API cost |
| Model | Which LLM was used |
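The Cost column can be sanity-checked by recomputing from the token counts. A minimal sketch; the per-token rates below are placeholder assumptions, not actual provider pricing:

```python
# Rough cost estimate from a trace's token counts.
# NOTE: rates are illustrative placeholders; substitute your provider's pricing.
INPUT_RATE = 0.03 / 1000   # assumed $ per input token (GPT-4-class pricing)
OUTPUT_RATE = 0.06 / 1000  # assumed $ per output token

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Example: the first generation above (450 tokens total, say 350 in / 100 out)
print(f"${estimate_cost(350, 100):.4f}")  # -> $0.0165
```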
## Common Debugging Scenarios

### Wrong Tool Selected

**Symptom:** Norns called the wrong tool.

**Debug Steps:**
- Find the trace in Langfuse
- Look at the Generation step
- Check the system prompt: does it describe the tools correctly?
- Check the user input: was it ambiguous?
- Look at tool descriptions in Bifrost
**Example Finding:**

```
User: "Add eggs"
Tool called: update_task (wrong)
Expected: add_shopping_item
```

Issue: "Add" is ambiguous; it could mean a task or a shopping-list item.
Solution: Improve the tool descriptions or add a disambiguation step, as in the sketch below.
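A sharper tool description often resolves this. A sketch in OpenAI-style function-calling format; these descriptions are hypothetical, not Bifrost's actual definitions:

```python
# Hypothetical tool definitions with disambiguating descriptions.
# (Illustrative only; Bifrost's real schemas may differ.)
tools = [
    {
        "name": "add_shopping_item",
        "description": (
            "Add an item to the shopping list. Use this when the user "
            "mentions groceries or things to buy, e.g. 'add eggs'."
        ),
        "parameters": {
            "type": "object",
            "properties": {"item": {"type": "string"}},
            "required": ["item"],
        },
    },
    {
        "name": "update_task",
        "description": (
            "Modify an existing task. Only use this when the user refers "
            "to a task that already exists, never for new items."
        ),
        "parameters": {
            "type": "object",
            "properties": {"task_id": {"type": "string"}},
            "required": ["task_id"],
        },
    },
]
```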
### Tool Execution Failed

**Symptom:** The tool returned an error.

**Debug Steps:**
- Find the trace
- Expand the Span for tool execution
- Check the error message
- Look at the input arguments: were they valid?
**Example Finding:**

```
Tool: create_task
Arguments: {"title": null}  <-- Problem!
Error: "title is required"
```

Issue: The LLM sent null for a required field.
Solution: Improve the prompt to emphasize required fields.
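Prompt changes help, but validating tool arguments before execution catches this class of failure deterministically. A minimal sketch using pydantic; the CreateTaskArgs schema is an assumption about the tool's real definition:

```python
from pydantic import BaseModel, ValidationError

# Assumed schema for the create_task tool; adjust to the real definition.
class CreateTaskArgs(BaseModel):
    title: str                 # required; null/missing will fail validation
    status: str = "open"       # optional with a default

def run_create_task(raw_args: dict):
    try:
        args = CreateTaskArgs(**raw_args)
    except ValidationError as e:
        # Return the error to the LLM so it can retry with valid arguments.
        return {"error": e.errors()}
    return {"ok": True, "title": args.title}

print(run_create_task({"title": None}))  # -> validation error, not a crash
```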
### Slow Response

**Symptom:** The response took too long.

**Debug Steps:**
- Find the trace
- Look at the latency breakdown:
  - Generation time (LLM)
  - Tool execution time
  - Network overhead
**Example Finding:**

```
Total: 8.5s
├── Generation 1: 1.2s (normal)
├── Tool (query_tasks): 6.0s  <-- Problem!
└── Generation 2: 0.8s (normal)
```

Issue: The database query behind the tool is too slow.
Solution: Add an index or optimize the query.
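If a custom tool does not appear as its own span, wrap its execution in one so slow steps show up in the breakdown. A sketch against the v2 Langfuse Python SDK, reusing the trace object from "Setting Up Tracing" below; query_tasks stands in for your tool function:

```python
# Assumes `trace` was created as in "Setting Up Tracing" below,
# and that query_tasks is your tool implementation.
span = trace.span(name="tool:query_tasks", input={"status": "open", "limit": 10})
result = query_tasks(status="open", limit=10)
span.end(output=result)  # Langfuse records the span's duration automatically
```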
### Context Too Long

**Symptom:** Conversation quality degraded, or errors occurred.

**Debug Steps:**
- Check token count in trace
- Look at input to later generations
- Check whether the context window was exceeded
**Example Finding:**

```
Generation 15:
Input tokens: 12,000
Model limit: 8,000
```

Issue: The context window was exceeded.
Solution: Implement conversation summarization.
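A cheap guard is to count tokens before each call and summarize once the conversation nears the limit. A sketch using tiktoken; the 8,000-token limit and the summarize() helper are assumptions for illustration:

```python
import tiktoken

MODEL_LIMIT = 8_000   # assumed context window for the model in use
HEADROOM = 1_000      # space reserved for the completion

enc = tiktoken.encoding_for_model("gpt-4")

def count_tokens(messages: list[dict]) -> int:
    # Rough count: message content only; roles and formatting add a few tokens each.
    return sum(len(enc.encode(m["content"])) for m in messages)

def maybe_summarize(messages: list[dict]) -> list[dict]:
    # summarize() is a hypothetical helper that condenses older turns into one message.
    if count_tokens(messages) > MODEL_LIMIT - HEADROOM:
        return [summarize(messages[:-4])] + messages[-4:]
    return messages
```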
## Useful Filters

### In Langfuse UI
| Filter | Use Case |
|---|---|
| Score < 0.5 | Find low-quality responses |
| Latency > 5s | Find slow interactions |
| Status = Error | Find failures |
| Model = gpt-4 | Filter by model |
| Tags = production | Production traces only |
### Via API

```bash
# Get the 10 most recent traces
curl "https://langfuse.ravenhelm.dev/api/public/traces?limit=10" \
  -H "Authorization: Bearer $LANGFUSE_API_KEY"
```
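The same query works from Python. A sketch using the v2 SDK's fetch_traces method (check that your SDK version includes it), with the client from "Setting Up Tracing" below:

```python
# Fetch recent traces for one user; filters mirror the UI options.
# `langfuse` is the client constructed in "Setting Up Tracing".
resp = langfuse.fetch_traces(limit=10, user_id="user@example.com")
for t in resp.data:
    print(t.id, t.name, t.timestamp)
```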
## Setting Up Tracing

Tracing is automatic for Norns. For custom integrations:
```python
from langfuse import Langfuse

langfuse = Langfuse(
    public_key="pk-...",
    secret_key="sk-...",
    host="https://langfuse.ravenhelm.dev",
)

# Create a trace
trace = langfuse.trace(
    name="my-interaction",
    user_id="user@example.com",
)

# Log a generation
trace.generation(
    name="llm-call",
    model="gpt-4",
    input=messages,
    output=response,
)

# Events are batched in the background; flush before the process exits
langfuse.flush()
```
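On the v2 Python SDK there is also a decorator API that creates the trace and nests observations automatically, which is often less code. A minimal sketch; handle_message and ask_llm are hypothetical names:

```python
from langfuse.decorators import observe

# @observe() traces the decorated call; nested decorated calls
# appear as child observations on the same trace.
@observe()
def handle_message(user_input: str) -> str:
    return ask_llm(user_input)  # hypothetical LLM call
```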
## See Also
- [[Debug-Failing-Service]] - Container debugging
- [[Query-Logs]] - Log analysis
- [[../AI-ML-Platform/Langfuse]] - Langfuse setup
- [[../Observability/Grafana]] - Metrics dashboards