View LLM Traces

Debug AI interactions using Langfuse tracing.


Overview

Langfuse captures every LLM interaction, including:

  • Prompts and completions
  • Token usage and costs
  • Latency metrics
  • Tool calls and responses
  • Conversation context

Access at: https://langfuse.ravenhelm.dev


When to Use

  • AI gave an unexpected response
  • Tool execution failed
  • Performance seems slow
  • Debugging prompt engineering
  • Analyzing token costs

Finding a Trace

By Time

  1. Go to langfuse.ravenhelm.dev
  2. Click Traces in sidebar
  3. Filter by time range (last hour, today, etc.)
  4. Click on a trace to expand

By User

  1. Go to Traces
  2. Filter by User ID
  3. Find your user email or ID

By Session

Each conversation has a session ID:

  1. Filter by Session ID
  2. See all messages in that conversation
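
Filtering by session only works if the session ID was attached when the trace was created. Norns does this automatically; for custom integrations, a minimal sketch using the Python SDK (the ID values are placeholders):

from langfuse import Langfuse

langfuse = Langfuse(host="https://langfuse.ravenhelm.dev")  # keys read from environment

# Attach user and session IDs so the trace shows up under these filters
trace = langfuse.trace(
    name="chat-turn",
    user_id="user@example.com",      # placeholder - enables the By User filter
    session_id="conversation-1234",  # placeholder - groups the whole conversation
)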

Understanding a Trace

Trace View

Trace: "What is on my todo list?"
├── Generation: GPT-4 (1.2s, 450 tokens)
│ ├── Input: System prompt + user message
│ ├── Output: Tool call - query_tasks
│ └── Metadata: temperature=0.7, model=gpt-4

├── Span: Tool Execution (0.8s)
│ ├── Tool: query_tasks
│ ├── Arguments: {"status": "open", "limit": 10}
│ └── Result: [3 tasks returned]

└── Generation: GPT-4 (0.6s, 200 tokens)
├── Input: Tool result + conversation
└── Output: "You have 3 tasks..."

Key Metrics

Metric     What It Means
---------  ------------------------------
Latency    Total time for the interaction
Tokens     Input + output token count
Cost       Estimated API cost
Model      Which LLM was used
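
Cost is derived from the token counts and the model's per-token price. A rough back-of-the-envelope check (the prices below are placeholders, not real model pricing; Langfuse computes this automatically when model prices are configured):

# Rough cost estimate from a trace's token counts.
# The per-1K prices are placeholders, not actual rates.
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_1k: float = 0.03,
                  output_price_per_1k: float = 0.06) -> float:
    return (input_tokens / 1000) * input_price_per_1k \
         + (output_tokens / 1000) * output_price_per_1k

# e.g. roughly 450 input and 200 output tokens, as in the trace above
print(f"~${estimate_cost(450, 200):.4f}")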

Common Debugging Scenarios

Wrong Tool Selected

Symptom: Norns called the wrong tool

Debug Steps:

  1. Find the trace in Langfuse
  2. Look at the Generation step
  3. Check the system prompt - does it describe tools correctly?
  4. Check the user input - was it ambiguous?
  5. Look at tool descriptions in Bifrost

Example Finding:

User: "Add eggs"
Tool called: update_task (wrong)
Expected: add_shopping_item

Issue: "Add" is ambiguous - could be task or shopping
Solution: Improve tool descriptions or add disambiguation
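
One way to disambiguate is to make each tool description state explicitly what it is and is not for. A sketch in OpenAI function-calling style (the names and wording are illustrative; the real definitions live in Bifrost):

# Illustrative tool definition - explicit about shopping vs. tasks
add_shopping_item = {
    "type": "function",
    "function": {
        "name": "add_shopping_item",
        "description": (
            "Add an item to the shopping list. Use for groceries and purchases "
            "(e.g. 'add eggs', 'buy milk'). Do NOT use for to-do items; "
            "use a task tool for those."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "item": {"type": "string", "description": "Item to add, e.g. 'eggs'"}
            },
            "required": ["item"],
        },
    },
}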

Tool Execution Failed

Symptom: Tool returned an error

Debug Steps:

  1. Find the trace
  2. Expand the Span for tool execution
  3. Check the error message
  4. Look at input arguments - were they valid?

Example Finding:

Tool: create_task
Arguments: {"title": null} <-- Problem!
Error: "title is required"

Issue: LLM sent null for required field
Solution: Improve prompt to emphasize required fields
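
The schema itself can reinforce this: marking the field as required and repeating it in the description gives the model two chances to get it right. A hedged sketch (parameter names are illustrative):

# Illustrative schema - 'title' is required and the description repeats it
create_task = {
    "type": "function",
    "function": {
        "name": "create_task",
        "description": "Create a new task. 'title' is required and must be a non-empty string.",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string", "description": "Short task title (required, never null)"},
                "due_date": {"type": "string", "description": "Optional ISO 8601 due date"},
            },
            "required": ["title"],
        },
    },
}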

Slow Response

Symptom: The assistant took too long to respond

Debug Steps:

  1. Find the trace
  2. Look at latency breakdown:
    • Generation time (LLM)
    • Tool execution time
    • Network overhead

Example Finding:

Total: 8.5s
├── Generation 1: 1.2s (normal)
├── Tool (query_tasks): 6.0s <-- Problem!
└── Generation 2: 0.8s (normal)

Issue: Database query too slow
Solution: Add index or optimize query
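
The latency breakdown is only as useful as the spans that feed it. If a step is missing from the trace, it can be wrapped in its own span so its duration shows up next time. A minimal sketch with the Python SDK (query_tasks here is a placeholder for the real tool call):

from langfuse import Langfuse

langfuse = Langfuse()  # keys and host read from environment

trace = langfuse.trace(name="todo-question")

# Wrap the tool call in a span so its duration appears in the breakdown
span = trace.span(name="tool:query_tasks", input={"status": "open", "limit": 10})
result = query_tasks(status="open", limit=10)  # placeholder for the real tool
span.end(output=result)  # end() records the elapsed time on the span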

Context Too Long

Symptom: Conversation quality degrades, or later requests fail with errors

Debug Steps:

  1. Check token count in trace
  2. Look at input to later generations
  3. Check if context window exceeded

Example Finding:

Generation 15:
Input tokens: 12,000
Model limit: 8,000

Issue: Exceeded context window
Solution: Implement conversation summarization
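
Until summarization is in place, a crude guard is to trim the oldest turns before the prompt exceeds the window. A sketch assuming tiktoken and an 8,000-token limit (both assumptions; a real fix would summarize rather than drop context):

import tiktoken

MAX_INPUT_TOKENS = 8_000  # assumed model limit, taken from the example above
enc = tiktoken.encoding_for_model("gpt-4")

def count_tokens(messages: list[dict]) -> int:
    return sum(len(enc.encode(m["content"])) for m in messages)

def trim_history(messages: list[dict]) -> list[dict]:
    # Keep the system prompt; drop the oldest user/assistant turns first
    system, rest = messages[:1], messages[1:]
    while rest and count_tokens(system + rest) > MAX_INPUT_TOKENS:
        rest = rest[1:]
    return system + rest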

Useful Filters

In Langfuse UI

Filter              Use Case
------------------  ---------------------------
Score < 0.5         Find low-quality responses
Latency > 5s        Find slow interactions
Status = Error      Find failures
Model = gpt-4       Filter by model
Tags = production   Production traces only

Via API

# Get recent traces (the public API uses Basic auth: public key as user, secret key as password)
curl -s "https://langfuse.ravenhelm.dev/api/public/traces?limit=10" \
  -u "$LANGFUSE_PUBLIC_KEY:$LANGFUSE_SECRET_KEY"
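
The same query works from Python via the SDK, which handles auth and pagination (a sketch assuming a v2-style SDK that exposes fetch_traces):

from langfuse import Langfuse

langfuse = Langfuse(host="https://langfuse.ravenhelm.dev")  # keys from environment

traces = langfuse.fetch_traces(limit=10)
for t in traces.data:
    print(t.id, t.name)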

Setting Up Tracing

Tracing is automatic for Norns. For custom integrations:

from langfuse import Langfuse

langfuse = Langfuse(
    public_key="pk-...",
    secret_key="sk-...",
    host="https://langfuse.ravenhelm.dev"
)

# Create a trace
trace = langfuse.trace(
    name="my-interaction",
    user_id="user@example.com"
)

# Log a generation
trace.generation(
    name="llm-call",
    model="gpt-4",
    input=messages,
    output=response
)
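
The SDK batches events and sends them in the background, so short-lived scripts should flush before exiting or the trace may never reach Langfuse:

# Send any buffered events before the process exits
langfuse.flush()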

See Also

  • [[Debug-Failing-Service]] - Container debugging
  • [[Query-Logs]] - Log analysis
  • [[../AI-ML-Platform/Langfuse]] - Langfuse setup
  • [[../Observability/Grafana]] - Metrics dashboards