
Tracing

Overview

judgeval's tracing module allows you to view your LLM application's execution from end-to-end.

Using tracing, you can:

  • Gain observability into every layer of your agentic system, from database queries to tool calling and text generation.
  • Measure the performance of each system component however you choose. For instance:
    • Catch regressions in retrieval quality, factuality, answer relevance, and 10+ other research-backed metrics.
    • Quantify the quality of each tool call your agent makes.
    • Track the latency of each system component.
    • Count the token usage of each LLM generation.
  • Export your workflow runs to the Judgment platform for real-time analysis or as a dataset for offline experimentation.

Tracing Your Workflow

Setting up tracing with judgeval takes two simple steps:

1. Initialize a tracer with your API keys and project name

from judgeval.tracer import Tracer

# loads from JUDGMENT_API_KEY and JUDGMENT_ORG_ID env vars
judgment = Tracer(project_name="my_project")

The Judgment tracer is a singleton object that should be shared across your application. Your project name will be used to organize your traces in one place on the Judgment platform.
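
Because the tracer is a singleton, a common pattern is to instantiate it once in a shared module and import that instance wherever you need it. Here's a minimal sketch (the module and function names are illustrative):

# tracer_setup.py -- create the shared tracer once
from judgeval.tracer import Tracer

judgment = Tracer(project_name="my_project")

# elsewhere in your application, reuse the same instance
from tracer_setup import judgment

@judgment.observe(span_type="tool")
def my_helper():
    return "shared tracer in action"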

2. Wrap your workflow components

judgeval provides wrapping mechanisms for your workflow components:

wrap()

The wrap() function wraps your LLM client (e.g. OpenAI, Anthropic, etc.) and captures metadata about your LLM calls, such as:

  • Latency
  • Token usage
  • Prompt/Completion
  • Model name

Here's an example of using wrap() on an OpenAI client:

from openai import OpenAI
from judgeval.tracer import wrap

client = wrap(OpenAI())

When using OpenAI streaming with a wrapped client, you need to explicitly enable token usage tracking by setting stream_options={"include_usage": True}. Otherwise, token counts won't be captured for streaming calls.

# Enable token counting with streaming
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
    stream_options={"include_usage": True}  # Required for token counting
)

@observe (Python) / observe() (Typescript)

The @observe decorator (Python) or the observe() higher-order function (Typescript) wraps your functions/tools and captures metadata about your function calls, such as:

  • Latency
  • Input/Output
  • Span type (e.g. retriever, tool, LLM call, etc.)

Here's an example of using the @observe decorator:

from judgeval.tracer import Tracer

# loads from JUDGMENT_API_KEY env var
judgment = Tracer(project_name="my_project")

@judgment.observe(span_type="tool")
def my_tool():
    print("Hello world!")

span_type is a string that you can use to categorize and organize your trace spans. Span types are displayed on the trace UI to easily navigate a visualization of your workflow. Common span types include tool, function, retriever, database, web search, etc.
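
For example, you might give your retrieval and database helpers different span types so they're easy to pick out in the trace UI (the functions below are illustrative):

@judgment.observe(span_type="retriever")
def fetch_documents(query):
    # e.g. query your vector store here
    return ["doc_1", "doc_2"]

@judgment.observe(span_type="database")
def lookup_user(user_id):
    # e.g. read the user record from your database here
    return {"id": user_id, "name": "Alice"}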

Automatic Deep Tracing

Judgeval includes automatic deep tracing, which significantly reduces the amount of instrumentation needed in your code. With deep tracing enabled (the default), you only need to observe top-level functions, and all nested function calls will be automatically traced.

How Deep Tracing Works

When you decorate a function with @observe (Python) or wrap it with observe() (TypeScript), the tracer automatically instruments all functions called within that function, creating a complete trace of your execution flow without requiring explicit decorators on every function.

# Deep tracing is enabled by default
judgment = Tracer(project_name="my_project")

# Only need to observe the top-level function
@judgment.observe(span_type="function")
def main():
    # These functions will be automatically traced without @observe
    result = helper_function()
    return process_result(result)

def helper_function():
    return "Helper result"

def process_result(result):
    return f"Processed: {result}"

main()  # Traces main, helper_function, and process_result

Disabling Deep Tracing

If you prefer more control over what gets traced, you can disable deep tracing:

# Disable deep tracing globally
judgment = Tracer(project_name="my_project", deep_tracing=False)

# Or disable for specific functions
@judgment.observe(span_type="function", deep_tracing=False)
def selective_function():
    helper_function()  # Won't be traced automatically

With deep tracing disabled, you'll need to explicitly observe each function you want to trace. You can still name and declare span types for each function using judgment.observe().
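
For example, with deep tracing disabled you would decorate every function you want to appear in the trace (a minimal sketch):

judgment = Tracer(project_name="my_project", deep_tracing=False)

@judgment.observe(span_type="function")
def main():
    # helper_function appears in the trace only because it is decorated below
    return helper_function()

@judgment.observe(span_type="tool")
def helper_function():
    return "Helper result"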

Putting it all Together

Here's a complete example of using judgeval's tracing mechanisms:

from judgeval.tracer import Tracer, wrap
from openai import OpenAI

openai_client = wrap(OpenAI())
# loads from JUDGMENT_API_KEY and JUDGMENT_ORG_ID env vars
judgment = Tracer(project_name="my_project")

@judgment.observe(span_type="tool")
def my_tool():
    return "Hello world!"

@judgment.observe(span_type="function")
def my_llm_call():
    message = my_tool()
    res = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": message}]
    )
    return res.choices[0].message.content

# This implicitly starts a trace if one isn't active
# and saves it upon completion or error.
main_result = my_llm_call()

The resulting trace will appear on the Judgment platform as a nested visualization of each span in your workflow.

Using Streaming with Token Counting

When using streaming responses with a wrapped client, you need to explicitly enable token usage tracking:

@judgment.observe(span_type="function")
def my_llm_streaming_call():
    # Enable token counting with streaming API calls
    stream = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Write a poem"}],
        stream=True,
        stream_options={"include_usage": True}  # Required for token counting
    )
    
    # Process the stream
    full_response = ""
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta and chunk.choices[0].delta.content:
            content = chunk.choices[0].delta.content
            full_response += content
            print(content, end="", flush=True)
    
    return full_response

Without setting stream_options={"include_usage": True}, token counts will not be captured for streaming API calls, and your usage metrics in traces will be incomplete.

3. Running Production Evaluations

Optionally, you can run asynchronous evaluations directly inside your traces.

This enables you to run evaluations on your production data in real-time, which can be useful for:

  • Guardrailing your production system against quality regressions (hallucinations, toxic responses, revealing private data, etc.).
  • Exporting production data for offline experimentation (e.g for A/B testing your workflow versions on relevant use cases).
  • Getting actionable insights on how to fix common failure modes in your workflow (e.g. missing knowledge base info, suboptimal prompts, etc.).

To execute an asynchronous evaluation, call judgment.async_evaluate() (Python), which attaches to the currently active trace, or trace.asyncEvaluate() (Typescript).

from judgeval.tracer import Tracer
from judgeval.scorers import AnswerRelevancyScorer
from judgeval.data import Example

judgment = Tracer(project_name="my_project")

@judgment.observe(span_type="function")
def main():
    query = "What is the capital of France?"
    res = "The capital of France is Paris."  # Replace with your workflow logic
    
    # Create an Example object to pass to async_evaluate
    example = Example(
        input=query,
        actual_output=res
    )
    
    # Run the evaluation with the Example object
    judgment.async_evaluate(
        scorers=[AnswerRelevancyScorer(threshold=0.5)],
        example=example,
        model="gpt-4o"
    )
    return res

main() # Call the observed function

Your async evaluations will be attached to the original trace on the Judgment platform, and a new evaluation will be created there as well.

Example: Music Recommendation Agent

In this video, we'll walk through all of the topics covered in this guide by tracing a simple OpenAI API-based music recommendation agent.

Advanced: Customizing Traces Using the Context Manager (Python) / Explicit Trace Client (Typescript)

In Python, if you need to customize your tracing context beyond the implicit behavior of @observe, you can use the with judgment.trace() context manager. In Typescript, you achieve similar control by explicitly creating a TraceClient instance using judgment.startTrace() and manually calling methods like save() or print() on it.

The explicit trace client allows you to save or print the state of the trace at any point in the workflow. This is useful for debugging or exporting any state of your workflow to run an evaluation from!

Any functions wrapped with judgment.observe() that are called while the TraceClient is active will automatically be associated with that trace.

Here's an example of using explicit trace management:

from judgeval.tracer import Tracer, wrap
from openai import OpenAI

judgment = Tracer(project_name="my_project")
client = wrap(OpenAI())

@judgment.observe(span_type="tool")
def my_tool():
    return "Hello world!"

def main():
    with judgment.trace(name="my_workflow") as trace:
        res = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": f"{my_tool()}"}]
        )
    
        trace.print()  # prints the state of the trace to console
        trace.save()  # saves the current state of the trace to the Judgment platform
        # Note: Python trace context likely saves automatically on exit

    return res.choices[0].message.content

In Python, the with judgment.trace() context manager should only be used if you need fine-grained control over the trace lifecycle. In Typescript, explicit management via startTrace() and trace.save() is the standard way to gain this control. In most simple cases in Python, the @observe decorator is sufficient.