Get Started

Judgeval is an Agent Behavior Monitoring (ABM) library that helps you track and judge any agent behavior in both online and offline environments. Judgeval also performs error analysis on agent trajectories and groups them by behavior and topic for deeper insight.

Judgeval is built and maintained by Judgment Labs. You can follow our latest updates via GitHub.

Quickstart

This quickstart will guide you through the core features of Judgeval and the Judgment Platform. By the end, you'll be familiar with the core concepts and be able to start monitoring your agents in production.

Install judgeval

Python:
uv add judgeval
pip install judgeval

TypeScript:
npm install judgeval
yarn add judgeval
pnpm add judgeval
bun add judgeval
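
If you want a quick sanity check that the Python package installed correctly, a minimal script (the filename is just an example) can simply import the names used throughout this quickstart:

check_install.py
# Verify that judgeval is importable after installation
from judgeval.tracer import Tracer, wrap  # imports used in the rest of this quickstart

print("judgeval installed and importable")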

Get your API keys

Head to the Judgment Platform and create an account. Then, copy your API key and Organization ID and set them as environment variables.

Get your free API keys

You get 50,000 free trace spans and 1,000 free evals each month. No credit card required.

.env
JUDGMENT_API_KEY="your_key_here"
JUDGMENT_ORG_ID="your_org_id_here"
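
judgeval reads these keys from your environment. If you keep them in a .env file, load it before initializing the tracer; here is a minimal sketch using python-dotenv (an assumption for this example, not a judgeval requirement):

load_env.py
# Load the .env file so judgeval can read JUDGMENT_API_KEY and JUDGMENT_ORG_ID
# Assumes python-dotenv is installed: `pip install python-dotenv`
import os
from dotenv import load_dotenv

load_dotenv()  # populates the environment from .env

assert os.getenv("JUDGMENT_API_KEY"), "JUDGMENT_API_KEY is not set"
assert os.getenv("JUDGMENT_ORG_ID"), "JUDGMENT_ORG_ID is not set"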

Trace your Agent

Tracing captures your agent's inputs, outputs, tool calls, and LLM calls to help you debug and analyze agent behavior.

Note: This example uses OpenAI. Make sure you have OPENAI_API_KEY set in your environment variables before running.

To properly trace your agent, you need to:

  • Apply the @judgment.observe() decorator to every function and tool in your agent
  • Use wrap() to instrument your LLM client so all LLM calls are tracked (e.g., wrap(OpenAI()))
trace_agent.py
from openai import OpenAI
from judgeval.tracer import Tracer, wrap
import time

judgment = Tracer(project_name="default_project")  # organizes traces
client = wrap(OpenAI())  # tracks all LLM calls

@judgment.observe(span_type="tool") 
def format_task(question: str) -> str:
    time.sleep(0.5)  # Simulate some processing delay
    return f"Please answer the following question: {question}"

@judgment.observe(span_type="tool") 
def answer_question(prompt: str) -> str:
    time.sleep(0.3) # Simulate some processing delay
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

@judgment.observe(span_type="function") 
def run_agent(question: str) -> str:
    task = format_task(question)
    answer = answer_question(task)
    return answer

if __name__ == "__main__":
    result = run_agent("What is the capital of the United States?")
    print(result)

In TypeScript, to properly trace your agent, you need to:

  • Wrap all of your agent's functions and tools with tracer.observe(...)
traceAgent.ts
import { Judgeval } from "judgeval";
import OpenAI from "openai";

const openai = new OpenAI({
    apiKey: process.env.OPENAI_API_KEY,
});

const client = Judgeval.create();

const tracer = await client.nodeTracer.create({
    projectName: "default_project",
});

const runAgent = tracer.observe(async function runAgent( 
    question: string
): Promise<string> {
    const task = await formatTask(question);
    const answer = await answerQuestion(task);
    return answer;
},
"function");

const formatTask = tracer.observe(async function formatTask( 
    question: string
): Promise<string> {
    await new Promise((resolve) => setTimeout(resolve, 500));
    return `Please answer the following question: ${question}`;
},
"tool");

const answerQuestion = tracer.observe(async function answerQuestion( 
    prompt: string
): Promise<string> {
    await new Promise((resolve) => setTimeout(resolve, 300));
    return await openAICompletion(prompt);
},
"tool");

const openAICompletion = tracer.observe(async function openAICompletion( 
    prompt: string
): Promise<string> {
    const response = await openai.chat.completions.create({
        model: "gpt-4o-mini",
        messages: [{ role: "user", content: prompt }],
    });
    return response.choices[0]?.message.content || "No answer";
},
"llm");

await runAgent("What is the capital of the United States?");
await tracer.shutdown();

Congratulations! You've just created your first trace. You can view it in your project on the Judgment Platform.

Create a Behavior Scorer

Online behavioral monitoring lets you run scorers directly on your agents in production. Engineers can be alerted the instant an agent misbehaves and make proactive fixes before customers are affected.

In Judgment, a Trace Prompt Scorer is a special type of scorer that runs on a full trace, given a prompt. You can create one by first navigating to the Scorers section in the sidebar of the Judgment Platform.

Click the New Scorer button, name it Helpfulness Scorer, and select Traces as the Scorer Type.

Configure your scorer:

  • Select a judge model (e.g., gpt-5)
  • Set your scorer prompt that defines the behavior to evaluate:
    Does the agent call relevant tools effectively to help the user with their request?
  • Set the threshold to the default value 0.5
  • Set choice scorers: "No": 0 and "Yes": 1

Finally, click Create Scorer to save your scorer.
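
Once saved, the scorer can be fetched by name from code. The snippet below previews the retrieval call used in the full example in the next step:

get_scorer.py
from judgeval.scorers import TracePromptScorer

# Fetch the Trace Prompt Scorer you just created on the Judgment Platform
trace_scorer = TracePromptScorer.get(name="Helpfulness Scorer")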

Monitor Your Agent Using Trace Prompt Scorers

Modify your agent's tracing to asynchronously evaluate behavior with the Helpfulness Scorer.

Add the scorer_config parameter to the @judgment.observe() decorator of the top-level function that invokes your agent; every trace that runs through it will then be scored automatically.

trace_agent.py
from openai import OpenAI
from judgeval.tracer import Tracer, TraceScorerConfig, wrap
from judgeval.scorers import TracePromptScorer 
import time

trace_scorer = TracePromptScorer.get(name="Helpfulness Scorer") 

judgment = Tracer(project_name="default_project")  # organizes traces
client = wrap(OpenAI())  # tracks all LLM calls

@judgment.observe(span_type="tool")
def format_task(question: str) -> str:
    time.sleep(0.5)  # Simulate some processing delay
    return f"Please answer the following question: {question}"

@judgment.observe(span_type="tool")
def answer_question(prompt: str) -> str:
    time.sleep(0.3) # Simulate some processing delay
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

@judgment.observe(span_type="function") 
@judgment.observe(span_type="function", scorer_config=TraceScorerConfig(scorer=trace_scorer)) 
def run_agent(question: str) -> str:
    task = format_task(question)
    answer = answer_question(task)

    return answer

if __name__ == "__main__":
    result = run_agent("What is the capital of the United States?")
    print(result)

In TypeScript, call the tracer's asyncTraceEvaluate() method inside an observed trace span.

traceAgent.ts
import { Judgeval } from "judgeval";
import OpenAI from "openai";

const openai = new OpenAI({
    apiKey: process.env.OPENAI_API_KEY,
});

const client = Judgeval.create();

const tracer = await client.nodeTracer.create({
    projectName: "default_project",
});

const tracePromptScorer = await client.scorers.tracePromptScorer.get(
    "Helpfulness Scorer"
);

const runAgent = tracer.observe(async function runAgent(
    question: string
): Promise<string> {
    tracer.asyncTraceEvaluate(tracePromptScorer); 

    const task = await formatTask(question);
    const answer = await answerQuestion(task);

    return answer;
},
"function");

const formatTask = tracer.observe(async function formatTask(
    question: string
): Promise<string> {
    await new Promise((resolve) => setTimeout(resolve, 500));
    return `Please answer the following question: ${question}`;
},
"tool");

const answerQuestion = tracer.observe(async function answerQuestion(
    prompt: string
): Promise<string> {
    await new Promise((resolve) => setTimeout(resolve, 300));
    return await openAICompletion(prompt);
},
"tool");

const openAICompletion = tracer.observe(async function openAICompletion(
    prompt: string
): Promise<string> {
    const response = await openai.chat.completions.create({
        model: "gpt-4o-mini",
        messages: [{ role: "user", content: prompt }],
    });
    return response.choices[0]?.message.content || "No answer";
},
"llm");

await runAgent("What is the capital of the United States?");
await tracer.shutdown();

Now incoming traces will be automatically evaluated with the Helpfulness Scorer. Check out your first agent trace with behavior monitoring in the Judgment Platform.

You can also use judgeval to send alerts when your agents misbehave in production.

Next Steps

Congratulations! You've just finished getting started with judgeval and the Judgment Platform.

Explore our features in more detail below:

  • Agent Scorers - Measure and optimize your agent along any behavioral rubric, using techniques such as LLM-as-a-judge and human-aligned rubrics.
  • Agent Behavior Monitoring - Take action when your agents misbehave in production: alert your team, add failure cases to datasets for later optimization, and more.