Get Started

Judgeval is an open-source SDK for agent improvement. It provides tracing and agent-judge evaluation for LLM-powered applications — so you can detect failures, understand what went wrong, and validate fixes against real production cases before shipping.

Judgeval is built and maintained by Judgment Labs. You can follow our latest updates via GitHub.

Use your coding agent

Install the Judgment skill so your coding agent can add tracing, evaluations, code judges, and monitoring with Judgment best practices.

Use your coding agent with this instruction so it can install the Judgment skill and apply it to your task.

Install the Judgment skill from github.com/JudgmentLabs/skills
and use it to add tracing to this application
following Judgment best practices.

Add the Judgment skill yourself:

npx skills add JudgmentLabs/skills --skill "judgment"

Then give your agent the task:

Add Judgment tracing to this application following best practices.

Judgment docs remain the source of truth for exact SDK signatures and API details. The Judgment skill adds workflow guidance: where to initialize tracing, which code paths to instrument, what mistakes to avoid, and how to verify traces appear.

Manual installation

This manual guide walks through the core features of Judgeval and the Judgment Platform. By the end, you'll be familiar with the core concepts and be able to start monitoring your agents in production.

Install judgeval

uv add judgeval

pip install judgeval

npm install judgeval

yarn add judgeval

pnpm add judgeval

bun add judgeval

Get your API keys

To get access to the Judgment Platform, book a demo on our website or reach out at contact@judgmentlabs.ai. Once you have an account, copy your API key and Organization ID and set them as environment variables.

Get your API keys

.env

JUDGMENT_API_KEY="your_key_here"
JUDGMENT_ORG_ID="your_org_id_here"

Trace your Agent

Tracing captures your agent's inputs, outputs, tool calls, and LLM calls to help you debug and analyze agent behavior.

Note: This example uses OpenAI. Make sure you have OPENAI_API_KEY set in your environment variables before running.

To properly trace your agent, you need to:

Use @Tracer.observe() decorator on all functions and tools of your agent
Use wrap() to instrument all LLM client calls (e.g., wrap(OpenAI()))

trace_agent.py

from openai import OpenAI
from judgeval import Tracer, wrap
import time

Tracer.init(project_name="default_project")  # organizes traces
client = wrap(OpenAI())  # tracks all LLM calls


@Tracer.observe(span_type="tool") 
def format_task(question: str) -> str:
    time.sleep(0.5)  # Simulate some processing delay
    return f"Please answer the following question: {question}"


@Tracer.observe(span_type="tool") 
def answer_question(prompt: str) -> str:
    time.sleep(0.3)  # Simulate some processing delay
    response = client.chat.completions.create(
        model="gpt-5.2", messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content


@Tracer.observe(span_type="function") 
def run_agent(question: str) -> str:
    task = format_task(question)
    answer = answer_question(task)
    return answer


if __name__ == "__main__":
    result = run_agent("What is the capital of the United States?")
    print(result)

To properly trace your agent, you need to:

Use Tracer.observe(...) to wrap all functions and tools of your agent

traceAgent.ts

import { Tracer } from "judgeval";
import OpenAI from "openai";

const openai = new OpenAI({
    apiKey: process.env.OPENAI_API_KEY,
});

await Tracer.init({
    projectName: "default_project",
});

const runAgent = Tracer.observe(async function runAgent( 
    question: string
): Promise<string> {
    const task = await formatTask(question);
    const answer = await answerQuestion(task);
    return answer;
},
"function");

const formatTask = Tracer.observe(async function formatTask( 
    question: string
): Promise<string> {
    await new Promise((resolve) => setTimeout(resolve, 500));
    return `Please answer the following question: ${question}`;
},
"tool");

const answerQuestion = Tracer.observe(async function answerQuestion( 
    prompt: string
): Promise<string> {
    await new Promise((resolve) => setTimeout(resolve, 300));
    return await openAICompletion(prompt);
},
"tool");

const openAICompletion = Tracer.observe(async function openAICompletion( 
    prompt: string
): Promise<string> {
    const response = await openai.chat.completions.create({
        model: "gpt-5.2",
        messages: [{ role: "user", content: prompt }],
    });
    return response.choices[0]?.message.content || "No answer";
},
"llm");

await runAgent("What is the capital of the United States?");
await Tracer.shutdown();

Congratulations! You've just created your first trace. It should look like this:

Create a Behavior

A Behavior is a tracked signal derived from a judge — it represents a specific way your agent can act (e.g., being helpful, failing to use tools correctly). Behaviors accumulate over time into a searchable record of how your agent behaves in production.

Navigate to the Behaviors section in the sidebar of the Judgment Platform and click New Behavior.

You'll be prompted to create a new judge or attach to an existing one. Select Create New Judge and configure it:

Name the judge Helpfulness Scorer
Select a judge model (e.g., gpt-5.2)

Set your judge prompt that defines the behavior to evaluate:

Does the agent call relevant tools effectively to help the user with their request?

Click Create to save your behavior. The Helpfulness Scorer judge will automatically score incoming traces and surface detected behaviors in the platform.

Monitor Your Agent

No code changes needed. Run your agent again with the same code from the previous step — incoming traces will automatically be scored by the Helpfulness Scorer and any detected behaviors will appear in the platform.

You can also instrument agents' behavior with judgeval to send alerts when agents are misbehaving in production.

Next Steps

Congratulations! You've just finished getting started with judgeval and the Judgment Platform.

Explore our features in more detail below:

Agent Judges - Measure and optimize your agent along any behavioral rubric, using techniques such as LLM-as-a-judge and human-aligned rubrics.
Agent Behavior Monitoring - Take action when your agents misbehave in production: alert your team, add failure cases to datasets for later optimization, and more.

Get Started

Get your API keys

On this page