Judgment Labs Logo

Agent Judges

Use natural language rubrics to evaluate agent behavior with agent-based evaluation.

Agent Judges (PromptScorer in the SDK) use natural language rubrics to evaluate agent behavior using the Judgment agent harness.

You can combine Agent Judges with Code Judges (Custom Scorers) just like any other judge in judgeval. See the SDK Reference for implementation details.

Create an Agent Judge

Within your project, go to the Judges tab in the sidebar.

Navigate to Judges section

Create your judge

Click New Judge. Choose the type of judge (binary, classification, or score).

Judge Type Selection

Then, click Next to configure the judge.

Configure the judge

Name the judge and feel free to add a description.

Judges are versioned, each version has the following configuration options:

Model: Choose your judge model

Prompt: Define your evaluation criteria (supports {{ mustache template }} variables)

Choices: (for categorical judges): Choices that the judge can pick from

Notes: Notes about the specific version

Agent Judge Configuration

Test the judge

Test your Agent Judge with custom inputs from the Playground tab in the right-side panel. You can enter these custom inputs manually or add from a dataset. In addition, you can also add other judges or different versions of the same judge.

Then click Run.

You'll see the judge's score and reasoning for each judge in the output.

Agent Judge Playground

Manage Agent Judges from the SDK

You can also create and update Agent Judges programmatically from the SDK — client.agent_judges in Python, client.agentJudges in TypeScript. This is useful for codifying judge rubrics alongside your agent code or rolling out rubric changes through your existing review process.

Create a judge

create_agent_judge.py
from judgeval import Judgeval

client = Judgeval(project_name="default_project")

judge = client.agent_judges.create(
    name="helpfulness",
    prompt="Score the assistant's helpfulness from 0 to 1.",
    model="gpt-5.2",
    score_type="numeric",
)
create_agent_judge.ts
import { Judgeval } from "judgeval";

const client = await Judgeval.create({ projectName: "default_project" });

const judge = await client.agentJudges.create({
    name: "helpfulness",
    prompt: "Score the assistant's helpfulness from 0 to 1.",
    model: "gpt-5.2",
    scoreType: "numeric",
});

The score type is one of "numeric", "binary", or "categorical". For categorical judges, pass categories (e.g. [{"name": "good", "description": "..."}, ...]); for numeric judges, the min and max score default to 0 and 1.

Update a judge

Updating any of prompt, model, categories, min_score, or max_score writes a new version of the judge. By default the server bumps the latest version's minor by 1 — matching the UI's default "save" behaviour.

update_agent_judge.py
client.agent_judges.update(
    judge_id=judge.judge_id,
    prompt="Updated rubric prompt.",
)

Pass source_major_version / source_minor_version to copy unspecified fields from a specific past version, or target_major_version / target_minor_version to write to a specific version slot.

update_agent_judge.ts
await client.agentJudges.update({
    judgeId: judge.judgeId,
    prompt: "Updated rubric prompt.",
});

Pass sourceMajorVersion / sourceMinorVersion to copy unspecified fields from a specific past version, or targetMajorVersion / targetMinorVersion to write to a specific version slot.


Template Variables

Agent Judge prompts support {{ mustache }} variables. How you use them depends on whether you're running online or offline.

Online judges don't need variables. When a trace is ingested, the judge automatically receives full context — span inputs, outputs, tool calls, and LLM calls — so the rubric can reference your agent's behavior directly without any placeholders.

Offline testing is where variables become useful. A dataset declares typed fields in its schema (e.g. golden_output, expected_category), and each example supplies values for them. Those fields map to {{ variables }} in your judge prompt, letting the judge compare the agent's actual trace output against the expected value. A judge is only compatible with a dataset when every placeholder in its prompt is a declared field in that dataset's schema.

For most online monitoring use cases, skip variables entirely — the judge already knows what your agent did.

See Offline Testing for how datasets, test configs, and runs fit together.


Next Steps