Agent Judges

Use natural language rubrics to evaluate agent behavior with agent-based evaluation.

Agent Judges (PromptScorer in the SDK) use natural language rubrics to evaluate agent behavior using the Judgment agent harness.

You can combine Agent Judges with Code Judges (Custom Scorers) just like any other judge in judgeval. See the SDK Reference for implementation details.

Create an Agent Judge

Within your project, go to the Judges tab in the sidebar.

Create your judge

Click New Judge. Choose the type of judge (binary, classification, or score).

Then, click Next to configure the judge.

Configure the judge

Name the judge and, optionally, add a description.

Judges are versioned. Each version has the following configuration options:

Model: Choose your judge model

Prompt: Define your evaluation criteria (supports {{ mustache template }} variables)

Choices (for categorical judges): The options the judge can pick from

Notes: Notes about the specific version

Test the judge

Test your Agent Judge with custom inputs from the Playground tab in the right-side panel. You can enter the inputs manually or add them from a dataset. You can also add other judges, or different versions of the same judge, to compare them on the same inputs.

Then click Run.

The output shows each judge's score and reasoning.

Manage Agent Judges from the SDK

You can also create and update Agent Judges programmatically with the Python SDK via client.agent_judges. This is useful for codifying judge rubrics alongside your agent code or rolling out rubric changes through your existing review process.

Create a judge

create_agent_judge.py
from judgeval import Judgeval

client = Judgeval(project_name="default_project")

judge = client.agent_judges.create(
    name="helpfulness",
    prompt="Score the assistant's helpfulness from 0 to 1.",
    model="gpt-5.2",
    score_type="numeric",
)

score_type is one of "numeric", "binary", or "categorical". For categorical judges, pass categories (e.g. [{"name": "good", "description": "..."}, ...]); for numeric judges, min_score and max_score default to 0 and 1.
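For a categorical judge, the arguments might be assembled as in the sketch below. It only builds and checks the keyword arguments locally; the final call is left commented out because it needs a configured client. The helper categorical_judge_kwargs, the judge name, the rubric text, and the category names are hypothetical. Only score_type, categories, and the name/description keys come from the description above.

```python
def categorical_judge_kwargs(name, prompt, model, categories):
    """Build keyword arguments for a categorical Agent Judge (hypothetical helper)."""
    # Local sanity checks only; these are assumptions, not documented server rules.
    if not categories:
        raise ValueError("a categorical judge needs at least one category")
    for category in categories:
        # Each category is a dict with a name and a description, as described above.
        missing = {"name", "description"} - category.keys()
        if missing:
            raise ValueError(f"category is missing keys: {sorted(missing)}")
    return {
        "name": name,
        "prompt": prompt,
        "model": model,
        "score_type": "categorical",
        "categories": categories,
    }


kwargs = categorical_judge_kwargs(
    name="tone",  # hypothetical judge name
    prompt="Classify the tone of the assistant's final reply.",
    model="gpt-5.2",
    categories=[
        {"name": "polite", "description": "Courteous and professional."},
        {"name": "rude", "description": "Dismissive or hostile."},
    ],
)

# With a configured client (see create_agent_judge.py):
# judge = client.agent_judges.create(**kwargs)
```

Keeping the arguments in a plain dict makes the rubric easy to review and diff alongside your agent code before it is sent to the server.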

Update a judge

Updating any of prompt, model, categories, min_score, or max_score writes a new version of the judge. By default, the server increments the latest version's minor version by 1, matching the UI's default "save" behavior.

update_agent_judge.py
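# Reuses `client` and `judge` from create_agent_judge.py.
# Changing the prompt writes a new version of the judge.
client.agent_judges.update(
    judge_id=judge.judge_id,
    prompt="Updated rubric prompt.",
)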

Pass source_major_version / source_minor_version to copy unspecified fields from a specific past version, or target_major_version / target_minor_version to write to a specific version slot.
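The version selectors above could be combined as in the following sketch. The helper update_kwargs, the judge ID, and the version numbers are hypothetical and chosen only for illustration; the four version parameter names are the ones listed above. The actual call is commented out because it needs a configured client and an existing judge.

```python
def update_kwargs(judge_id, source=None, target=None, **fields):
    """Assemble arguments for client.agent_judges.update (hypothetical helper).

    source and target are optional (major, minor) tuples.
    """
    kwargs = {"judge_id": judge_id, **fields}
    if source is not None:
        # Copy unspecified fields from this past version.
        kwargs["source_major_version"], kwargs["source_minor_version"] = source
    if target is not None:
        # Write the result to this specific version slot.
        kwargs["target_major_version"], kwargs["target_minor_version"] = target
    return kwargs


# Start from version 1.0 and write the new prompt to slot 2.0.
# The judge ID and version numbers are made up for illustration.
kwargs = update_kwargs(
    "judge_123",
    source=(1, 0),
    target=(2, 0),
    prompt="Updated rubric prompt.",
)

# With a configured client and an existing judge:
# client.agent_judges.update(**kwargs)
```

When neither selector is passed, the dict contains only judge_id and the changed fields, which corresponds to the default minor-version bump described earlier.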


Template Variables

Agent Judge prompts support {{ mustache }} variables. How you use them depends on whether you're running online or offline.

Online judges don't need variables. When a trace is ingested, the judge automatically receives full context — span inputs, outputs, tool calls, and LLM calls — so the rubric can reference your agent's behavior directly without any placeholders.

Offline testing is where variables become useful. When running against a golden dataset with OfflineTracer, you can attach static fields to each example (e.g. golden_output, expected_category) via example_fields. Those fields map to {{ variables }} in your judge prompt, letting the judge compare the agent's actual trace output against the expected value.
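To make the mapping from example fields to {{ variables }} concrete, here is a minimal local sketch of placeholder substitution. It is not Judgment's renderer: the platform performs the substitution for you, and full mustache syntax supports more than this regular expression handles. The function render_prompt and the sample rubric are hypothetical; golden_output is the example field name mentioned above.

```python
import re

# Matches {{ name }} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template, example_fields):
    """Substitute {{ name }} placeholders with values from example_fields."""

    def substitute(match):
        key = match.group(1)
        if key not in example_fields:
            raise KeyError(f"no example field for placeholder {key!r}")
        return str(example_fields[key])

    return PLACEHOLDER.sub(substitute, template)


prompt = "Return 1 if the agent's final answer matches {{ golden_output }}, else 0."
rendered = render_prompt(prompt, {"golden_output": "Paris"})
print(rendered)
# Return 1 if the agent's final answer matches Paris, else 0.
```

Raising on a missing field, rather than leaving the placeholder in the text, is a design choice for this sketch: it surfaces dataset rows that lack an expected value before any judge runs.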

For most online monitoring use cases, skip variables entirely — the judge already knows what your agent did.

See Offline Testing for how to attach example_fields to traces and run batch evaluations.


Next Steps