Introduction to Judges

How to build and use judges to track agent behavioral regressions

Evaluation provides tools to measure agent behavior, prevent regressions, and maintain quality at scale. By combining Code and Agent Judges, datasets, and testing frameworks, you can systematically catch behavioral drift before it impacts production.


Agent Judges

Agent Judges use natural-language rubrics for agent-based evaluation.

  • Define criteria using plain language on the platform

Code Judges

Code Judges (Custom Scorers) implement arbitrary scoring logic in Python code.

  • Full flexibility with any LLM, library, or custom logic
  • Server-hosted execution for production monitoring
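
To make "arbitrary scoring logic" concrete, here is a minimal sketch of the kind of logic a Code Judge might contain. It is plain Python and does not use judgeval's scorer interface, which this page does not show; `ScoreResult` and `keyword_coverage_score` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScoreResult:
    """Hypothetical result container; not a judgeval type."""
    score: float
    reason: str


def keyword_coverage_score(actual_output: str, required_keywords: list[str]) -> ScoreResult:
    """Score an output by the fraction of required keywords it mentions."""
    if not required_keywords:
        return ScoreResult(score=1.0, reason="no keywords required")

    text = actual_output.lower()
    # Case-insensitive substring match for each required keyword.
    missing = [kw for kw in required_keywords if kw.lower() not in text]
    score = (len(required_keywords) - len(missing)) / len(required_keywords)
    reason = "all keywords present" if not missing else "missing: " + ", ".join(missing)
    return ScoreResult(score=score, reason=reason)
```

Called on the two answers from the Quickstart with `["Washington"]` as the required keyword, a function like this would give the first answer full marks and the second a score of zero. Wrapping such logic in an actual Code Judge would follow the platform's own scorer documentation.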

Datasets

Datasets group examples for batch evaluation and team collaboration.

  • Import from JSON/YAML or export to Hugging Face
  • Automatically synced with the Judgment platform
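
As a rough illustration of what a JSON import involves, the sketch below parses a JSON array of examples and checks that each record has the fields used in the Quickstart (`input` and `actual_output`). It uses only the standard library; `load_examples` and `REQUIRED_FIELDS` are hypothetical names, and the platform's own dataset import call is not shown on this page.

```python
import json

# Field names taken from the Quickstart examples; treat them as an assumption
# about the file layout rather than a documented schema.
REQUIRED_FIELDS = ("input", "actual_output")


def load_examples(json_text: str) -> list[dict]:
    """Parse a JSON array of example records and validate required fields."""
    records = json.loads(json_text)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of example objects")

    for index, record in enumerate(records):
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            raise ValueError(f"example {index} is missing fields: {missing}")
    return records
```

Validating records before building a dataset surfaces malformed files early, rather than partway through a batch evaluation.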

Offline Testing

Offline Testing runs agents over eval sets with OfflineTracer and scores traces in batch.

  • Collect traces without affecting live monitoring
  • Score in batch with any judge; assert failures in CI
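
The "assert failures in CI" step can be sketched as a small gate over batch scores. This example assumes the scores have already been collected into a mapping from trace name to a float between 0 and 1; it does not call OfflineTracer, and `assert_scores_pass` and the 0.5 default threshold are hypothetical choices for illustration.

```python
def assert_scores_pass(scores: dict[str, float], threshold: float = 0.5) -> None:
    """Raise AssertionError if any trace scored below the threshold."""
    failures = {name: score for name, score in scores.items() if score < threshold}
    if failures:
        # List failing traces in a stable order so CI logs are easy to compare.
        details = ", ".join(
            f"{name}={score:.2f}" for name, score in sorted(failures.items())
        )
        raise AssertionError(
            f"{len(failures)} trace(s) scored below {threshold}: {details}"
        )
```

Calling a helper like this at the end of a test function makes the CI job fail with a message that names each regressed trace.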

Quickstart

Build and test judges:

agent_judge_evaluation.py
from judgeval import Judgeval
from judgeval.data.example import Example

# Connect to the project where evaluation results will be saved.
client = Judgeval(project_name="default_project")

# Two answers to the same question: one correct, one incorrect.
test_examples = [
    Example.create(
        input="What is the capital of the United States?",
        actual_output="The capital of the U.S. is Washington, D.C.",
    ),
    Example.create(
        input="What is the capital of the United States?",
        actual_output="I think it's New York City.",
    ),
]

# Score every example with the judge named "AccuracyScorer".
results = client.evaluation.create().run(
    examples=test_examples,
    scorers=["AccuracyScorer"],
    eval_run_name="accuracy_test",
)

Results are automatically saved to your project on the Judgment platform, where you can analyze performance across examples and iterate on your rubrics.


