Introduction to Judges

How to build and use judges to track agent behavioral regressions

Evaluation provides tools to measure agent behavior, prevent regressions, and maintain quality at scale. By combining code and LLM judges, datasets, and testing frameworks, you can systematically track behavioral drift before it impacts production.


LLM Judges

LLM Judges (Prompt Scorers) use natural language rubrics for LLM-as-judge evaluation.

  • Define criteria using plain language on the platform
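As a sketch of what LLM-as-judge evaluation does with such a rubric, the snippet below assembles a judge prompt from a plain-language criterion. The rubric text, function name, and prompt shape are all illustrative, not the platform's internal format:

```python
# Hypothetical sketch of LLM-as-judge prompting; the rubric and prompt
# layout are illustrative, not how the Judgment platform stores them.
RUBRIC = (
    "Score 1 if the response answers the question accurately; "
    "score 0 otherwise."
)

def build_judge_prompt(rubric: str, input_text: str, actual_output: str) -> str:
    """Assemble the evaluation prompt an LLM judge would receive."""
    return (
        f"You are an evaluator. Rubric: {rubric}\n"
        f"Question: {input_text}\n"
        f"Response: {actual_output}\n"
        "Reply with a single number, 0 or 1."
    )

prompt = build_judge_prompt(
    RUBRIC,
    "What is the capital of the United States?",
    "The capital of the U.S. is Washington, D.C.",
)
```

The judge model's numeric reply then becomes the example's score.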

Code Judges

Code Judges (Custom Scorers) implement arbitrary scoring logic in Python code.

  • Full flexibility with any LLM, library, or custom logic
  • Server-hosted execution for production monitoring
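To illustrate the kind of arbitrary logic a code judge can run, here is a minimal substring-match scorer. The function signature is hypothetical, not judgeval's Code Judge interface; real scorers can call any LLM, library, or custom logic:

```python
# Illustrative scoring logic only; the signature is hypothetical and
# not judgeval's Code Judge interface.
def score_contains_answer(actual_output: str, expected: str) -> float:
    """Return 1.0 if the expected fact appears in the output, else 0.0."""
    return 1.0 if expected.lower() in actual_output.lower() else 0.0

score_contains_answer("The capital of the U.S. is Washington, D.C.",
                      "washington, d.c.")  # → 1.0
score_contains_answer("I think it's New York City.",
                      "washington, d.c.")  # → 0.0
```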

Datasets

Datasets group examples for batch evaluation and team collaboration.

  • Import from JSON/YAML or export to HuggingFace
  • Automatically synced with the Judgment platform
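A JSON import might look like the sketch below, which parses records whose fields mirror what `Example.create()` takes in the quickstart. The field layout is an assumption; adapt it to your actual export format:

```python
import json

# Hypothetical JSON layout; adapt the field names to your export format.
raw = json.loads("""
[
  {"input": "What is the capital of the United States?",
   "actual_output": "The capital of the U.S. is Washington, D.C."}
]
""")

# Each record maps onto the fields Example.create() accepts.
examples = [
    {"input": r["input"], "actual_output": r["actual_output"]} for r in raw
]
```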

Regression Testing

Regression Testing runs evals as unit tests in CI/CD pipelines.

  • Integrates with pytest and standard testing frameworks
  • Automatically fails when scores drop below thresholds
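The threshold-gating idea can be sketched as a pytest-style test. The result values and the 0.8 threshold are illustrative; plug in the scores your eval run actually returns:

```python
# Hypothetical pytest-style regression gate; scores and threshold are
# illustrative stand-ins for real eval-run results.
THRESHOLD = 0.8

def assert_no_regression(scores: list[float], threshold: float = THRESHOLD) -> None:
    """Fail (raise AssertionError) if any score drops below the threshold."""
    failing = [s for s in scores if s < threshold]
    assert not failing, f"{len(failing)} example(s) scored below {threshold}"

def test_accuracy_does_not_regress():
    scores = [0.95, 0.88, 1.0]  # stand-in for real eval results
    assert_no_regression(scores)
```

Because the gate is a plain assertion, any pytest run in CI/CD fails the build automatically when scores drop.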

Quickstart

Build and test judges:

llm_judge_evaluation.py
from judgeval import Judgeval
from judgeval.v1.data.example import Example

client = Judgeval(project_name="default_project")

test_examples = [
    Example.create(
        input="What is the capital of the United States?",
        actual_output="The capital of the U.S. is Washington, D.C."  # correct
    ),
    Example.create(
        input="What is the capital of the United States?",
        actual_output="I think it's New York City."  # incorrect; should score low
    )
]

results = client.evaluation.create().run(
    examples=test_examples,
    scorers=["AccuracyScorer"],
    eval_run_name="accuracy_test"
)

Results are automatically saved to your project on the Judgment platform where you can analyze performance across different examples and iterate on your rubrics.

