Introduction to Judges
How to build and use judges to track agent behavioral regressions
Evaluation provides tools to measure agent behavior, prevent regressions, and maintain quality at scale. By combining code and LLM judges, datasets, and testing frameworks, you can systematically catch behavioral drift before it impacts production.
LLM Judges
LLM Judges (Prompt Scorers) use natural language rubrics for LLM-as-judge evaluation.
- Define criteria using plain language on the platform
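To make the idea concrete, a plain-language rubric for an LLM Judge might read like the following (illustrative wording, not a built-in judgeval rubric):

```
Score the response from 1 (poor) to 5 (excellent):
- Does the answer directly address the user's question?
- Is every factual claim in the answer correct?
- Is the answer free of irrelevant or fabricated content?
```

The judge model receives this rubric alongside the input and output and returns a score with its reasoning.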


Code Judges
Code Judges (Custom Scorers) implement arbitrary scoring logic in Python code.
- Full flexibility with any LLM, library, or custom logic
- Server-hosted execution for production monitoring
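As an illustration of the kind of logic a Code Judge can encapsulate, here is a minimal standalone scoring function. The `contains_answer_scorer` name and its 0.0–1.0 score convention are illustrative sketches, not part of the judgeval API:

```python
# Illustrative custom scoring logic: a sketch of the kind of check a Code
# Judge might run, not the judgeval API itself. Returns 1.0 when the
# expected answer appears in the agent's output (case-insensitive), else 0.0.
def contains_answer_scorer(expected: str, actual_output: str) -> float:
    """Score 1.0 if the expected answer is contained in the output."""
    return 1.0 if expected.lower() in actual_output.lower() else 0.0


good = contains_answer_scorer(
    "Washington, D.C.",
    "The capital of the U.S. is Washington, D.C."
)  # 1.0
bad = contains_answer_scorer(
    "Washington, D.C.",
    "I think it's New York City."
)  # 0.0
```

In practice a Code Judge can call out to any LLM or library; a deterministic string check like this is just the simplest case.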


Regression Testing
Regression Testing runs evals as unit tests in CI/CD pipelines.
- Integrates with pytest and standard testing frameworks
- Automatically fails when scores drop below thresholds
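The threshold pattern can be sketched as a plain pytest test. The `run_accuracy_eval` helper and the 0.8 threshold below are hypothetical stand-ins, not judgeval functions:

```python
# Hypothetical regression test: in a real pipeline, run_accuracy_eval would
# execute an eval run and return its aggregate score. It is stubbed here so
# the sketch is self-contained and runnable.
def run_accuracy_eval() -> float:
    return 0.92  # stubbed aggregate score


def test_accuracy_does_not_regress():
    score = run_accuracy_eval()
    # pytest marks the test (and the CI job) as failed when the score
    # drops below the chosen threshold.
    assert score >= 0.8, f"Accuracy regressed: {score:.2f} < 0.80"
```

Because the check is an ordinary assertion, it plugs into any pytest-based CI workflow without extra tooling.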


Quickstart
Build and test judges:
from judgeval import Judgeval
from judgeval.v1.data.example import Example

client = Judgeval(project_name="default_project")

test_examples = [
    Example.create(
        input="What is the capital of the United States?",
        actual_output="The capital of the U.S. is Washington, D.C."
    ),
    Example.create(
        input="What is the capital of the United States?",
        actual_output="I think it's New York City."
    )
]

results = client.evaluation.create().run(
    examples=test_examples,
    scorers=["AccuracyScorer"],
    eval_run_name="accuracy_test"
)

Next Steps
- Code Judges (Custom Scorers) - Code-defined judges using any LLM or library dependency
- LLM Judges (Prompt Scorers) - LLM-as-judge scorers defined by custom rubrics on the platform
- Monitor Agent Behavior in Production - Use judges to monitor your agent's performance in production

