Introduction to Judges
How to build and use judges to track agent behavioral regressions
Evaluation provides tools to measure agent behavior, prevent regressions, and maintain quality at scale. By combining Code Judges and Agent Judges with datasets and testing frameworks, you can detect behavioral drift before it reaches production.
Agent Judges
Agent Judges use natural language rubrics for agent-based evaluation.
- Define criteria using plain language on the platform


Code Judges
Code Judges (Custom Scorers) implement arbitrary scoring logic in Python code.
- Full flexibility with any LLM, library, or custom logic
- Server-hosted execution for production monitoring
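To make "arbitrary scoring logic" concrete, here is a minimal sketch of the kind of logic a Code Judge might contain. It is written in plain Python and deliberately does not use the judgeval scorer interface, which this page does not show; `ScoreResult` and `keyword_judge` are hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoreResult:
    """Hypothetical result container; not a judgeval type."""
    score: float
    reason: str


def keyword_judge(actual_output: str, required_keywords: List[str]) -> ScoreResult:
    """Score an output by the fraction of required keywords it mentions."""
    if not required_keywords:
        return ScoreResult(score=1.0, reason="no keywords required")

    text = actual_output.lower()
    # Case-insensitive substring match for each required keyword.
    missing = [kw for kw in required_keywords if kw.lower() not in text]
    score = (len(required_keywords) - len(missing)) / len(required_keywords)
    reason = "all keywords present" if not missing else "missing: " + ", ".join(missing)
    return ScoreResult(score=score, reason=reason)


good = keyword_judge("The capital of the U.S. is Washington, D.C.", ["Washington"])
bad = keyword_judge("I think it's New York City.", ["Washington"])
```

The same function body could call an LLM or any library instead of a substring check; only the scoring contract (an output in, a score and reason out) is what this sketch is meant to convey.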


Offline Testing
Offline Testing runs agents over eval sets with OfflineTracer and scores traces in batch.
- Collect traces without affecting live monitoring
- Score in batch with any judge; assert failures in CI
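The batch-score-then-assert pattern described above can be sketched as follows. This is an illustration under stated assumptions, not the OfflineTracer API: the trace shape (a dict with `input` and `output` keys), `score_batch`, `mentions_washington`, and the 0.5 threshold are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical trace shape: {"input": ..., "output": ...}
Trace = Dict[str, str]


def score_batch(
    traces: List[Trace],
    judge: Callable[[Trace], float],
    threshold: float = 0.5,
) -> List[Trace]:
    """Return the traces whose judge score falls below the threshold."""
    return [trace for trace in traces if judge(trace) < threshold]


def mentions_washington(trace: Trace) -> float:
    """Toy judge: 1.0 if the output names Washington, else 0.0."""
    return 1.0 if "washington" in trace["output"].lower() else 0.0


def check_no_regressions(traces: List[Trace]) -> None:
    """CI-style check: raise AssertionError if any trace fails the judge."""
    failures = score_batch(traces, mentions_washington)
    assert not failures, f"{len(failures)} trace(s) failed the judge"


# Traces collected ahead of time, e.g. from an offline run over an eval set.
collected = [
    {"input": "What is the capital of the United States?",
     "output": "The capital of the U.S. is Washington, D.C."},
    {"input": "What is the capital of the United States?",
     "output": "I think it's New York City."},
]

failures = score_batch(collected, mentions_washington)
```

In a CI job, a function like `check_no_regressions` would be called from a test so that any failing trace turns the build red.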


Quickstart
Build and test judges:
from judgeval import Judgeval
from judgeval.data.example import Example

client = Judgeval(project_name="default_project")

# Two examples for the same question: one correct answer, one incorrect.
test_examples = [
    Example.create(
        input="What is the capital of the United States?",
        actual_output="The capital of the U.S. is Washington, D.C."
    ),
    Example.create(
        input="What is the capital of the United States?",
        actual_output="I think it's New York City."
    )
]

# Run the named scorer over the examples.
results = client.evaluation.create().run(
    examples=test_examples,
    scorers=["AccuracyScorer"],
    eval_run_name="accuracy_test"
)

Next Steps
- Code Judges (Custom Scorers) - Code-defined judges using any LLM or library dependency
- Agent Judges - Agent-based judges defined by custom rubrics on the platform
- Monitor Agent Behavior in Production - Use judges to monitor your agent's performance in production

