Introduction to Agent Scorers
How to build and use scorers to track agent behavioral regressions
Evaluation provides the tools to measure agent behavior, prevent regressions, and maintain quality at scale. By combining custom scorers, prompt-based evaluation, datasets, and testing frameworks, you can systematically catch behavioral drift before it impacts production.
Prompt Scorers
Prompt Scorers use natural language rubrics for LLM-as-judge evaluation.
- Define criteria using plain language on the platform (usage sketch after this list)
- TracePromptScorers evaluate full traces instead of individual examples
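Below is a minimal sketch of running a platform-defined prompt scorer from the SDK. It assumes a rubric named "Helpfulness" has already been created on the platform, and the `PromptScorer.get` loader and its import path are assumptions about the SDK surface, so check the Prompt Scorers page for the exact call; `test_examples` refers to the list built in the Quickstart below.

```python
# A minimal sketch, not the definitive API: `PromptScorer.get` and its import
# path are assumptions -- see the Prompt Scorers docs for the exact loader.
from judgeval import JudgmentClient
from judgeval.scorers import PromptScorer  # import path is an assumption

client = JudgmentClient()

# "Helpfulness" is a hypothetical rubric already defined on the platform
helpfulness = PromptScorer.get(name="Helpfulness")

results = client.run_evaluation(
    examples=test_examples,  # the QuestionAnswer examples from the Quickstart below
    scorers=[helpfulness],
    project_name="default_project",
)
```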
Custom Scorers
Custom Scorers implement arbitrary scoring logic in Python code.
- Full flexibility with any LLM, library, or custom logic (see the LLM-judge sketch below)
- Server-hosted execution for production monitoring
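For example, a custom scorer can call an LLM judge inside its scoring method. The sketch below assumes an OpenAI API key is configured; the judge model, prompt, and "Helpfulness" rubric are illustrative assumptions, and any LLM, library, or deterministic logic works the same way.

```python
# A minimal LLM-as-judge sketch. The judge model, prompt wording, and the
# Helpfulness rubric are illustrative assumptions, not part of the SDK.
from openai import AsyncOpenAI
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer

llm = AsyncOpenAI()  # assumes OPENAI_API_KEY is set

class QAExample(Example):
    question: str
    answer: str

class HelpfulnessScorer(ExampleScorer):
    name: str = "Helpfulness Scorer"

    async def a_score_example(self, example: QAExample):
        # Ask the judge model for a 0-1 helpfulness rating
        response = await llm.chat.completions.create(
            model="gpt-4o-mini",  # model choice is an assumption
            messages=[{
                "role": "user",
                "content": (
                    "Rate how helpful this answer is to the question on a scale "
                    "from 0 to 1. Reply with only the number.\n\n"
                    f"Question: {example.question}\nAnswer: {example.answer}"
                ),
            }],
        )
        score = float(response.choices[0].message.content.strip())
        self.reason = f"Judge model rated the answer {score}"
        return score
```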
Regression Testing
Regression Testing runs evals as unit tests in CI/CD pipelines.
- Integrates with pytest and standard testing frameworks (see the pytest sketch below)
- Automatically fails when scores drop below thresholds
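As a sketch of how this can look under pytest, the test below reuses the `AccuracyScorer` and `test_examples` defined in the Quickstart that follows; the per-result `success` flag is an assumption about the shape of `run_evaluation`'s return value, so check the SDK reference (or assert on the raw scores) in practice.

```python
# A minimal pytest sketch, assuming AccuracyScorer and test_examples from the
# Quickstart below are importable and each result exposes a `success` flag.
from judgeval import JudgmentClient

def test_agent_identifies_capital():
    client = JudgmentClient()
    results = client.run_evaluation(
        examples=test_examples,      # from the Quickstart below
        scorers=[AccuracyScorer()],  # from the Quickstart below
        project_name="default_project",
    )
    # Fail the CI run if any example scores below the threshold
    assert all(result.success for result in results)
```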
Quickstart
Build and test scorers:
```python
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer

client = JudgmentClient()

# Define your own data structure
class QuestionAnswer(Example):
    question: str
    answer: str

# Create your behavioral rubric
class AccuracyScorer(ExampleScorer):
    name: str = "Accuracy Scorer"

    async def a_score_example(self, example: QuestionAnswer):
        # Custom scoring logic for agent behavior
        # You can import dependencies, combine LLM judges with logic, and more
        if "washington" in example.answer.lower():
            self.reason = "Answer correctly identifies Washington"
            return 1.0
        else:
            self.reason = "Answer doesn't mention Washington"
            return 0.0

# Example agent outputs to test your rubric on
test_examples = [
    QuestionAnswer(
        question="What is the capital of the United States?",
        answer="The capital of the U.S. is Washington, D.C."
    ),
    QuestionAnswer(
        question="What is the capital of the United States?",
        answer="I think it's New York City."
    )
]

# Run the evaluation
results = client.run_evaluation(
    examples=test_examples,
    scorers=[AccuracyScorer()],
    project_name="default_project"
)
```

Next Steps
- Custom Scorers - Code-defined scorers using any LLM or library dependency
- Prompt Scorers - LLM-as-a-judge scorers defined by custom rubrics on the platform
- Monitor Agent Behavior in Production - Use scorers to monitor your agent's performance in production