Introduction to Agent Scorers

How to build and use scorers to track agent behavioral regressions

Evaluation provides tools to measure agent behavior, prevent regressions, and maintain quality at scale. By combining custom scorers, prompt-based evaluation, datasets, and testing frameworks, you can systematically track behavioral drift before it impacts production.


Prompt Scorers

Prompt Scorers use natural language rubrics for LLM-as-judge evaluation.

  • Define criteria using plain language on the platform
  • TracePromptScorers evaluate full traces instead of individual examples
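
For example, a rubric might simply read: "Give a score of 1.0 if the answer directly addresses the user's question; otherwise give 0.0." The exact wording is illustrative; rubrics are plain-language text you write on the platform.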

Custom Scorers

Custom Scorers implement arbitrary scoring logic in Python code.

  • Full flexibility with any LLM, library, or custom logic
  • Server-hosted execution for production monitoring
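
For instance, a custom scorer can pair an LLM judge with deterministic checks. The sketch below is illustrative rather than canonical: it assumes the OpenAI Python SDK with an OPENAI_API_KEY set, and the file name, model name, and prompt are placeholders. The ExampleScorer pattern itself matches the Quickstart below.

llm_judge_scorer.py
from openai import AsyncOpenAI

from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer

oai = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

class QuestionAnswer(Example):
    question: str
    answer: str

class HelpfulnessScorer(ExampleScorer):
    name: str = "Helpfulness Scorer"

    async def a_score_example(self, example: QuestionAnswer):
        # Deterministic check: empty answers always fail
        if not example.answer.strip():
            self.reason = "Empty answer"
            return 0.0

        # LLM judge: ask a model for a YES/NO verdict
        response = await oai.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{
                "role": "user",
                "content": (
                    "Does the answer address the question? Reply YES or NO.\n"
                    f"Question: {example.question}\n"
                    f"Answer: {example.answer}"
                ),
            }],
        )
        verdict = (response.choices[0].message.content or "").strip().upper()
        self.reason = f"LLM judge verdict: {verdict}"
        return 1.0 if verdict.startswith("YES") else 0.0

A scorer like this is passed to client.run_evaluation() exactly like the AccuracyScorer in the Quickstart.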

Datasets

Datasets group examples for batch evaluation and team collaboration.

  • Import from JSON/YAML or export to HuggingFace
  • Automatically synced with the Judgment platform
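
The sketch below is a minimal local approach rather than the managed Dataset API: it reads a JSON file of question/answer pairs (the file name and field names are illustrative) into Example subclasses ready for batch evaluation.

load_examples.py
import json

from judgeval.data import Example

class QuestionAnswer(Example):
    question: str
    answer: str

# qa_dataset.json is expected to contain a list of objects, e.g.
# [{"question": "...", "answer": "..."}, ...]
with open("qa_dataset.json") as f:
    records = json.load(f)

examples = [QuestionAnswer(**record) for record in records]

The resulting list can be passed as examples= to client.run_evaluation(), as in the Quickstart below.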

Regression Testing

Regression Testing runs evals as unit tests in CI/CD pipelines.

  • Integrates with pytest and standard testing frameworks
  • Automatically fails tests when scores drop below thresholds
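
As a minimal sketch of the idea (not the platform-integrated threshold mechanism), the test below assumes pytest and that the QuestionAnswer and AccuracyScorer classes from the Quickstart below are importable from custom_rubric.py; in practice, keep the module-level run_evaluation() call out of the import path, e.g. behind if __name__ == "__main__":.

test_accuracy.py
import asyncio

from custom_rubric import AccuracyScorer, QuestionAnswer

THRESHOLD = 1.0  # fail the test when the score drops below this value

def test_capital_answer_meets_threshold():
    example = QuestionAnswer(
        question="What is the capital of the United States?",
        answer="The capital of the U.S. is Washington, D.C.",
    )
    # Score the example directly with the custom scorer
    score = asyncio.run(AccuracyScorer().a_score_example(example))
    assert score >= THRESHOLD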

Quickstart

Build and test scorers:

custom_rubric.py
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer

client = JudgmentClient()

# Define your own data structure
class QuestionAnswer(Example):
    question: str
    answer: str

# Create your behavioral rubric
class AccuracyScorer(ExampleScorer):
    name: str = "Accuracy Scorer"

    async def a_score_example(self, example: QuestionAnswer):
        # Custom scoring logic for agent behavior
        # You can import dependencies, combine LLM judges with logic, and more
        if "washington" in example.answer.lower():
            self.reason = "Answer correctly identifies Washington"
            return 1.0
        else:
            self.reason = "Answer doesn't mention Washington"
            return 0.0

# Test your rubric on examples
test_examples = [
    QuestionAnswer(
        question="What is the capital of the United States?",
        answer="The capital of the U.S. is Washington, D.C."
    ),
    QuestionAnswer(
        question="What is the capital of the United States?",
        answer="I think it's New York City."
    )
]

# Test your rubric
results = client.run_evaluation(
    examples=test_examples,
    scorers=[AccuracyScorer()],
    project_name="default_project"
)

Results are automatically saved to your project on the Judgment platform, where you can analyze performance across different examples and iterate on your rubrics.


Next Steps