Introduction to Agentic Rubrics
How to build and use rubrics to track agent behavioral regressions
Agent behavior rubrics are systematic scoring frameworks that measure how your AI agents behave and perform in production with customers.
Quickstart
Build and iterate on your agent behavior rubrics to measure how your agents perform across specific behavioral dimensions:
```python
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer

client = JudgmentClient()

# Define your data structure
class QuestionAnswer(Example):
    question: str
    answer: str

# Create your behavioral rubric
class AccuracyScorer(ExampleScorer):
    name: str = "Accuracy Scorer"

    async def a_score_example(self, example: QuestionAnswer):
        # Custom scoring logic for agent behavior
        # You can import dependencies, combine LLM judges with logic, and more
        if "washington" in example.answer.lower():
            self.reason = "Answer correctly identifies Washington"
            return 1.0
        else:
            self.reason = "Answer doesn't mention Washington"
            return 0.0

# Build examples to test your rubric on
test_examples = [
    QuestionAnswer(
        question="What is the capital of the United States?",
        answer="The capital of the U.S. is Washington, D.C."
    ),
    QuestionAnswer(
        question="What is the capital of the United States?",
        answer="I think it's New York City."
    )
]

# Run the evaluation
results = client.run_evaluation(
    examples=test_examples,
    scorers=[AccuracyScorer()],
    project_name="default_project"
)
```
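Once the run finishes, you can inspect the returned results to see how each example scored and why. The loop below just prints the raw result objects; the exact structure of each result (per-scorer scores, reasons) varies by judgeval version, so treat any richer attribute access as an assumption and check the current docs.

```python
# Print the raw result objects returned by run_evaluation.
# NOTE: per-result field names vary by judgeval version, so this
# sketch avoids assuming a specific result schema.
for result in results:
    print(result)
```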
Evals in judgeval consist of two core components:

- Example objects contain the fields involved in the eval.
- Scorer objects contain the logic to score agent executions, using any combination of LLM judges and/or arbitrary code (see the sketch below).
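As a sketch of that hybrid approach, the scorer below combines a cheap code-level check with an LLM judge. The ExampleScorer subclassing pattern mirrors the quickstart above; the AsyncOpenAI client and the "gpt-4o-mini" model name are assumptions chosen for illustration, not part of judgeval's API.

```python
from openai import AsyncOpenAI  # assumed LLM judge backend for this sketch

from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer

llm = AsyncOpenAI()

class QuestionAnswer(Example):  # same shape as the quickstart
    question: str
    answer: str

class GroundedAnswerScorer(ExampleScorer):
    name: str = "Grounded Answer Scorer"

    async def a_score_example(self, example: QuestionAnswer):
        # Cheap code-level gate: empty or trivially short answers fail fast,
        # without spending an LLM call.
        if len(example.answer.strip()) < 10:
            self.reason = "Answer is too short to be useful"
            return 0.0

        # LLM judge for the harder question: does the answer actually
        # address what was asked? (Model name is an assumption.)
        response = await llm.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": (
                    "Does this answer address the question? Reply YES or NO.\n"
                    f"Question: {example.question}\nAnswer: {example.answer}"
                ),
            }],
        )
        verdict = response.choices[0].message.content.strip().upper()

        self.reason = f"LLM judge verdict: {verdict}"
        return 1.0 if verdict.startswith("YES") else 0.0
```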
Why use behavioral rubrics?
Agent behavior drifts as models evolve and new customer use cases emerge. Without systematic monitoring, you'll discover failures only after customers complain, like when your support agent hallucinates product information and tells your customers to buy from a competitor.
Behavioral rubrics catch these misbehaviors in real time, before users do, preventing customer churn, reducing manual QA overhead, and letting you scale confidently across a growing user base.
judgeval helps you create custom scoring rubrics that prevent churn by ensuring your agents behave as intended.
Build behavioral rubrics around three critical areas that directly impact customer satisfaction (a sketch for the first follows the list):
- Helpful responses: ensuring agents provide genuinely useful answers that address customer needs
- Factual accuracy: maintaining correctness specific to your agent's domain and tasks
- Seamless user experience: creating frictionless interactions that feel natural and efficient
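For instance, a helpfulness rubric might average several concrete sub-checks rather than relying on a single pass/fail signal. The criteria below (length, question-term overlap, deflection phrases) are illustrative assumptions, not a canonical helpfulness definition; swap in checks that match your product.

```python
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer

class QuestionAnswer(Example):  # same shape as the quickstart
    question: str
    answer: str

class HelpfulnessScorer(ExampleScorer):
    name: str = "Helpfulness Scorer"

    async def a_score_example(self, example: QuestionAnswer):
        answer = example.answer.lower()
        checks = {
            # Substantive: not a one-line brush-off.
            "substantive": len(answer.split()) >= 10,
            # On-topic: shares at least one content word with the question.
            "on_topic": any(
                word in answer
                for word in example.question.lower().split()
                if len(word) > 4
            ),
            # No deflection: doesn't punt the user elsewhere.
            "no_deflection": not any(
                phrase in answer
                for phrase in ("i can't help", "contact support", "not sure")
            ),
        }
        passed = sum(checks.values())
        self.reason = f"Passed {passed}/{len(checks)} checks: {checks}"
        return passed / len(checks)
```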
Run these scorers in production to detect agent misbehavior, get instant alerts, push fixes quickly, and surface your agents' failure patterns for analysis.
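A minimal monitoring loop could reuse run_evaluation on a sample of recent production traffic and alert when scores dip. This is a sketch, not judgeval's built-in monitoring: fetch_recent_interactions and send_alert are hypothetical helpers standing in for your own data access and alerting, the 0.8 threshold is an arbitrary example, and the per-result success field is an assumption that may differ across judgeval versions.

```python
from judgeval import JudgmentClient

client = JudgmentClient()

def check_production_behavior():
    # fetch_recent_interactions() is a hypothetical helper returning
    # QuestionAnswer examples built from recent production traffic.
    examples = fetch_recent_interactions(limit=100)

    # AccuracyScorer and HelpfulnessScorer are the rubrics defined above.
    results = client.run_evaluation(
        examples=examples,
        scorers=[AccuracyScorer(), HelpfulnessScorer()],
        project_name="production_monitoring",
    )

    # Alert when the pass rate drops below an example threshold (0.8).
    # NOTE: result fields vary by judgeval version; adapt this
    # aggregation to whatever your results actually expose.
    pass_rate = sum(1 for r in results if r.success) / len(results)
    if pass_rate < 0.8:
        # send_alert() is a hypothetical hook into your alerting system.
        send_alert(f"Agent behavior regression: pass rate {pass_rate:.0%}")
```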