Agentic Behavioral Scorers
Regression Testing
Use evals as regression tests in your CI pipelines
judgeval
enables you to unit test your agent against predefined tasks/inputs, with built-in support for common
unit testing frameworks like pytest
.
Quickstart
You can formulate evals as unit tests by checking if scorers exceed or fall below threshold values on a set of examples (test cases).
Setting assert_test=True
in client.run_evaluation()
runs evaluations as unit tests, raising an exception if the score falls below the defined threshold.
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer
client = JudgmentClient()
class CustomerRequest(Example):
request: str
response: str
class ResolutionScorer(ExampleScorer):
name: str = "Resolution Scorer"
async def a_score_example(self, example: CustomerRequest):
# Replace this logic with your own scoring logic
if "package" in example.response:
self.reason = "The response contains the word 'package'"
return 1
else:
self.reason = "The response does not contain the word 'package'"
return 0
example = CustomerRequest(request="Where is my package?", response="Your P*CKAG* will arrive tomorrow at 10:00 AM.")
res = client.run_evaluation(
examples=[example],
scorers=[ResolutionScorer()],
project_name="default_project",
assert_test=True
)
Pytest Integration
judgeval
integrates with pytest
so you don't have to write any additional scaffolding for your agent unit tests.
We'll reuse the code above and now expect a failure with pytest by running uv run pytest unit_test.py
:
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer
from judgeval.exceptions import JudgmentTestError
import pytest
client = JudgmentClient()
class CustomerRequest(Example):
request: str
response: str
class ResolutionScorer(ExampleScorer):
name: str = "Resolution Scorer"
async def a_score_example(self, example: CustomerRequest):
# Replace this logic with your own scoring logic
if "package" in example.response:
self.reason = "The response contains the word 'package'"
return 1
else:
self.reason = "The response does not contain the word 'package'"
return 0
example = CustomerRequest(request="Where is my package?", response="Your P*CKAG* will arrive tomorrow at 10:00 AM.")
def test_agent_behavior():
with pytest.raises(JudgmentTestError):
client.run_evaluation(
examples=[example],
scorers=[ResolutionScorer()],
project_name="default_project",
assert_test=True
)