Regression Testing
Use evals as regression tests in your CI pipelines.
judgeval enables you to unit test your agent against predefined tasks/inputs, with built-in support for common
unit testing frameworks like pytest.
Quickstart
You can formulate evals as unit tests by checking if scorers exceed or fall below threshold values on a set of examples (test cases).
Setting assert_test=True in client.run_evaluation() runs evaluations as unit tests, raising an exception if the score falls below the defined threshold.
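Conceptually, this turns each score into a pass/fail check. The following is a minimal sketch of that idea in plain Python, not judgeval's implementation: ScoredExample, ThresholdError, and assert_scores are hypothetical names, and the default threshold of 0.5 is an arbitrary value chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScoredExample:
    """One test case together with the score a scorer gave it (hypothetical type)."""
    name: str
    score: float


class ThresholdError(AssertionError):
    """Raised when at least one example scores below the threshold (hypothetical)."""


def assert_scores(results: list[ScoredExample], threshold: float = 0.5) -> None:
    """Raise ThresholdError naming every example whose score is below `threshold`."""
    failed = [r for r in results if r.score < threshold]
    if failed:
        details = ", ".join(f"{r.name} (score={r.score})" for r in failed)
        raise ThresholdError(f"{len(failed)}/{len(results)} failed: {details}")


# A passing run returns silently; a failing run raises.
assert_scores([ScoredExample("greeting", 1.0)])
try:
    assert_scores([ScoredExample("package lookup", 0.0)])
except ThresholdError as err:
    print(err)
```

Because the failure surfaces as an exception, any test runner that treats an uncaught exception as a failed test can act as the CI gate. The judgeval version of the same idea, using a custom scorer, looks like this: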
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer

client = JudgmentClient()


class CustomerRequest(Example):
    request: str
    response: str


class ResolutionScorer(ExampleScorer):
    name: str = "Resolution Scorer"

    async def a_score_example(self, example: CustomerRequest):
        # Replace this logic with your own scoring logic
        if "package" in example.response:
            self.reason = "The response contains the word 'package'"
            return 1
        else:
            self.reason = "The response does not contain the word 'package'"
            return 0


example = CustomerRequest(request="Where is my package?", response="Your P*CKAG* will arrive tomorrow at 10:00 AM.")

res = client.run_evaluation(
    examples=[example],
    scorers=[ResolutionScorer()],
    project_name="default_project",
    assert_test=True
)

If an example fails, the test will report the failure like this:
================================================================================
⚠️ TEST RESULTS: 0/1 passed (1 failed)
================================================================================
✗ Test 1: FAILED
Scorer: Resolution Scorer
Score: 0.0
Reason: The response does not contain the word 'package'

Pytest Integration
judgeval integrates with pytest so you don't have to write any additional scaffolding for your agent unit tests.
We'll reuse the code above and this time expect the failure inside a pytest test. Run it with uv run pytest unit_test.py:
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer
from judgeval.exceptions import JudgmentTestError
import pytest

client = JudgmentClient()


class CustomerRequest(Example):
    request: str
    response: str


class ResolutionScorer(ExampleScorer):
    name: str = "Resolution Scorer"

    async def a_score_example(self, example: CustomerRequest):
        # Replace this logic with your own scoring logic
        if "package" in example.response:
            self.reason = "The response contains the word 'package'"
            return 1
        else:
            self.reason = "The response does not contain the word 'package'"
            return 0


example = CustomerRequest(request="Where is my package?", response="Your P*CKAG* will arrive tomorrow at 10:00 AM.")


def test_agent_behavior():
    with pytest.raises(JudgmentTestError):
        client.run_evaluation(
            examples=[example],
            scorers=[ResolutionScorer()],
            project_name="default_project",
            assert_test=True
        )
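
The keyword rule inside ResolutionScorer is ordinary Python, so its expected outcomes can also be pinned down without calling the client. The sketch below assumes that is acceptable for your setup: keyword_score is a hypothetical helper that copies the rule and is not part of judgeval. The tests use bare asserts, which pytest collects from any function whose name starts with test_.

```python
def keyword_score(response: str, keyword: str = "package") -> int:
    """Hypothetical helper: 1 if `keyword` appears in `response`, else 0.

    Mirrors the case-sensitive substring check used by ResolutionScorer above.
    """
    return 1 if keyword in response else 0


def test_masked_response_scores_zero():
    # The masked word "P*CKAG*" does not contain "package", so the score is 0.
    assert keyword_score("Your P*CKAG* will arrive tomorrow at 10:00 AM.") == 0


def test_plain_response_scores_one():
    assert keyword_score("Your package will arrive tomorrow at 10:00 AM.") == 1


def test_check_is_case_sensitive():
    # "Package" with a capital P is not matched by the lowercase keyword.
    assert keyword_score("Package delivered.") == 0
```

The first test shows why the example in this section fails the evaluation: the response never contains the literal string "package".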