Regression Testing

Use evals as regression tests in your CI pipelines.

judgeval enables you to unit test your agent against predefined tasks/inputs, with built-in support for common unit testing frameworks like pytest.

Quickstart

You can formulate evals as unit tests by checking whether each scorer's score meets its threshold on a set of examples (test cases).

Setting assert_test=True in client.run_evaluation() runs the evaluation as a unit test, raising an exception if any score falls below its defined threshold.

unit_test.py
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer

client = JudgmentClient()

class CustomerRequest(Example):
    request: str
    response: str

class ResolutionScorer(ExampleScorer):
    name: str = "Resolution Scorer"

    async def a_score_example(self, example: CustomerRequest):
        # Replace this logic with your own scoring logic
        if "package" in example.response:
            self.reason = "The response contains the word 'package'"
            return 1
        else:
            self.reason = "The response does not contain the word 'package'"
            return 0

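# Note: the response below deliberately avoids the literal word "package"
# ("P*CKAG*" instead), so the scorer returns 0 and the assertion fails.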
example = CustomerRequest(request="Where is my package?", response="Your P*CKAG* will arrive tomorrow at 10:00 AM.")

res = client.run_evaluation(
    examples=[example],
    scorers=[ResolutionScorer()],
    project_name="default_project",
    assert_test=True
)

If an example fails, the test will report the failure like this:

Test output

================================================================================
⚠️ TEST RESULTS: 0/1 passed (1 failed)
================================================================================

 Test 1: FAILED
Scorer: Resolution Scorer
Score: 0.0
Reason: The response does not contain the word 'package'
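
The pass/fail cutoff is the scorer's threshold. As a minimal sketch (assuming ExampleScorer exposes a threshold field like judgeval's built-in scorers; PolitenessScorer and its marker list are hypothetical), a scorer that returns fractional scores can set its own cutoff:

class PolitenessScorer(ExampleScorer):
    name: str = "Politeness Scorer"
    threshold: float = 0.7  # assumed field, mirroring judgeval's built-in scorers

    async def a_score_example(self, example: CustomerRequest):
        # Hypothetical partial-credit scoring: one point per courtesy marker found.
        markers = ["please", "thank", "sorry"]
        hits = sum(marker in example.response.lower() for marker in markers)
        self.reason = f"Found {hits}/{len(markers)} courtesy markers"
        return hits / len(markers)

With threshold=0.7, only a response containing all three markers scores high enough to pass the assertion.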

Unit tests are treated as evals, and their results are saved to your project on the Judgment platform.

Pytest Integration

judgeval integrates with pytest so you don't have to write any additional scaffolding for your agent unit tests.

We'll reuse the code above, this time wrapping the evaluation in a test that expects the failure. Run it with uv run pytest unit_test.py:

unit_test.py
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer
from judgeval.exceptions import JudgmentTestError
import pytest

client = JudgmentClient()

class CustomerRequest(Example):
    request: str
    response: str

class ResolutionScorer(ExampleScorer):
    name: str = "Resolution Scorer"

    async def a_score_example(self, example: CustomerRequest):
        # Replace this logic with your own scoring logic
        if "package" in example.response:
            self.reason = "The response contains the word 'package'"
            return 1
        else:
            self.reason = "The response does not contain the word 'package'"
            return 0

example = CustomerRequest(request="Where is my package?", response="Your P*CKAG* will arrive tomorrow at 10:00 AM.")

def test_agent_behavior():
    with pytest.raises(JudgmentTestError):
        client.run_evaluation(
            examples=[example],
            scorers=[ResolutionScorer()],
            project_name="default_project",
            assert_test=True
        )
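
Conversely, a test that should pass needs no pytest.raises wrapper: when every score meets its threshold, run_evaluation returns normally and pytest records the test as passing. A minimal sketch reusing the scorer above:

def test_agent_resolves_request():
    passing_example = CustomerRequest(
        request="Where is my package?",
        response="Your package will arrive tomorrow at 10:00 AM.",
    )
    # No exception is raised: the scorer returns 1, which meets the threshold.
    client.run_evaluation(
        examples=[passing_example],
        scorers=[ResolutionScorer()],
        project_name="default_project",
        assert_test=True,
    )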