Unit Testing

judgeval enables you to unit test your agent against predefined tasks/inputs, with built-in support for common unit testing frameworks like pytest.

Quickstart

You can formulate evals as unit tests by checking whether scorer results meet threshold values on a set of examples (test cases).

The client.assert_test() method runs evaluations as unit tests, raising an exception if a score falls below the defined threshold.

unit_test.py

from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

client = JudgmentClient()

agent = ...  # your agent
task = "What if these shoes don't fit?"
example = Example(
    input=task,
    actual_output=agent.run(task),  # e.g. "We offer a 30-day full refund at no extra cost."
    retrieval_context=["All customers are eligible for a 44 day full refund at no extra cost."],
)

scorer = FaithfulnessScorer(threshold=0.5)
client.assert_test(
    examples=[example],
    scorers=[scorer],
    model="gpt-4.1",
    project_name="my-first-test"
)

Let's break down how assert_test() works:

assert_test() takes each example from examples and applies every scorer in scorers to it. In this case, the FaithfulnessScorer checks the example's actual_output against its retrieval_context. If any score falls below the scorer's defined threshold, the test fails and an exception is raised.
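The pass/fail rule can be sketched in plain Python. This is only an illustration of threshold checking, not judgeval's implementation: ScoreResult and assert_all_passed are hypothetical names, and treating a score equal to the threshold as passing is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ScoreResult:
    """Hypothetical container for one scorer's verdict on one example."""
    scorer_name: str
    score: float
    threshold: float

    @property
    def passed(self) -> bool:
        # Assumption: a score equal to the threshold counts as passing.
        return self.score >= self.threshold


def assert_all_passed(results: list[ScoreResult]) -> None:
    """Raise AssertionError naming every result whose score is below its threshold."""
    failures = [r for r in results if not r.passed]
    if failures:
        details = ", ".join(
            f"{r.scorer_name}: {r.score:.2f} < {r.threshold:.2f}" for r in failures
        )
        raise AssertionError(
            f"{len(failures)}/{len(results)} checks failed ({details})"
        )
```

With a Faithfulness score of 0.0 against a threshold of 0.5, as in the example above, this sketch would raise; with a score of 0.9 it would return silently.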

If an example fails, the test will report the failure like this:

============================================================================
⚠️  TEST RESULTS: 0/1 passed (1 failed)
============================================================================

✗ Test 1: FAILED
  Scorer: Faithfulness
    Score: 0.0
    Reason: The score is 0.00 because the output repeatedly misrepresented the refund policy, incorrectly stating it as 30 days instead of the 44 days specified in the retrieval context. None of the claims about a 30-day full refund are supported by the actual retrieval context.
  ----------------------------------------

=============================================================================

Unit tests are treated as evals and the results are saved to your projects on the Judgment platform.

Pytest Integration

judgeval integrates with pytest so you don't have to write any additional scaffolding for your agent unit tests. We'll reuse the code above and now expect a failure with pytest:

unit_test.py

import pytest
...
with pytest.raises(AssertionError):
    client.assert_test(
        examples=[example],
        scorers=[scorer],
        model="gpt-4.1",
    )
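For pytest to collect the check, the call would normally live inside a function whose name starts with test_. The sketch below shows that shape without calling judgeval: run_agent_eval is a hypothetical stand-in for the client.assert_test(...) call and simply raises the AssertionError a failing eval is expected to produce.

```python
import pytest


def run_agent_eval() -> None:
    """Hypothetical stand-in for the client.assert_test(...) call above."""
    # Simulates a failing eval; the real call would raise on a low score.
    raise AssertionError("Faithfulness score below threshold")


def test_refund_answer_is_flagged() -> None:
    # pytest collects functions named test_*; this test passes only if
    # the wrapped call raises AssertionError.
    with pytest.raises(AssertionError):
        run_agent_eval()
```

Running pytest on a file containing this function should report one passing test, since the expected failure occurs.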

Test Suites From YAML

If you have a large number of tests, you may find it convenient to store your test cases in a YAML file:

# tests.yaml
examples:
  - input:
      "What if these shoes don't fit?"
    expected_output:
      "We offer a 30-day full refund at no extra cost."
    retrieval_context:
      - "All customers are eligible for a 44 day full refund at no extra cost."

Then run the evaluation using the YAML file:

from judgeval.utils.file_utils import get_examples_from_yaml

client.assert_test(
    examples=get_examples_from_yaml("tests.yaml"),
    scorers=[FaithfulnessScorer(threshold=0.5)],
    model="gpt-4.1",
)

Advanced

Create unit tests for scorers that operate at the trace level

See the API reference for assert_test()