
CI/CD Pipelines & Unit Testing

CI pipelines are at the core of every mature software engineering practice.

With LLMs, developers should expect nothing less. Using judgeval, you can easily unit test your LLM applications for consistency and quality on any metric of your choice.

Unit testing is natively supported in judgeval through the client.assert_test (Python) or client.assertTest (TypeScript) method. It also integrates with popular testing frameworks like pytest (Python) or jest/vitest (TypeScript), so you won't have to learn any new testing frameworks!


Single Step Testing

import pytest
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

def test_faithfulness():
    client = JudgmentClient()
    
    example = Example(
        input="What is the capital of France?",
        actual_output="The capital of France is Lyon.", # Hallucinated output
        retrieval_context=["Come tour Paris' museums in the capital of France!"],
    )

    # Example contains a hallucination, so we should expect an exception/assertion error
    # when the threshold is 1.0 (expecting perfect faithfulness)
    with pytest.raises(AssertionError):
        client.assert_test(
            eval_run_name="test_faithfulness_fail",
            examples=[example],
            scorers=[FaithfulnessScorer(threshold=1.0)],
            model="gpt-4o" # Added model parameter
        )
    
    # This should pass as the threshold is low
    client.assert_test(
        eval_run_name="test_faithfulness_pass",
        examples=[example],
        scorers=[FaithfulnessScorer(threshold=0.1)],
        model="gpt-4.1"
    )
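Assuming the test above lives in a file such as test_faithfulness.py (a hypothetical filename), you run it like any other pytest test:

pytest test_faithfulness.py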

judgeval integrates naturally into your CI pipelines, allowing you to execute robust unit tests across your entire codebase and catch regressions in your LLM applications before they make it to production!
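For example, a minimal GitHub Actions workflow could run your judgeval tests on every push. The sketch below is illustrative only; the file path, Python version, test directory, and secret names (e.g. JUDGMENT_API_KEY, JUDGMENT_ORG_ID) are assumptions you should adapt to your own setup:

# .github/workflows/llm-tests.yml (illustrative)
name: LLM unit tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install judgeval pytest
      - run: pytest tests/
        env:
          JUDGMENT_API_KEY: ${{ secrets.JUDGMENT_API_KEY }}  # assumed secret name
          JUDGMENT_ORG_ID: ${{ secrets.JUDGMENT_ORG_ID }}    # assumed secret name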


Agentic Testing

Developers can also set up agentic unit testing using judgeval, ensuring that each agent routes to the correct tool.

As an example, we will use the ToolOrderScorer, which compares the tools the agent actually calls against the expected tool sequence. See the Tool Order Scorer documentation for more details.

Sample Implementation

from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import ToolOrderScorer

client = JudgmentClient()

# Define example with expected tool sequence
example = Example(
    input={"prompt": "What's the attraction and weather in Paris for early June?"},
    expected_tools=[
        {
            "tool_name": "get_attractions",
            "parameters": {
                "destination": "Paris"
            }
        },
        {
            "tool_name": "get_weather",
            "parameters": {
                "destination": "Paris",
                "start_date": "2025-06-01",
                "end_date": "2025-06-02"
            }
        }
    ])

scorer = ToolOrderScorer(exact_match=True)
agent = MyAgent()  # MyAgent is your own agent implementation (see the sketch below)

# Run evaluation
results = client.assert_test(
    examples=[example],
    scorers=[scorer],
    function=agent.run_agent,
)
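MyAgent here stands in for whatever agent you are testing; judgeval only needs a callable (agent.run_agent above) that accepts the example's input, which we assume is passed through as keyword arguments (prompt in this case). A minimal hypothetical sketch, purely illustrative and not part of the judgeval API:

class MyAgent:
    """Hypothetical travel agent exposing two tools: get_attractions and get_weather."""

    def get_attractions(self, destination: str) -> str:
        # Call your attractions API here (illustrative stub)
        return f"Top attractions in {destination}: the Louvre, the Eiffel Tower"

    def get_weather(self, destination: str, start_date: str, end_date: str) -> str:
        # Call your weather API here (illustrative stub)
        return f"Mild and sunny in {destination} from {start_date} to {end_date}"

    def run_agent(self, prompt: str) -> str:
        # A real agent would let an LLM decide which tools to call based on the prompt;
        # the scorer checks the tool calls it records against expected_tools.
        attractions = self.get_attractions("Paris")
        weather = self.get_weather("Paris", "2025-06-01", "2025-06-02")
        return f"{attractions}\n{weather}"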

You can also define your test cases in a YAML file:

# tests.yaml
examples:
  - input:
      prompt: "What's the attraction and weather in Paris for early June?"
    expected_tools:
      - tool_name: "get_attractions"
        parameters:
            destination: "Paris"
      - tool_name: "get_weather"
        parameters:
            destination: "Paris"
            start_date: "2025-06-01"
            end_date: "2025-06-02"

Then run the evaluation using the YAML file:

client.assert_test(
    test_file="tests.yaml",
    scorers=[scorer],
    function=agent.run_agent,
)