Quickstarts
This guide will help you get started with the essential components of Judgeval.
Installation
pip install judgeval
Our team is always making new releases of the judgeval package! To get the latest version, run pip install judgeval --upgrade.
You can follow our latest updates via our GitHub! If you enjoy using Judgeval, consider giving us a star ⭐!
Judgment API Keys
Our API keys give you access to the JudgmentClient and Tracer objects, which enable you to export your agent traces and evals, manage datasets, and visualize your agent's performance on the Judgment Platform.
To get your account and organization API keys, create an account for free on the Judgment Platform.
export JUDGMENT_API_KEY="your_key_here"
export JUDGMENT_ORG_ID="your_org_id_here"
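judgeval picks these credentials up from your environment when you create a JudgmentClient or Tracer. If you want a quick, purely optional sanity check before running anything:
import os

# Fail fast if the Judgment credentials are missing from the environment
assert os.getenv("JUDGMENT_API_KEY"), "JUDGMENT_API_KEY is not set"
assert os.getenv("JUDGMENT_ORG_ID"), "JUDGMENT_ORG_ID is not set"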
The Judgeval package natively integrates tracing, tests, and evals with the Judgment Platform. You can sign up for free and view your data on our dashboards in seconds!
Tracing
Traces enable you to monitor your agent system's execution and measure telemetry such as cost and latency, as well as quality metrics such as tool error rates, hallucination, and task completion.
from judgeval.common.tracer import Tracer, wrap
from openai import OpenAI

client = wrap(OpenAI())
judgment = Tracer(project_name="my_project")

@judgment.observe(span_type="tool")
def my_tool():
    return "Hello world!"

@judgment.observe(span_type="function")
def main():
    task_input = my_tool()
    res = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": f"{task_input}"}]
    )
    return res.choices[0].message.content

# Calling the observed function implicitly starts and saves the trace
main()
Congratulations! You've just created your first trace. You can view it on the Judgment Platform dashboard.
There are many benefits of monitoring your agent systems with judgeval tracing, including:
- Debugging agent workflows in seconds with full observability
- Using production workflow data to create experimental datasets for future improvement/optimization
- Tracking and creating Slack/Email alerts on any metric (e.g. latency, cost, tool error rates, etc.)
To learn more about judgeval's tracing module, click here.
Evaluation
You can run evals on predetermined examples with any of the judgeval scorers or your own custom scorers. Evals produce a score for each example, and you can run multiple scorers on the same examples.
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

client = JudgmentClient()

example = Example(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
)

scorer = FaithfulnessScorer(threshold=0.5)
results = client.run_evaluation(
    examples=[example],
    scorers=[scorer],
    model="gpt-4.1",
)
print(results)
Congratulations! Your evaluation should have passed. Let's break down what happened.
- The variable input mimics an input, and actual_output is a placeholder for what your agent system returns based on that input.
- The variable retrieval_context (Python) or context (Typescript) represents the retrieved context from your RAG knowledge base.
- FaithfulnessScorer(threshold=0.5) is a scorer that checks if the output is hallucinated relative to the retrieved context. The threshold is used in the context of unit testing.
- We chose gpt-4.1 as our judge model to measure faithfulness. Judgment Labs offers ANY judge model for your evaluation needs. Consider trying out our state-of-the-art Osiris judge models for your next evaluation!
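As noted above, you can attach multiple scorers to the same examples in a single call. Here is a minimal sketch, assuming AnswerRelevancyScorer is also available in judgeval.scorers alongside FaithfulnessScorer (swap in whichever scorers fit your use case):
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer, AnswerRelevancyScorer

client = JudgmentClient()

example = Example(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
)

# Each scorer produces its own score for the same example
results = client.run_evaluation(
    examples=[example],
    scorers=[FaithfulnessScorer(threshold=0.5), AnswerRelevancyScorer(threshold=0.5)],
    model="gpt-4.1",
)
print(results)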
To learn more about using the Judgment Client to run evaluations, click here.
Did you know that you can also run online evals on your traces in production? Click here to learn more.
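As a rough sketch of what that can look like, you attach a scorer inside an observed span so the score lands on the production trace. The async_evaluate call and its parameters below are assumptions for illustration, not an API confirmed by this guide; check the tracing docs for the exact interface in your judgeval version:
from judgeval.common.tracer import Tracer, wrap
from judgeval.scorers import AnswerRelevancyScorer
from openai import OpenAI

client = wrap(OpenAI())
judgment = Tracer(project_name="my_project")

@judgment.observe(span_type="function")
def answer_question(question: str) -> str:
    res = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": question}],
    )
    answer = res.choices[0].message.content
    # Assumed API: scores this span in the background and attaches
    # the result to the trace on the Judgment Platform
    judgment.async_evaluate(
        scorers=[AnswerRelevancyScorer(threshold=0.5)],
        input=question,
        actual_output=answer,
        model="gpt-4.1",
    )
    return answer

answer_question("What if these shoes don't fit?")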
Unit Testing
Judgeval natively integrates with pytest to enable you to use evals as unit tests in your CI pipelines. Here's an example of a simple tool calling agent that we can use to test our tool calling accuracy:
from judgeval.tracer import Tracer

judgment = Tracer(project_name="my_agent")

class MyAgent:  # sample agent, replace with your own
    @judgment.observe(span_type="tool")
    def get_attractions(self, destination: str) -> str:
        """Get attractions for a destination"""
        pass

    @judgment.observe(span_type="tool")
    def get_weather(self, destination: str, start_date: str, end_date: str) -> str:
        """Get weather forecast for a destination"""
        pass

    @judgment.observe(span_type="function")
    def run_agent(self, prompt: str) -> str:
        """Run the agent with the given prompt"""
        pass
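The methods above are stubs. For the unit test below to exercise real tool calls, run_agent has to actually invoke the tools, so here is a deliberately hard-coded stand-in (the class name and return strings are illustrative only; a real agent would use an LLM to choose tools and parameters):
class MyHardcodedAgent:  # illustrative stand-in, not a real agent
    @judgment.observe(span_type="tool")
    def get_attractions(self, destination: str) -> str:
        return f"Top attractions in {destination}: ..."

    @judgment.observe(span_type="tool")
    def get_weather(self, destination: str, start_date: str, end_date: str) -> str:
        return f"Forecast for {destination}, {start_date} to {end_date}: ..."

    @judgment.observe(span_type="function")
    def run_agent(self, prompt: str) -> str:
        # Parameters are hard-coded here; a real agent would parse them from the prompt
        attractions = self.get_attractions("Paris")
        weather = self.get_weather("Paris", "2025-06-01", "2025-06-02")
        return f"{attractions}\n{weather}"
If you pass MyHardcodedAgent() into the test below in place of MyAgent(), the produced tool-call sequence will match the expected one.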
The code below will invoke your agent to test whether the right tools are called given the input prompt.
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import ToolOrderScorer

client = JudgmentClient()

# Define example with expected tool-calling sequence
example = Example(
    input={"prompt": "What's the attraction and weather in Paris for early June?"},
    expected_tools=[
        {
            "tool_name": "get_attractions",
            "parameters": {
                "destination": "Paris"
            }
        },
        {
            "tool_name": "get_weather",
            "parameters": {
                "destination": "Paris",
                "start_date": "2025-06-01",
                "end_date": "2025-06-02"
            }
        }
    ]
)

scorer = ToolOrderScorer(exact_match=True)
agent = MyAgent()  # replace with your agent

results = client.assert_test(
    examples=[example],
    scorers=[scorer],
    function=agent.run_agent  # replace with your agent's method you want to test
)
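Since assert_test is designed to fail the run when a scorer does not pass, you can drop it straight into a pytest test function and let your CI surface the failure. A minimal sketch (the file and test names are illustrative, and the MyAgent import assumes it lives in a module of your own):
# test_agent_tools.py -- run with `pytest`
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import ToolOrderScorer

from my_agent import MyAgent  # hypothetical module containing your agent

def test_tool_calling_order():
    client = JudgmentClient()
    example = Example(
        input={"prompt": "What's the attraction and weather in Paris for early June?"},
        expected_tools=[
            {"tool_name": "get_attractions", "parameters": {"destination": "Paris"}},
            {
                "tool_name": "get_weather",
                "parameters": {
                    "destination": "Paris",
                    "start_date": "2025-06-01",
                    "end_date": "2025-06-02",
                },
            },
        ],
    )
    # assert_test fails the test if the scorer does not pass,
    # so pytest reports it like any other failing assertion
    client.assert_test(
        examples=[example],
        scorers=[ToolOrderScorer(exact_match=True)],
        function=MyAgent().run_agent,
    )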
Next Steps
Congratulations! You've just finished getting started with judgeval and the Judgment Platform.
For a deeper dive into using judgeval, learn more about evals, unit testing, and tracing!