Getting Started
This guide will help you learn the essential components of Judgeval.
Installation
pip install judgeval
Judgeval runs evaluations that you can manage inside the library. Additionally, you can analyze and manage your evaluations, datasets, and metrics on the natively-integrated Judgment Platform, an all-in-one suite for LLM system evaluation.
We frequently update the judgeval package! To get the latest Python version, run pip install --upgrade judgeval. To get the latest TypeScript version, run npm update judgeval. You can follow our latest updates via our GitHub.
Judgment API Keys
Our API keys give you access to the JudgmentClient and Tracer, which enable you to track your agents, run evaluations on Judgment Labs' infrastructure, access our state-of-the-art judge models, and manage your evaluations and datasets on the Judgment Platform.
To get your account and organization API keys, create an account on the Judgment Platform.
export JUDGMENT_API_KEY="your_key_here"
export JUDGMENT_ORG_ID="your_org_id_here"
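If you want to confirm the keys are visible to your process before running anything, a quick sanity check like the following works; it relies only on the variable names from the export commands above.
import os

# Optional sanity check: make sure both variables from the export commands
# above are set before constructing the JudgmentClient or Tracer.
for var in ("JUDGMENT_API_KEY", "JUDGMENT_ORG_ID"):
    if not os.environ.get(var):
        raise RuntimeError(f"{var} is not set")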
Create Your First Experiment
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer
client = JudgmentClient()
example = Example(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
)
scorer = FaithfulnessScorer(threshold=0.5)
results = client.run_evaluation(
    examples=[example],
    scorers=[scorer],
    model="gpt-4o",
)
print(results)
Congratulations! Your evaluation should have passed. Let's break down what happened.
- The variable input mimics a user input and actual_output is a placeholder for what your LLM system returns based on the input.
- The variable retrieval_context (Python) or context (TypeScript) represents the retrieved context from your RAG knowledge base.
- FaithfulnessScorer(threshold=0.5) is a scorer that checks if the output is hallucinated relative to the retrieved context. The threshold is used in the context of unit testing (see the sketch after this list).
- We chose gpt-4o as our judge model to measure faithfulness. Judgment Labs offers any judge model for your evaluation needs. Consider trying out our state-of-the-art Osiris judge models for your next evaluation!
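Because every scorer carries a threshold, the same run can double as a unit test in CI. Here is a minimal sketch, assuming run_evaluation returns a list of per-example results and that each result exposes a success flag (an assumption; check the result objects in your installed version for the exact field name):
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

def test_faithfulness():
    client = JudgmentClient()
    example = Example(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra cost.",
        retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
    )
    results = client.run_evaluation(
        examples=[example],
        scorers=[FaithfulnessScorer(threshold=0.5)],
        model="gpt-4o",
    )
    # Fail the test if any example scored below the 0.5 threshold.
    # `success` is an assumed attribute name on the result objects.
    assert all(getattr(r, "success", False) for r in results)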
Create Your First Trace
judgeval traces enable you to monitor your LLM systems in online development and production stages.
Traces enable you to track your LLM system's flow end-to-end and measure:
- LLM costs
- Workflow latency
- Quality metrics, such as hallucination, retrieval quality, and more.
from judgeval.common.tracer import Tracer, wrap
from openai import OpenAI
client = wrap(OpenAI())
judgment = Tracer(project_name="my_project")
@judgment.observe(span_type="tool")
def my_tool():
return "Hello world!"
@judgment.observe(span_type="function")
def main():
task_input = my_tool()
res = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": f"{task_input}"}]
)
return res.choices[0].message.content
# Calling the observed function implicitly starts and saves the trace
main()
Congratulations! You've just created your first trace. It should look like this:
There are many benefits of monitoring your LLM systems with judgeval tracing, including:
- Debugging LLM workflows in seconds with full observability
- Using production workflow data to create experimental datasets for future improvement/optimization
- Tracking and creating Slack/Email alerts on any metric (e.g. latency, cost, hallucination, etc.)
To learn more about judgeval's tracing module, click here.
Automatic Deep Tracing
Judgeval supports automatic deep tracing, which significantly reduces the amount of instrumentation needed in your code. With deep tracing enabled (which is the default), you only need to observe top-level functions, and all nested function calls will be automatically traced.
from judgeval.tracer import Tracer, wrap
from openai import OpenAI
client = wrap(OpenAI())
judgment = Tracer(project_name="my_project")
# Define a function that will be automatically traced when called from main
def helper_function():
    return "This will be traced automatically"

# Only need to observe the top-level function
@judgment.observe(span_type="function")
def main():
    # helper_function will be automatically traced without @observe
    result = helper_function()
    res = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": result}]
    )
    return res.choices[0].message.content
main()
To disable deep tracing, initialize the tracer with deep_tracing=False. You can still name and declare span types for each function using judgment.observe().
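For example, here is the earlier snippet rewritten with deep tracing disabled; every function you want to appear in the trace now needs its own @judgment.observe decorator. This is a minimal sketch reusing the imports and span types shown above.
from judgeval.tracer import Tracer, wrap
from openai import OpenAI

client = wrap(OpenAI())
# Deep tracing disabled: only explicitly observed functions create spans.
judgment = Tracer(project_name="my_project", deep_tracing=False)

@judgment.observe(span_type="tool")
def helper_function():
    return "Observed explicitly because deep tracing is off"

@judgment.observe(span_type="function")
def main():
    result = helper_function()
    res = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": result}]
    )
    return res.choices[0].message.content

main()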
Create Your First Online Evaluation
In addition to tracing, judgeval allows you to run online evaluations on your LLM systems. This enables you to:
- Catch real-time quality regressions to take action before customers are impacted
- Gain insights into your agent performance in real-world scenarios
To run an online evaluation, you can simply add one line of code to your existing trace:
from judgeval.common.tracer import Tracer, wrap
from judgeval.scorers import AnswerRelevancyScorer
from judgeval.data import Example
from openai import OpenAI
client = wrap(OpenAI())
judgment = Tracer(project_name="my_project")
@judgment.observe(span_type="tool")
def my_tool():
return "Hello world!"
@judgment.observe(span_type="function")
def main():
task_input = my_tool()
res = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": f"{task_input}"}]
).choices[0].message.content
example = Example(
input=task_input,
actual_output=res
)
# In Python, this likely operates on the implicit trace context
judgment.async_evaluate(
scorers=[AnswerRelevancyScorer(threshold=0.5)],
example=example,
model="gpt-4o"
)
return res
main()
Online evaluations are automatically logged to the Judgment Platform as part of your traces. You can view them by navigating to your trace and clicking on the trace span that contains the online evaluation. If there is a quality regression, the UI will display an alert, like this:
Optimizing Your LLM System
Evaluation and monitoring are the building blocks for optimizing LLM systems. Measuring the quality of your LLM workflows allows you to compare design iterations and ultimately find the optimal set of prompts, models, RAG architectures, etc. that make your LLM excel in your production use cases.
A typical experimental setup might look like this:
- Create a new Project in the Judgment Platform, either by running an evaluation from the SDK or via the platform UI. This will help you keep track of all evaluations and traces for different iterations of your LLM system.
- Create separate Experiments for different iterations of your LLM system, allowing you to independently test each component.
- Try different models (e.g. gpt-4o, claude-3-5-sonnet, etc.) and prompt templates in each Experiment to find the optimal setup for your LLM system (see the sketch below).
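As a minimal sketch of that comparison loop, assuming you generate answers with two candidate models for the system under test and score each with the same scorer; the candidate model names and the system prompt here are illustrative only.
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer
from openai import OpenAI

openai_client = OpenAI()
judgment_client = JudgmentClient()

question = "What if these shoes don't fit?"
retrieval_context = ["All customers are eligible for a 30 day full refund at no extra cost."]

# Candidate setups for the system under test (illustrative model names).
for candidate_model in ["gpt-4o", "gpt-4o-mini"]:
    answer = openai_client.chat.completions.create(
        model=candidate_model,
        messages=[
            {"role": "system", "content": f"Answer using only this context: {retrieval_context[0]}"},
            {"role": "user", "content": question},
        ],
    ).choices[0].message.content

    results = judgment_client.run_evaluation(
        examples=[Example(input=question, actual_output=answer, retrieval_context=retrieval_context)],
        scorers=[FaithfulnessScorer(threshold=0.5)],
        model="gpt-4o",  # judge model, as in the first experiment above
    )
    print(candidate_model, results)
Comparing the results across candidates, either in the printed output or side by side on the Judgment Platform, shows which setup holds up best against the same scorer and data.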
Next Steps
Congratulations! You've just finished getting started with judgeval and the Judgment Platform.
For a deeper dive into using judgeval, learn more about experiments, unit testing, and monitoring!