Introduction
Overview
Evaluation is the process of scoring an LLM system's outputs with metrics; an evaluation is composed of:
- An evaluation dataset
- Metrics we are interested in tracking
Examples
In `judgeval`, an `Example` is a unit of data that allows you to use evaluation scorers on your LLM system.
```python
from judgeval.data import Example

example = Example(
    input="Who founded Microsoft?",
    actual_output="Bill Gates and Paul Allen.",
    retrieval_context=["Bill Gates co-founded Microsoft with Paul Allen in 1975."],
)
```
In this example, `input` represents a user talking with a RAG-based LLM application, `actual_output` is the output of your chatbot, and `retrieval_context` (Python) or `context` (TypeScript) is the retrieved context. This creates an `Example` that can be used in an evaluation. To learn more about the `Example` class, click here.

Creating an `Example` allows you to evaluate using `judgeval`'s default scorers:
```python
from judgeval import JudgmentClient
from judgeval.scorers import FaithfulnessScorer
from judgeval.data import Example

# Assume example is defined as above
example = Example(
    input="Who founded Microsoft?",
    actual_output="Bill Gates and Paul Allen.",
    retrieval_context=["Bill Gates co-founded Microsoft with Paul Allen in 1975."],
)

client = JudgmentClient()
faithfulness_scorer = FaithfulnessScorer(threshold=0.5)

results = client.run_evaluation(
    examples=[example],
    scorers=[faithfulness_scorer],
    model="gpt-4o",
)

# You can also run evaluations asynchronously like so:
results = client.a_run_evaluation(
    examples=[example],
    scorers=[faithfulness_scorer],
    model="gpt-4o",
)

print(results)
```
Datasets
An Evaluation Dataset is a collection of Examples. It provides an interface for running scaled evaluations of your LLM system using one or more scorers.
```python
from judgeval.data import Example
from judgeval.data.datasets import EvalDataset

example1 = Example(input="...", actual_output="...", retrieval_context=["..."])
example2 = Example(input="...", actual_output="...", retrieval_context=["..."])

dataset = EvalDataset(examples=[example1, example2])
```
An `EvalDataset` can be saved to and loaded from disk in `csv`, `yaml`, and `json` format, or uploaded to the Judgment platform.
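As a rough sketch of what that workflow can look like, the snippet below saves a dataset to disk and pushes it to the platform. The helper names (`save_as`, `add_from_json`, `push_dataset`, `pull_dataset`) are assumptions for illustration, not confirmed API; see the EvalDataset docs for the exact methods.

```python
# Illustrative sketch only -- the helper names below are assumptions,
# not confirmed judgeval API; consult the EvalDataset docs.
from judgeval import JudgmentClient
from judgeval.data.datasets import EvalDataset

# Assume dataset is defined as above
dataset.save_as(file_type="json", dir_path="./datasets")    # or "csv" / "yaml"

loaded = EvalDataset()
loaded.add_from_json(file_path="./datasets/dataset.json")   # hypothetical loader

# Push to / pull from the Judgment platform (names also assumed)
client = JudgmentClient()
client.push_dataset(alias="my_dataset", dataset=dataset)
dataset = client.pull_dataset(alias="my_dataset")
```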
To learn more about working with `EvalDataset`s, please see the EvalDataset docs.

Then, you can run evaluations on the dataset:
```python
from judgeval import JudgmentClient
from judgeval.scorers import FaithfulnessScorer

# Assume dataset is defined as above
client = JudgmentClient()
scorer = FaithfulnessScorer(threshold=0.5)

results = client.run_evaluation(
    examples=dataset.examples,
    scorers=[scorer],
    model="Qwen/Qwen2.5-72B-Instruct-Turbo",
)
```
Metrics
`judgeval` comes with a set of 10+ built-in evaluation metrics. These metrics are accessible through `judgeval`'s `Scorer` interface.
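Because every built-in metric is exposed as a `Scorer`, you can also combine several of them in a single evaluation run. Here is a minimal sketch; `AnswerRelevancyScorer` is assumed to be one of the built-in scorers, so verify the exact names against the metrics docs.

```python
# AnswerRelevancyScorer is assumed to be one of the built-in scorers;
# verify the exact scorer names against the metrics docs.
from judgeval import JudgmentClient
from judgeval.scorers import FaithfulnessScorer, AnswerRelevancyScorer

# Assume example is defined as above
client = JudgmentClient()
results = client.run_evaluation(
    examples=[example],
    scorers=[FaithfulnessScorer(threshold=0.5), AnswerRelevancyScorer(threshold=0.5)],
    model="gpt-4o",
)
```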
Every `Scorer` has a `threshold` parameter that you can use in the context of unit testing your app.
```python
from judgeval.scorers import FaithfulnessScorer

scorer = FaithfulnessScorer(threshold=1.0)
```
You can use scorers to evaluate your LLM system's outputs by using `Example`s.
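For instance, a unit test can run an evaluation with a strict threshold and assert that every example passes. The sketch below assumes each result exposes a `success` flag; the actual attribute name may differ, so check the results object documented for `run_evaluation`.

```python
# Unit-test-style sketch: the `success` attribute on each result is an
# assumption about the result objects' shape, not confirmed API.
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

def test_chatbot_is_faithful():
    example = Example(
        input="Who founded Microsoft?",
        actual_output="Bill Gates and Paul Allen.",
        retrieval_context=["Bill Gates co-founded Microsoft with Paul Allen in 1975."],
    )
    client = JudgmentClient()
    results = client.run_evaluation(
        examples=[example],
        scorers=[FaithfulnessScorer(threshold=1.0)],  # require a perfect faithfulness score
        model="gpt-4o",
    )
    # Fail the test if any example scores below the threshold.
    assert all(result.success for result in results)
```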
Congratulations! 🎉
You've learned the basics of building and running evaluations with `judgeval`.

For a deep dive into all the metrics you can run using `judgeval` scorers, click here.