Introduction
Overview
Evaluation is the process of scoring an LLM system's outputs with metrics; an evaluation is composed of:
- An evaluation dataset
- Metrics we are interested in tracking
Examples
In `judgeval`, an `Example` is a unit of data that allows you to use evaluation scorers on your LLM system.
```python
from judgeval.data import Example

example = Example(
    input="Who founded Microsoft?",
    actual_output="Bill Gates and Paul Allen.",
    retrieval_context=["Bill Gates co-founded Microsoft with Paul Allen in 1975."],
)
```
In this example, `input` represents a user talking with a RAG-based LLM application, `actual_output` is the output of your chatbot, and `retrieval_context` (Python) or `context` (TypeScript) is the retrieved context. This creates an `Example` that can be used in an evaluation. To learn more about the `Example` class, click here.

Creating an `Example` allows you to evaluate using `judgeval`'s default scorers:
```python
from judgeval import JudgmentClient
from judgeval.scorers import FaithfulnessScorer
from judgeval.data import Example

# Assume example is defined as above
example = Example(
    input="Who founded Microsoft?",
    actual_output="Bill Gates and Paul Allen.",
    retrieval_context=["Bill Gates co-founded Microsoft with Paul Allen in 1975."],
)

client = JudgmentClient()
faithfulness_scorer = FaithfulnessScorer(threshold=0.5)

results = client.run_evaluation(
    examples=[example],
    scorers=[faithfulness_scorer],
    model="gpt-4o",
)

# You can also run evaluations asynchronously like so:
results = client.a_run_evaluation(
    examples=[example],
    scorers=[faithfulness_scorer],
    model="gpt-4o",
)

print(results)
```
Datasets
An Evaluation Dataset is a collection of Examples. It provides an interface for running scaled evaluations of your LLM system using one or more scorers.
```python
from judgeval.data import Example
from judgeval.data.datasets import EvalDataset

example1 = Example(input="...", actual_output="...", retrieval_context=["..."])
example2 = Example(input="...", actual_output="...", retrieval_context=["..."])

dataset = EvalDataset(examples=[example1, example2])
```
An `EvalDataset` can be saved to and loaded from disk in `csv`, `yaml`, and `json` format, or uploaded to the Judgment platform.
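As a rough sketch of what that workflow can look like, the snippet below saves a dataset to disk and pushes it to the platform. The helper names (`save_as`, `add_from_json`, `push_dataset`, `pull_dataset`) are assumptions for illustration, not confirmed API; see the EvalDataset docs for the exact methods.

```python
# Illustrative sketch only -- the helper names below are assumptions,
# not confirmed judgeval API; consult the EvalDataset docs.
from judgeval import JudgmentClient
from judgeval.data.datasets import EvalDataset

# Assume dataset is defined as above
dataset.save_as(file_type="json", dir_path="./datasets")    # or "csv" / "yaml"

loaded = EvalDataset()
loaded.add_from_json(file_path="./datasets/dataset.json")   # hypothetical loader

# Push to / pull from the Judgment platform (names also assumed)
client = JudgmentClient()
client.push_dataset(alias="my_dataset", dataset=dataset)
dataset = client.pull_dataset(alias="my_dataset")
```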
To learn more about working with `EvalDataset`s, please see the EvalDataset docs.

Then, you can run evaluations on the dataset:
```python
from judgeval import JudgmentClient
from judgeval.scorers import FaithfulnessScorer

# Assume dataset is defined as above
client = JudgmentClient()
scorer = FaithfulnessScorer(threshold=0.5)

results = client.run_evaluation(
    examples=dataset.examples,
    scorers=[scorer],
    model="Qwen/Qwen2.5-72B-Instruct-Turbo",
)
```
Metrics
`judgeval` comes with a set of 10+ built-in evaluation metrics. These metrics are accessible through `judgeval`'s `Scorer` interface.
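Because every built-in metric is exposed as a `Scorer`, you can also combine several of them in a single evaluation run. Here is a minimal sketch; `AnswerRelevancyScorer` is assumed to be one of the built-in scorers, so verify the exact names against the metrics docs.

```python
# AnswerRelevancyScorer is assumed to be one of the built-in scorers;
# verify the exact scorer names against the metrics docs.
from judgeval import JudgmentClient
from judgeval.scorers import FaithfulnessScorer, AnswerRelevancyScorer

# Assume example is defined as above
client = JudgmentClient()
results = client.run_evaluation(
    examples=[example],
    scorers=[FaithfulnessScorer(threshold=0.5), AnswerRelevancyScorer(threshold=0.5)],
    model="gpt-4o",
)
```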
Every `Scorer` has a `threshold` parameter that you can use in the context of unit testing your app.
```python
from judgeval.scorers import FaithfulnessScorer

scorer = FaithfulnessScorer(threshold=1.0)
```
You can use scorers to evaluate your LLM system's outputs by using `Example`s.
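For instance, a unit test can run an evaluation with a strict threshold and assert that every example passes. The sketch below assumes each result exposes a `success` flag; the actual attribute name may differ, so check the results object documented for `run_evaluation`.

```python
# Unit-test-style sketch: the `success` attribute on each result is an
# assumption about the result objects' shape, not confirmed API.
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

def test_chatbot_is_faithful():
    example = Example(
        input="Who founded Microsoft?",
        actual_output="Bill Gates and Paul Allen.",
        retrieval_context=["Bill Gates co-founded Microsoft with Paul Allen in 1975."],
    )
    client = JudgmentClient()
    results = client.run_evaluation(
        examples=[example],
        scorers=[FaithfulnessScorer(threshold=1.0)],  # require a perfect faithfulness score
        model="gpt-4o",
    )
    # Fail the test if any example scores below the threshold.
    assert all(result.success for result in results)
```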
Congratulations! 🎉
You've learned the basics of building and running evaluations with `judgeval`.

For a deep dive into all the metrics you can run using `judgeval` scorers, click here.