Introduction
How to use and build evaluation metrics in your testing pipelines
Quickstart
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

client = JudgmentClient()

agent = ...  # your agent
task = "What if these shoes don't fit?"

example = Example(
    input=task,
    actual_output=agent.run(task),  # e.g. "We offer a 30-day full refund at no extra cost."
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
)

scorer = FaithfulnessScorer(threshold=0.5)
results = client.run_evaluation(
    examples=[example],
    scorers=[scorer],
    model="gpt-4.1",
)
print(results)
Evals in judgeval consist of three components:
Example objects contain the fields involved in the eval. You can group Example objects into Dataset objects for scaled testing and evals.
Scorer objects encode the methodology for scoring Example objects against an evaluation criterion.
When using LLM-as-a-judge evals, Judge objects load the specific LLM that acts as the judge model.
The sketch below shows how these pieces fit together in a single run.
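This is a minimal sketch using only the calls from the quickstart, with illustrative inputs: it scores two Example objects in one run and pins the judge model explicitly via the model argument. Grouping the same examples into a Dataset object follows the same idea at larger scale; refer to the Dataset documentation for its exact constructor.

from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

client = JudgmentClient()

# Two examples covering the same refund policy, scored together in one call.
examples = [
    Example(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra cost.",
        retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
    ),
    Example(
        input="Can I return shoes after two weeks?",
        actual_output="Yes, returns are accepted within 30 days for a full refund.",
        retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
    ),
]

# The model argument selects which LLM acts as the judge for this run.
results = client.run_evaluation(
    examples=examples,
    scorers=[FaithfulnessScorer(threshold=0.5)],
    model="gpt-4.1",
)
print(results)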
FAQ
How am I supposed to use evals?
If you're unsure how best to use evals, check out our suggested use cases.
What kind of evals can I use?
There are three main types of evals you can use: LLM-as-a-judge, code, and human annotation. To find the best fit for your use case, check out this table.
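As a rough illustration of the difference between the first two types (not judgeval's API for building custom scorers), the snippet below contrasts a deterministic code check with the LLM-as-a-judge scorer from the quickstart; the contains_refund_window helper is hypothetical. Human annotation is the third path, where reviewers label outputs directly.

from judgeval.scorers import FaithfulnessScorer

def contains_refund_window(output: str) -> bool:
    # Code eval: a deterministic check that runs without any model call.
    return "30-day" in output or "30 day" in output

# LLM-as-a-judge eval: a judge model grades the output against the retrieval context.
llm_scorer = FaithfulnessScorer(threshold=0.5)

output = "We offer a 30-day full refund at no extra cost."
print(contains_refund_window(output))  # True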
What metrics do you support?
You can find a list of plug-and-play metrics in our scorers section. Each of our scorer implementations is research-backed by Stanford and Berkeley AI labs and designed to work well out of the box.
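Here is a minimal sketch of combining plug-and-play scorers in one run. FaithfulnessScorer appears in the quickstart; AnswerRelevancyScorer is assumed here to follow the same threshold-based constructor, so check the scorers section for the exact class names available.

from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer, AnswerRelevancyScorer  # AnswerRelevancyScorer name assumed

client = JudgmentClient()
example = Example(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
)

# Each scorer grades the same example independently; results include one score per scorer.
results = client.run_evaluation(
    examples=[example],
    scorers=[FaithfulnessScorer(threshold=0.5), AnswerRelevancyScorer(threshold=0.5)],
    model="gpt-4.1",
)
print(results)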