Introduction to Judges

How to build and use judges to track agent behavioral regressions

Evaluation provides tools to measure agent behavior, prevent regressions, and maintain quality at scale. By combining code and LLM judges, datasets, and testing frameworks, you can systematically track behavioral drift before it impacts production.


LLM Judges

LLM Judges (Prompt Scorers) use natural language rubrics for LLM-as-judge evaluation.

  • Define criteria using plain language on the platform
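As a sketch of what LLM-as-judge evaluation does with such a rubric, the snippet below assembles a judge prompt from a plain-language criterion. The rubric text, function name, and prompt shape are all illustrative, not the platform's internal format:

```python
# Hypothetical sketch of LLM-as-judge prompting; the rubric and prompt
# layout are illustrative, not how the Judgment platform stores them.
RUBRIC = (
    "Score 1 if the response answers the question accurately; "
    "score 0 otherwise."
)

def build_judge_prompt(rubric: str, input_text: str, actual_output: str) -> str:
    """Assemble the evaluation prompt an LLM judge would receive."""
    return (
        f"You are an evaluator. Rubric: {rubric}\n"
        f"Question: {input_text}\n"
        f"Response: {actual_output}\n"
        "Reply with a single number, 0 or 1."
    )

prompt = build_judge_prompt(
    RUBRIC,
    "What is the capital of the United States?",
    "The capital of the U.S. is Washington, D.C.",
)
```

The judge model's numeric reply then becomes the example's score.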

Code Judges

Code Judges (Custom Scorers) implement arbitrary scoring logic in Python code.

  • Full flexibility with any LLM, library, or custom logic
  • Server-hosted execution for production monitoring
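To illustrate the kind of arbitrary logic a code judge can run, here is a minimal substring-match scorer. The function signature is hypothetical, not judgeval's Code Judge interface; real scorers can call any LLM, library, or custom logic:

```python
# Illustrative scoring logic only; the signature is hypothetical and
# not judgeval's Code Judge interface.
def score_contains_answer(actual_output: str, expected: str) -> float:
    """Return 1.0 if the expected fact appears in the output, else 0.0."""
    return 1.0 if expected.lower() in actual_output.lower() else 0.0

score_contains_answer("The capital of the U.S. is Washington, D.C.",
                      "washington, d.c.")  # → 1.0
score_contains_answer("I think it's New York City.",
                      "washington, d.c.")  # → 0.0
```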

Datasets

Datasets group examples for batch evaluation and team collaboration.

  • Import from JSON/YAML or export to HuggingFace
  • Automatically synced with the Judgment platform
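A JSON import might look like the sketch below, which parses records whose fields mirror what `Example.create()` takes in the quickstart. The field layout is an assumption; adapt it to your actual export format:

```python
import json

# Hypothetical JSON layout; adapt the field names to your export format.
raw = json.loads("""
[
  {"input": "What is the capital of the United States?",
   "actual_output": "The capital of the U.S. is Washington, D.C."}
]
""")

# Each record maps onto the fields Example.create() accepts.
examples = [
    {"input": r["input"], "actual_output": r["actual_output"]} for r in raw
]
```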

Regression Testing

Regression Testing runs evals as unit tests in CI/CD pipelines.

  • Integrates with pytest and standard testing frameworks
  • Automatically fails when scores drop below thresholds
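The threshold-gating idea can be sketched as a pytest-style test. The result values and the 0.8 threshold are illustrative; plug in the scores your eval run actually returns:

```python
# Hypothetical pytest-style regression gate; scores and threshold are
# illustrative stand-ins for real eval-run results.
THRESHOLD = 0.8

def assert_no_regression(scores: list[float], threshold: float = THRESHOLD) -> None:
    """Fail (raise AssertionError) if any score drops below the threshold."""
    failing = [s for s in scores if s < threshold]
    assert not failing, f"{len(failing)} example(s) scored below {threshold}"

def test_accuracy_does_not_regress():
    scores = [0.95, 0.88, 1.0]  # stand-in for real eval results
    assert_no_regression(scores)
```

Because the gate is a plain assertion, any pytest run in CI/CD fails the build automatically when scores drop.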

Quickstart

Build and test judges:

llm_judge_evaluation.py
from judgeval import Judgeval
from judgeval.v1.data.example import Example

client = Judgeval(project_name="default_project")

test_examples = [
    Example.create(
        input="What is the capital of the United States?",
        actual_output="The capital of the U.S. is Washington, D.C."  # correct
    ),
    Example.create(
        input="What is the capital of the United States?",
        actual_output="I think it's New York City."  # incorrect; should score low
    )
]

results = client.evaluation.create().run(
    examples=test_examples,
    scorers=["AccuracyScorer"],
    eval_run_name="accuracy_test"
)

Results are automatically saved to your project on the Judgment platform where you can analyze performance across different examples and iterate on your rubrics.

