
Introduction

Overview

Scorers act as measurement tools for evaluating LLM systems based on specific criteria. judgeval comes with 10+ built-in scorers that you can start using right away.

We're always adding new scorers to judgeval. If you have a suggestion, please let us know!

Scorers execute on Examples and EvalDatasets, producing a numerical score. This enables you to use evaluations as unit tests by setting a threshold to determine whether an evaluation was successful or not.
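
For instance, here is a minimal sketch of the unit-test pattern. The import paths and Example fields follow judgeval's typical usage, but check the Faithfulness scorer's page for its exact required fields:

from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

# An Example bundles the inputs and outputs of your LLM system that scorers measure.
example = Example(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    retrieval_context=["Paris is the capital and largest city of France."],
)

# The threshold turns the numerical score into a pass/fail unit test:
# the evaluation succeeds only if the score is >= 0.7.
scorer = FaithfulnessScorer(threshold=0.7)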

Categories of Scorers

judgeval supports three categories of scorers.

  • Default Scorers: Built-in scorers that are ready to use out of the box
  • Custom Scorers: Powerful scorers that you can tailor to your own LLM system
  • Classifier Scorers: A special type of custom scorer that evaluates your LLM system against natural language criteria

In this section, we'll cover each kind of scorer and how to use them.

Default Scorers

Most built-in scorers in judgeval are LLM systems, leveraging advanced prompt engineering to ensure evaluations are flexible, scalable, and aligned closely with human judgment. Each scorer provides transparent reasoning through detailed chains of thought, enabling clear validation of evaluation outcomes.

Introducing Osiris

To push the frontier of LLM evaluation further, we've developed Osiris, our state-of-the-art hallucination detection system. Created in collaboration with researchers from Judgment Labs and the Stanford Artificial Intelligence Laboratory (SAIL) under Professor Chris Manning, Osiris integrates leading-edge techniques:

  • Multi-agent system
  • Test-time scaling
  • Reinforcement learning
  • Data augmentation

These methods allow Osiris to significantly outperform traditional LLM-based evaluation methods—delivering higher accuracy in hallucination detection at substantially lower costs than closed-source models.

We’ll be releasing a report soon detailing some of the novel techniques we’ve developed—stay tuned!

We're actively extending Osiris's innovative techniques to our full suite of default scorers. Soon, you'll benefit from these improvements across all evaluation metrics, enhancing accuracy, reliability, and cost-effectiveness for your entire evaluation workflow.

Check out Osiris-powered evaluations now through our Faithfulness scorer!


All Default Scorers in judgeval:

Agentic Scorers

Agentic scorers are designed to evaluate multiple steps together within your LLM system—rather than evaluating one step at a time. They are especially useful for multi-step workflows, such as agents, chains, or graphs, where the quality of the overall process matters more than individual outputs.

By grouping related steps into a single evaluation, agentic scorers allow you to:

  • Measure end-to-end correctness
  • Evaluate goal completion across multiple steps
  • Assess coherence and strategy in long chains of reasoning

This makes them ideal for agentic systems, where individual tool calls or LLM responses should be judged as part of a larger goal-oriented behavior.

Here are the agentic scorers in judgeval:

Custom Scorers

If you find that none of the default scorers meet your evaluation needs, setting up a custom scorer is easy with judgeval. You can create a custom scorer by inheriting from the JudgevalScorer class and implementing three methods:

  • score_example(): produces a score for a single Example.
  • a_score_example(): async version of score_example(). You may use the same implementation logic as score_example().
  • _success_check(): determines whether an evaluation was successful.

Custom scorers can be as simple or complex as you want, and do not need to use LLMs. For sample implementations, check out the Custom Scorers documentation page.
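
Here is a rough sketch of a custom scorer that needs no LLM at all. The JudgevalScorer import path and constructor arguments are assumptions based on typical judgeval usage; defer to the Custom Scorers page for canonical implementations:

from judgeval.scorers import JudgevalScorer

class LengthScorer(JudgevalScorer):
    """Toy scorer: rewards concise answers (no LLM involved)."""

    def __init__(self, threshold: float = 0.5):
        # score_type/threshold constructor args are assumed; check JudgevalScorer's actual signature.
        super().__init__(score_type="Length", threshold=threshold)

    def score_example(self, example, *args, **kwargs):
        # Any scoring logic works here; this one maps answer length to a 0/1 score.
        self.score = 1.0 if len(example.actual_output) <= 200 else 0.0
        self.success = self.score >= self.threshold
        return self.score

    async def a_score_example(self, example, *args, **kwargs):
        # Async version; reusing the synchronous logic is fine.
        return self.score_example(example, *args, **kwargs)

    def _success_check(self) -> bool:
        return getattr(self, "success", False)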

Classifier Scorers

Classifier scorers are a special type of custom scorer that can evaluate your LLM system against natural language criteria.

They can be defined either through our judgeval SDK or directly on the Judgment Platform. For more information, check out the Classifier Scorers documentation page.
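
For a rough idea of the SDK route, here is a hedged sketch. The ClassifierScorer name and its conversation/options arguments are assumptions based on typical judgeval usage, so verify them against the Classifier Scorers page:

from judgeval.scorers import ClassifierScorer  # import path assumed

# Natural-language criterion: the judge model answers Y or N, and each choice
# maps to a score. {{actual_output}} is assumed to be filled in from the Example.
friendliness_scorer = ClassifierScorer(
    name="Friendliness",
    slug="friendliness-scorer",
    threshold=1.0,
    conversation=[
        {
            "role": "system",
            "content": "Is the following response friendly in tone? "
                       "Respond with Y or N.\n\nResponse: {{actual_output}}",
        }
    ],
    options={"Y": 1.0, "N": 0.0},
)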

Running Scorers

All scorers in judgeval can be run uniformly through the JudgmentClient. All scorers are set to run in async mode by default in order to support parallelized evaluations for large datasets.

from judgeval import JudgmentClient

...  # construct your Example(s) and scorer(s) here

client = JudgmentClient()  # assumes your Judgment API key is configured in your environment
results = client.run_evaluation(
    examples=[example],
    scorers=[scorer],
    model="gpt-4o",  # judge model used by LLM-based scorers
)
To learn how a specific default scorer works, check out its documentation page for a deep dive into how scores are calculated and which Example fields are required.