Introduction
Overview
Scorers act as measurement tools for evaluating LLM systems based on specific criteria.
judgeval comes with a set of 10+ built-in scorers that you can easily start with. We're always adding new scorers to judgeval, so if you have a suggestion, please let us know!
Scorers execute on Examples and EvalDatasets, producing a numerical score.
This enables you to use evaluations as unit tests by setting a threshold
to determine whether an evaluation was successful or not.
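As a rough illustration of the threshold idea (plain Python, not judgeval's API): a scorer's numeric output is compared against a threshold to decide pass or fail, which is what lets an evaluation behave like a unit test.
```python
# Conceptual sketch only, not judgeval's API: a threshold turns a numeric
# score into a pass/fail signal, so an evaluation can act as a unit test.
def evaluation_passes(score: float, threshold: float = 0.7) -> bool:
    return score >= threshold

assert evaluation_passes(0.85, threshold=0.7)      # scorer output clears the bar
assert not evaluation_passes(0.55, threshold=0.7)  # scorer output falls short
```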
Categories of Scorers
judgeval supports three categories of scorers.
- Default Scorers: Built-in scorers that are ready to use
- Custom Scorers: Powerful scorers that you can tailor to your own LLM system
- Classifier Scorers: A special type of custom scorer that evaluates your LLM system against natural language criteria
In this section, we'll cover each kind of scorer and how to use them.
Default Scorers
Most built-in scorers in judgeval are LLM systems, leveraging advanced prompt engineering to ensure evaluations are flexible, scalable, and aligned closely with human judgment. Each scorer provides transparent reasoning through detailed chains of thought, enabling clear validation of evaluation outcomes.
Introducing Osiris
To push the frontier of LLM evaluation further, we've developed Osiris, our state-of-the-art hallucination detection system. Created in collaboration with researchers from Judgment Labs and the Stanford Artificial Intelligence Laboratory (SAIL) under Professor Chris Manning, Osiris integrates leading-edge techniques:
- Multi-agent system
- Test-time scaling
- Reinforcement learning
- Data augmentation
These methods allow Osiris to significantly outperform traditional LLM-based evaluation methods—delivering higher accuracy in hallucination detection at substantially lower costs than closed-source models.
We’ll be releasing a report soon detailing some of the novel techniques we’ve developed—stay tuned!
We're actively extending Osiris's innovative techniques to our full suite of default scorers. Soon, you'll benefit from these improvements across all evaluation metrics, enhancing accuracy, reliability, and cost-effectiveness for your entire evaluation workflow.
Check out Osiris-powered evaluations now through our Faithfulness scorer!
All Default Scorers in judgeval:
- Answer Correctness
- Answer Relevancy
- Comparison
- Contextual Precision
- Contextual Recall
- Contextual Relevancy
- Execution Order
- Faithfulness
- Groundedness
- JSON Correctness
- Summarization
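Each default scorer is a class you instantiate with a threshold and pass to an evaluation run. A minimal sketch follows: FaithfulnessScorer is imported from judgeval's scorers module, while the second class name is assumed to follow the same MetricNameScorer naming pattern; check the individual scorer pages for exact names and parameters.
```python
from judgeval.scorers import AnswerRelevancyScorer, FaithfulnessScorer

# threshold is the minimum score (0 to 1) an example must reach to pass.
faithfulness = FaithfulnessScorer(threshold=0.8)
relevancy = AnswerRelevancyScorer(threshold=0.7)  # class name assumed from the naming pattern
```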
Agentic Scorers
Agentic scorers are designed to evaluate multiple steps together within your LLM system—rather than evaluating one step at a time. They are especially useful for multi-step workflows, such as agents, chains, or graphs, where the quality of the overall process matters more than individual outputs.
By grouping related steps into a single evaluation, agentic scorers allow you to:
- Measure end-to-end correctness
- Evaluate goal completion across multiple steps
- Assess coherence and strategy in long chains of reasoning
This makes them ideal for agentic systems, where individual tool calls or LLM responses should be judged as part of a larger goal-oriented behavior.
Here are the agentic scorers in judgeval:
Custom Scorers
If you find that none of the default scorers meet your evaluation needs, setting up a custom scorer is easy with judgeval.
You can create a custom scorer by inheriting from the JudgevalScorer class and implementing three methods:
- score_example(): produces a score for a single Example.
- a_score_example(): async version of score_example(). You may use the same implementation logic as score_example().
- _success_check(): determines whether an evaluation was successful.
Custom scorers can be as simple or complex as you want, and do not need to use LLMs. For sample implementations, check out the Custom Scorers documentation page.
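Here is a minimal sketch of what such a scorer can look like. It assumes JudgevalScorer is importable from judgeval.scorers, that the base constructor accepts a threshold, and that the score is stored on self.score; the word-budget logic is purely hypothetical, so treat this as an illustration rather than a drop-in implementation.
```python
from judgeval.data import Example
from judgeval.scorers import JudgevalScorer  # import path assumed


class WordBudgetScorer(JudgevalScorer):
    """Hypothetical scorer: rewards outputs that stay within a word budget."""

    def __init__(self, threshold: float = 0.5, max_words: int = 100):
        # Base-class constructor arguments are assumptions; check the
        # Custom Scorers page for the exact signature.
        super().__init__(score_type="Word Budget", threshold=threshold)
        self.max_words = max_words

    def score_example(self, example: Example) -> float:
        words = len(example.actual_output.split())
        # 1.0 while within budget, decaying as the output grows past it.
        self.score = min(1.0, self.max_words / max(words, 1))
        return self.score

    async def a_score_example(self, example: Example) -> float:
        # Same logic as score_example(), as noted above.
        return self.score_example(example)

    def _success_check(self) -> bool:
        return self.score >= self.threshold
```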
Classifier Scorers
Classifier scorers are a special type of custom scorer that evaluates your LLM system against natural language criteria.
They can be defined either through the judgeval SDK or directly on the Judgment Platform. For more information, check out the Classifier Scorers documentation page.
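As a sketch of the SDK route, assume ClassifierScorer can be imported from judgeval.scorers and configured with a name, a prompt-style conversation, and a mapping from the judge's answer choices to scores; the argument names and the {{actual_output}} placeholder syntax are assumptions, so confirm them on the Classifier Scorers documentation page.
```python
from judgeval.scorers import ClassifierScorer  # import path and arguments assumed

# Natural-language criterion: the judge model answers Y or N about the output.
tone_scorer = ClassifierScorer(
    name="Friendly Tone",
    threshold=0.5,
    conversation=[{
        "role": "system",
        "content": "Is this response friendly and professional? Answer Y or N: {{actual_output}}",
    }],
    options={"Y": 1.0, "N": 0.0},
)
```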
Running Scorers
All scorers in judgeval can be run uniformly through the JudgmentClient. They run in async mode by default to support parallelized evaluations over large datasets.
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

# Illustrative example data; substitute outputs from your own LLM system.
example = Example(input="What if these shoes don't fit?",
                  actual_output="We offer a 30-day full refund at no extra cost.",
                  retrieval_context=["All customers get a 30-day full refund."])
scorer = FaithfulnessScorer(threshold=0.5)
client = JudgmentClient()
results = client.run_evaluation(
examples=[example],
scorers=[scorer],
model="gpt-4o",
)