Custom Scorers
The `JudgevalScorer` is the abstraction for custom evaluation logic. Whether your evaluation logic is a simple algorithm, an LLM judge, or a multi-agent system, you can use a `JudgevalScorer` for evaluation.
Implement a JudgevalScorer
To implement your own custom scorer, you must:
Inherit from the `JudgevalScorer` class and name your scorer
This will help `judgeval` integrate your scorer into evaluation runs.
```python
from judgeval.scorers import JudgevalScorer

class SampleScorer(JudgevalScorer):
    ...

    @property
    def __name__(self):
        return "Sample Scorer"
```
Implement the `__init__()` method
`JudgevalScorer`s have required attributes that must be set in the `__init__()` method. For instance, you must set a `threshold` that determines what counts as success or failure for your scorer.
There are additional optional attributes that can be set here for even more flexibility:
| Attribute | Type | Description |
|---|---|---|
| `score_type` | `str` | The name of your scorer. This is displayed in the Judgment platform. |
| `include_reason` | `bool` | Whether your scorer includes a reason for the score in its results. Only applicable to LLM judge-based scorers. |
| `async_mode` | `bool` | Whether your scorer runs asynchronously during evaluations. |
| `strict_mode` | `bool` | Whether your scorer fails if the score is not perfect (1.0). |
| `verbose_mode` | `bool` | Whether your scorer produces verbose logs. |
| `custom_example` | `bool` | Whether your scorer runs on custom examples. |
```python
class SampleScorer(JudgevalScorer):
    def __init__(
        self,
        threshold=0.5,
        score_type="Sample Scorer",
        include_reason=True,
        async_mode=True,
        strict_mode=False,
        verbose_mode=True,
    ):
        super().__init__(score_type=score_type, threshold=threshold)
        self.threshold = 1 if strict_mode else threshold
        # Optional attributes
        self.include_reason = include_reason
        self.async_mode = async_mode
        self.strict_mode = strict_mode
        self.verbose_mode = verbose_mode
```
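For instance, the scorer defined above could be instantiated like this (the argument values are purely illustrative):

```python
# A stricter configuration with a higher passing bar and quieter logs.
scorer = SampleScorer(threshold=0.8, verbose_mode=False)

# strict_mode overrides the threshold, so any score below 1.0 fails.
strict_scorer = SampleScorer(strict_mode=True)
```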
Implement the `score_example()` and `a_score_example()` methods
The `score_example()` and `a_score_example()` methods take an `Example` object and execute your scorer to produce a `float` score between 0 and 1. Optionally, you can include a reason to accompany the score if applicable (e.g. for LLM judge-based scorers).
These methods are the core of your scorer, and you can implement them in any way you want. Be creative!
Here's a sample implementation that integrates everything we've covered:
```python
class SampleScorer(JudgevalScorer):
    ...

    def score_example(self, example, *args, **kwargs):
        try:
            # run_scorer_logic, justify_score, and make_logs are placeholders
            # for your own scoring, justification, and logging logic.
            self.score = run_scorer_logic(example)
            if self.include_reason:
                self.reason = justify_score(example, self.score)
            if self.verbose_mode:
                self.verbose_logs = make_logs(example, self.reason, self.score)
            self.success = self.score >= self.threshold
        except Exception as e:
            self.error = str(e)
            self.success = False

    async def a_score_example(self, example, *args, **kwargs):
        try:
            self.score = await a_run_scorer_logic(example)  # async version
            if self.include_reason:
                self.reason = justify_score(example, self.score)
            if self.verbose_mode:
                self.verbose_logs = make_logs(example, self.reason, self.score)
            self.success = self.score >= self.threshold
        except Exception as e:
            self.error = str(e)
            self.success = False
```
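To make this concrete, here's a minimal sketch of a purely algorithmic scorer that measures keyword coverage. The class name is hypothetical, and it assumes the `Example` object exposes an `actual_output` field; adapt it to your own scoring logic.

```python
from judgeval.scorers import JudgevalScorer

class KeywordCoverageScorer(JudgevalScorer):
    """Hypothetical scorer: fraction of required keywords present in the output."""

    def __init__(self, keywords, threshold=0.5):
        super().__init__(score_type="Keyword Coverage", threshold=threshold)
        self.keywords = [k.lower() for k in keywords]

    def score_example(self, example, *args, **kwargs):
        try:
            output = example.actual_output.lower()  # assumed Example field
            hits = sum(1 for k in self.keywords if k in output)
            self.score = hits / len(self.keywords) if self.keywords else 0.0
            self.reason = f"Matched {hits}/{len(self.keywords)} required keywords."
            self.success = self.score >= self.threshold
        except Exception as e:
            self.error = str(e)
            self.success = False

    async def a_score_example(self, example, *args, **kwargs):
        # The scoring logic is synchronous, so the async version simply delegates.
        return self.score_example(example, *args, **kwargs)

    @property
    def __name__(self):
        return "Keyword Coverage"
```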
Implement the `_success_check()` method
When executing an evaluation run, `judgeval` checks whether your scorer passed by calling its `_success_check()` method. You can implement this method in any way you want, but it should return a `bool`. Here's a perfectly valid implementation:
```python
class SampleScorer(JudgevalScorer):
    ...

    def _success_check(self) -> bool:
        if self.error is not None:
            return False
        return self.score >= self.threshold  # or return self.success if it was set
```
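With all three pieces in place, your custom scorer can be passed to an evaluation run just like a built-in scorer. Here's a minimal sketch, assuming the `JudgmentClient` and `Example` APIs from `judgeval`; the example fields and model name are illustrative:

```python
from judgeval import JudgmentClient
from judgeval.data import Example

client = JudgmentClient()

example = Example(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
)

# Custom scorers are passed exactly like built-in ones.
results = client.run_evaluation(
    examples=[example],
    scorers=[SampleScorer(threshold=0.7)],
    model="gpt-4o",  # judge model used by any LLM-based scoring logic
)
```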
Cookbooks
Code Style Scorers
Implement a scorer that evaluates the quality of code style, suitable for a PR review bot.
Cold Email Scorer
Implement a scorer that evaluates the quality of cold emails, suitable for a sales automation tool.