Custom Scorers
If none of judgeval's built-in scorers fit your evaluation criteria, you can easily build your own custom metric to be run through a JudgevalScorer.
JudgevalScorers are automatically integrated within judgeval's infrastructure, so you can:
- Run your own scorer with the same syntax as any other judgeval scorer.
- Use judgeval's batched evaluation infrastructure to execute scalable evaluation runs.
- Have your scorer's results viewed and analyzed in the Judgment platform.
You have complete flexibility in how you implement your JudgevalScorer, including using evaluations that aren't LLM judge-based, such as ROUGE or embedding similarity.
Guidelines for Implementing Custom Scorers
To implement your own custom scorer, you must:
1. Inherit from the JudgevalScorer class
This will help judgeval integrate your scorer into evaluation runs.
from judgeval.scorers import JudgevalScorer
class SampleScorer(JudgevalScorer):
    ...
2. Implement the __init__() method
JudgevalScorers have some required attributes that must be determined in the __init__() method.
For instance, you must set a threshold to determine what constitutes success/failure for a scorer.
There are additional optional attributes that can be set here for even more flexibility:
- score_type (str): The name of your scorer. This will be displayed in the Judgment platform.
- include_reason (bool): Whether your scorer includes a reason for the score in the results. Only for LLM judge-based scorers.
- async_mode (bool): Whether your scorer should be run asynchronously during evaluations.
- strict_mode (bool): Whether your scorer fails if the score is not perfect (1.0).
- verbose_mode (bool): Whether your scorer produces verbose logs.
- custom_example (bool): Whether your scorer should be run on custom examples.
class SampleScorer(JudgevalScorer):
    def __init__(
        self,
        threshold=0.5,
        score_type="Sample Scorer",
        include_reason=True,
        async_mode=True,
        strict_mode=False,
        verbose_mode=True
    ):
        super().__init__(score_type=score_type, threshold=threshold)
        self.threshold = 1 if strict_mode else threshold
        # Optional attributes
        self.include_reason = include_reason
        self.async_mode = async_mode
        self.strict_mode = strict_mode
        self.verbose_mode = verbose_mode
3. Implement the score_example() and a_score_example() methods
The score_example() and a_score_example() methods take an Example object and execute your scorer to produce a score (float) between 0 and 1.
Optionally, you can include a reason to accompany the score if applicable (e.g. for LLM judge-based scorers).
The only requirement for score_example() and a_score_example() is that they:
- Take an Example as an argument (you can add other arguments too)
- Set the self.score attribute
- Set the self.success attribute
You can optionally set the self.reason attribute, depending on your preference.
a_score_example() is simply the async version of score_example(), so the implementation should largely be identical.
These methods are the core of your scorer, and you can implement them in any way you want. Be creative!
Handling Errors
If you want to handle errors gracefully, you can use a try block and, in the except block, set the self.error attribute to the error message.
This will allow judgeval to catch the error but still execute the rest of an evaluation run, assuming you have multiple examples to evaluate.
Here's a sample implementation that integrates everything we've covered:
class SampleScorer(JudgevalScorer):
    ...
    def score_example(self, example, ...):
        try:
            self.score = run_scorer_logic(example)
            if self.include_reason:
                self.reason = justify_score(example, self.score)
            if self.verbose_mode:
                self.verbose_logs = make_logs(example, self.reason, self.score)
            self.success = self.score >= self.threshold
        except Exception as e:
            self.error = str(e)
            self.success = False

    async def a_score_example(self, example, ...):
        try:
            self.score = await a_run_scorer_logic(example)  # async version
            if self.include_reason:
                self.reason = justify_score(example, self.score)
            if self.verbose_mode:
                self.verbose_logs = make_logs(example, self.reason, self.score)
            self.success = self.score >= self.threshold
        except Exception as e:
            self.error = str(e)
            self.success = False
4. Implement the _success_check() method
When executing an evaluation run, judgeval will check if your scorer has passed the _success_check() method.
You can implement this method in any way you want, but it should return a bool. Here's a perfectly valid implementation:
class SampleScorer(JudgevalScorer):
    ...
    def _success_check(self):
        if self.error is not None:
            return False
        return self.score >= self.threshold  # or you can do self.success if set
5. Give your scorer a name
This makes it easy to sort by and find your scorer when viewing its results in the Judgment platform.
class SampleScorer(JudgevalScorer):
    ...
    @property
    def __name__(self):
        return "Sample Scorer"
Congratulations! 🎉
You've made your first custom judgeval scorer! Now that your scorer is implemented, you can run it on your own datasets just like any other judgeval scorer. Your scorer is fully integrated with judgeval's infrastructure, so you can view it on the Judgment platform too.
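As noted earlier, scorers don't have to be LLM judge-based. Below is a minimal sketch of a lexical-overlap scorer in the spirit of ROUGE, using only the Python standard library. The class name is hypothetical, and it assumes your Example objects expose actual_output and expected_output string attributes; adjust the field access to match your data.
from judgeval.scorers import JudgevalScorer

class TokenOverlapScorer(JudgevalScorer):
    """Hypothetical non-LLM scorer: Jaccard token overlap between outputs."""

    def __init__(self, threshold=0.5):
        super().__init__(score_type="Token Overlap Scorer", threshold=threshold)
        self.threshold = threshold

    def score_example(self, example):
        try:
            # Assumes Example exposes actual_output / expected_output strings
            actual = set(example.actual_output.lower().split())
            expected = set(example.expected_output.lower().split())
            union = actual | expected
            self.score = len(actual & expected) / len(union) if union else 0.0
            self.success = self.score >= self.threshold
        except Exception as e:
            self.error = str(e)
            self.success = False

    async def a_score_example(self, example):
        # The scoring logic is synchronous, so the async version simply delegates
        self.score_example(example)

    def _success_check(self):
        # Defensive check in case no error attribute was set
        if getattr(self, "error", None) is not None:
            return False
        return self.score >= self.threshold

    @property
    def __name__(self):
        return "Token Overlap Scorer"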
Implementing a Custom Scorer in TypeScript
You can also implement custom scorers using the Judgeval TypeScript SDK (@judgment/sdk). The principles are similar to Python, but the implementation details differ slightly.
1. Inherit from the JudgevalScorer class
Import the necessary classes and extend JudgevalScorer.
import { JudgevalScorer } from '@judgment/sdk'; // Adjust import path if needed
import { Example, ScorerData } from '@judgment/sdk/data';

export class SampleTsScorer extends JudgevalScorer {
  // ... scorer implementation ...
}
2. Implement the constructor
Set the required and optional attributes for your scorer. Call super() to initialize the base class.
export class SampleTsScorer extends JudgevalScorer {
  constructor(
    threshold: number = 0.5,
    score_type: string = "Sample TS Scorer",
    include_reason: boolean = true,
    async_mode: boolean = true, // Affects how runEvaluation schedules the scorer
    strict_mode: boolean = false,
    verbose_mode: boolean = true
  ) {
    // Pass core attributes to the base class constructor
    super(score_type, threshold, undefined, include_reason, async_mode, strict_mode, verbose_mode);
    // Adjust threshold based on strict_mode if necessary
    this.threshold = strict_mode ? 1.0 : threshold;
    // Store attributes if needed for scoreExample logic
    this.include_reason = include_reason;
    this.strict_mode = strict_mode;
    this.verbose_mode = verbose_mode;
  }
  // ...
}
3. Implement the scoreExample() method
This asynchronous method takes an Example object and performs your custom scoring logic. Unlike the Python version, which sets instance attributes, the TypeScript version must return a Promise<ScorerData> object containing the scoring results.
import { Example, ScorerData } from '@judgment/sdk/data';

export class SampleTsScorer extends JudgevalScorer {
  // ... constructor ...

  async scoreExample(example: Example): Promise<ScorerData> {
    let score: number | undefined;
    let reason: string | null = null;
    let verbose_logs: string | null = null;
    let error: string | null = null;
    let success: boolean = false;

    try {
      // --- Your scoring logic here ---
      // This example checks if actualOutput contains expectedOutput (case-insensitive)
      const actual = (Array.isArray(example.actualOutput) ? example.actualOutput.join('\n') : example.actualOutput)?.toLowerCase() || '';
      const expected = (Array.isArray(example.expectedOutput) ? example.expectedOutput.join('\n') : example.expectedOutput)?.toLowerCase() || '';
      score = actual.includes(expected) ? 1.0 : 0.0;
      // --- End scoring logic ---

      if (this.include_reason) {
        reason = score === 1.0 ? "Actual output contains expected output." : "Actual output does not contain expected output.";
      }
      if (this.verbose_mode) {
        verbose_logs = `Comparing (case-insensitive) if "${actual}" includes "${expected}"`;
      }

      // Set score attribute temporarily for _successCheck and determine success
      this.score = score;
      success = this._successCheck();
    } catch (e) {
      error = e instanceof Error ? e.message : String(e);
      success = false; // Ensure failure on error
      score = 0; // Default score on error
      reason = `Error during scoring: ${error}`;
      this.error = error; // Set error attribute for potential use in _successCheck
    }

    // Return the ScorerData object
    return {
      name: this.name, // Use the getter
      threshold: this.threshold,
      success: success, // Result from _successCheck
      score: score ?? 0, // Provide a default if undefined
      reason: reason,
      strict_mode: this.strict_mode,
      evaluation_model: "custom-logic-scorer", // Identify the scorer logic
      error: error,
      evaluation_cost: null, // Set if applicable (e.g., LLM calls)
      verbose_logs: verbose_logs,
      additional_metadata: this.additional_metadata || {} // Include if set
    };
  }
  // ...
}
scoreExample returns a ScorerData object rather than modifying instance attributes directly (like self.score and self.reason in the Python version). The returned object contains all the necessary information about the scoring result for that specific example.
4. Implement the _successCheck() method
This protected method determines if the score meets the threshold for success. It's typically called internally by scoreExample after calculating the score.
export class SampleTsScorer extends JudgevalScorer {
  // ... constructor and scoreExample ...

  protected _successCheck(): boolean {
    // Check if an error occurred during scoring
    if (this.error !== undefined && this.error !== null) {
      return false;
    }
    // Check if the score was successfully calculated
    if (this.score === undefined) {
      return false;
    }
    // Compare the score against the threshold
    return this.score >= this.threshold;
  }
  // ...
}
5. Implement the name getter
Provide a getter for the scorer's name, which will be used in results and in the Judgment platform.
export class SampleTsScorer extends JudgevalScorer {
  // ... other methods ...

  get name(): string {
    return "Sample TS Scorer"; // Return the name of your scorer
  }
}
With these steps, your TypeScript custom scorer is ready to be used with JudgmentClient.runEvaluation(), just like any built-in scorer.
Using a Custom Scorer
Once you've implemented your custom scorer, you can use it in the same way as any other scorer in judgeval.
Custom scorers can also be run in conjunction with other scorers in a single evaluation run (see the sketch after the example below).
from judgeval import JudgmentClient
from your_custom_scorer import SampleScorer

client = JudgmentClient()
sample_scorer = SampleScorer()

results = client.run_evaluation(
    examples=[example1],
    scorers=[sample_scorer],
    model="gpt-4o"
)
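For instance, here's a sketch that pairs your custom scorer with a built-in scorer in the same evaluation run. It assumes FaithfulnessScorer is importable from judgeval.scorers; swap in whichever built-in scorer you actually use, and note that example1 is assumed to be an Example you've already defined.
from judgeval import JudgmentClient
from judgeval.scorers import FaithfulnessScorer  # assumed built-in scorer; use any you like
from your_custom_scorer import SampleScorer

client = JudgmentClient()

results = client.run_evaluation(
    examples=[example1],  # an Example you've already defined
    scorers=[SampleScorer(), FaithfulnessScorer(threshold=0.7)],  # custom + built-in together
    model="gpt-4o"
)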
Custom Scorers with Custom Examples
If you want to use a custom scorer with a custom example, you can do so by passing the custom scorer and the custom example to the run_evaluation() method.
Note: Make sure to set the custom_example attribute to True in the __init__() method of your custom scorer (see the sketch after the example below).
from judgeval import JudgmentClient
from judgeval.data import CustomExample
client = JudgmentClient()

custom_example = CustomExample(
    input={
        "question": "What if these shoes don't fit?",
    },
    actual_output={
        "answer": "We offer a 30-day full refund at no extra cost.",
    },
    expected_output={
        "answer": "We offer a 30-day full refund at no extra cost.",
    },
)

scorer = CustomScorer(threshold=0.5)  # Your custom scorer

results = client.run_evaluation(
    examples=[custom_example],
    scorers=[scorer],
    model="gpt-4o-mini",
)
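As a reminder, the scorer itself must opt in to custom examples. Here's a minimal sketch of what that looks like in your scorer's __init__(), assuming a scorer like the hypothetical CustomScorer above and the custom_example flag described in the optional attributes earlier.
from judgeval.scorers import JudgevalScorer

class CustomScorer(JudgevalScorer):
    def __init__(self, threshold=0.5):
        super().__init__(score_type="Custom Scorer", threshold=threshold)
        self.threshold = threshold
        self.custom_example = True  # required so the scorer runs on CustomExample objects
    ...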
Real World Examples
You can find some real-world examples of how our community has used custom JudgevalScorers to evaluate their LLM systems in our cookbook repository!
Here are some of our favorites:
- Code Style Scorer - Evaluates code quality and style
- Cold Email Scorer - Evaluates the effectiveness of cold emails
For more examples and detailed documentation on custom scorers, check out our Custom Scorers Cookbook.