Custom Scorers

If none of judgeval's built-in scorers fit your evaluation criteria, you can easily build your own custom metric to be run through a JudgevalScorer.

JudgevalScorers are automatically integrated within judgeval's infrastructure, so you can:

Run your own scorer with the same syntax as any other judgeval scorer.
Use judgeval's batched evaluation infrastructure to execute scalable evaluation runs.
Have your scorer's results be viewed and analyzed in the Judgment platform.

Be creative with your custom scorers! You can measure anything you want using a JudgevalScorer, including using evaluations that aren't LLM judge-based such as ROUGE or embedding similarity.

Guidelines for Implementing Custom Scorers

To implement your own custom scorer, you must:

1. Inherit from the `JudgevalScorer` class

This will help judgeval integrate your scorer into evaluation runs.

from judgeval.scorers import JudgevalScorer

class SampleScorer(JudgevalScorer):
    ...

2. Implement the `init()` method

JudgevalScorers have some required attributes that must be determined in the __init__() method. For instance, you must set a threshold to determine what constitutes success/failure for a scorer.

There are additional optional attributes that can be set here for even more flexibility:

score_type (str): The name of your scorer. This will be displayed in the Judgment platform.
include_reason (bool): Whether your scorer includes a reason for the score in the results. Only for LLM judge-based scorers.
async_mode (bool): Whether your scorer should be run asynchronously during evaluations.
strict_mode (bool): Whether your scorer fails if the score is not perfect (1.0).
verbose_mode (bool): Whether your scorer produces verbose logs.
custom_example (bool): Whether your scorer should be run on custom examples.

class SampleScorer(JudgevalScorer):
    def __init__(
        self,
        threshold=0.5,
        score_type="Sample Scorer",
        include_reason=True,
        async_mode=True,
        strict_mode=False,
        verbose_mode=True
    ):
        super().__init__(score_type=score_type, threshold=threshold)
        self.threshold = 1 if strict_mode else threshold
        # Optional attributes
        self.include_reason = include_reason
        self.async_mode = async_mode
        self.strict_mode = strict_mode
        self.verbose_mode = verbose_mode

3. Implement the `score_example()` and `a_score_example()` methods

The score_example() and a_score_example() methods take an Example object and execute your scorer to produce a score (float) between 0 and 1. Optionally, you can include a reason to accompany the score if applicable (e.g. for LLM judge-based scorers).

The only requirement for score_example() and a_score_example() is that they:

Take an Example as an argument (you can add other arguments too)
Set the self.score attribute
Set the self.success attribute

You can optionally set the self.reason attribute, depending on your preference.

a_score_example() is simply the async version of score_example(), so the implementation should largely be identical.

These methods are the core of your scorer, and you can implement them in any way you want. Be creative!

Handling Errors

If you want to handle errors gracefully, you can use a try block and in the except block, set the self.error attribute to the error message. This will allow judgeval to catch the error but still execute the rest of an evaluation run, assuming you have multiple examples to evaluate.

Here's a sample implementation that integrates everything we've covered:

class SampleScorer(JudgevalScorer):
    ...

    def score_example(self, example, ...):
        try:
            self.score = run_scorer_logic(example)
            if self.include_reason:
                self.reason = justify_score(example, self.score)
            if self.verbose_mode:
                self.verbose_logs = make_logs(example, self.reason, self.score)
            self.success = self.score >= self.threshold
        except Exception as e:
            self.error = str(e)
            self.success = False
    
    async def a_score_example(self, example, ...):
        try:
            self.score = await a_run_scorer_logic(example)  # async version
            if self.include_reason:
                self.reason = justify_score(example, self.score)
            if self.verbose_mode:
                self.verbose_logs = make_logs(example, self.reason, self.score)
            self.success = self.score >= self.threshold
        except Exception as e:
            self.error = str(e)
            self.success = False

4. Implement the `_success_check()` method

When executing an evaluation run, judgeval will check if your scorer has passed the _success_check() method.

You can implement this method in any way you want, but it should return a bool. Here's a perfectly valid implementation:

class SampleScorer(JudgevalScorer):
    ...

    def _success_check(self):
        if self.error is not None:
            return False
        return self.score >= self.threshold  # or you can do self.success if set

5. Give your scorer a name

This is so that when displaying your scorer's results in the Judgment platform, you can easily sort by and find your scorer.

class SampleScorer(JudgevalScorer):
    ...

    @property
    def __name__(self):
        return "Sample Scorer"

Congratulations! 🎉

You've made your first custom judgeval scorer! Now that your scorer is implemented, you can run it on your own datasets just like any other judgeval scorer. Your scorer is fully integrated with judgeval's infrastructure so you can view it on the Judgment platform too.

Implementing a Custom Scorer in TypeScript

You can also implement custom scorers using the Judgeval TypeScript SDK (@judgment/sdk). The principles are similar to Python, but the implementation details differ slightly.

1. Inherit from the `JudgevalScorer` class

Import the necessary classes and extend JudgevalScorer.

import { JudgevalScorer } from '@judgment/sdk'; // Adjust import path if needed
import { Example } from '@judgment/sdk/data';
import { ScorerData } from '@judgment/sdk/data';

export class SampleTsScorer extends JudgevalScorer {
    // ... scorer implementation ...
}

2. Implement the `constructor`

Set the required and optional attributes for your scorer. Call super() to initialize the base class.

export class SampleTsScorer extends JudgevalScorer {
    constructor(
        threshold: number = 0.5,
        score_type: string = "Sample TS Scorer",
        include_reason: boolean = true,
        async_mode: boolean = true, // Affects how runEvaluation schedules the scorer
        strict_mode: boolean = false,
        verbose_mode: boolean = true
    ) {
        // Pass core attributes to the base class constructor
        super(score_type, threshold, undefined, include_reason, async_mode, strict_mode, verbose_mode);
        // Adjust threshold based on strict_mode if necessary
        this.threshold = strict_mode ? 1.0 : threshold;
        // Store attributes if needed for scoreExample logic
        this.include_reason = include_reason;
        this.strict_mode = strict_mode;
        this.verbose_mode = verbose_mode;
    }
    // ...
}

3. Implement the `scoreExample()` method

This asynchronous method takes an Example object and performs your custom scoring logic. Unlike the Python version which sets instance attributes, the TypeScript version must return a Promise<ScorerData> object containing the scoring results.

import { Example } from '@judgment/sdk/data';
import { ScorerData } from '@judgment/sdk/data';

export class SampleTsScorer extends JudgevalScorer {
    // ... constructor ...

    async scoreExample(example: Example): Promise<ScorerData> {
        let score: number | undefined;
        let reason: string | null = null;
        let verbose_logs: string | null = null;
        let error: string | null = null;
        let success: boolean = false;

        try {
            // --- Your scoring logic here ---
            // This example checks if actualOutput contains expectedOutput (case-insensitive)
            const actual = (Array.isArray(example.actualOutput) ? example.actualOutput.join('\\n') : example.actualOutput)?.toLowerCase() || '';
            const expected = (Array.isArray(example.expectedOutput) ? example.expectedOutput.join('\\n') : example.expectedOutput)?.toLowerCase() || '';
            score = actual.includes(expected) ? 1.0 : 0.0;
            // --- End scoring logic ---

            if (this.include_reason) {
                reason = score === 1.0 ? "Actual output contains expected output." : "Actual output does not contain expected output.";
            }
            if (this.verbose_mode) {
                verbose_logs = `Comparing (case-insensitive) if "${actual}" includes "${expected}"`;
            }

            // Set score attribute temporarily for _successCheck and determine success
            this.score = score;
            success = this._successCheck();

        } catch (e) {
            error = e instanceof Error ? e.message : String(e);
            success = false; // Ensure failure on error
            score = 0; // Default score on error
            reason = `Error during scoring: ${error}`;
            this.error = error; // Set error attribute for potential use in _successCheck
        }

        // Return the ScorerData object
        return {
            name: this.name, // Use the getter
            threshold: this.threshold,
            success: success, // Result from _successCheck
            score: score ?? 0, // Provide a default if undefined
            reason: reason,
            strict_mode: this.strict_mode,
            evaluation_model: "custom-logic-scorer", // Identify the scorer logic
            error: error,
            evaluation_cost: null, // Set if applicable (e.g., LLM calls)
            verbose_logs: verbose_logs,
            additional_metadata: this.additional_metadata || {} // Include if set
        };
    }
    // ...
}

The core difference from the Python implementation is that scoreExample returns a ScorerData object rather than modifying instance attributes directly (like self.score, self.reason, etc.). The returned object contains all the necessary information about the scoring result for that specific example.

4. Implement the `_successCheck()` method

This protected method determines if the score meets the threshold for success. It's typically called internally by scoreExample after calculating the score.

export class SampleTsScorer extends JudgevalScorer {
    // ... constructor and scoreExample ...

    protected _successCheck(): boolean {
        // Check if an error occurred during scoring
        if (this.error !== undefined && this.error !== null) {
            return false;
        }
        // Check if the score was successfully calculated
        if (this.score === undefined) {
             return false;
        }
        // Compare the score against the threshold
        return this.score >= this.threshold;
    }
    // ...
}

5. Implement the `name` getter

Provide a getter for the scorer's name, which will be used in results and the Judgment platform.

export class SampleTsScorer extends JudgevalScorer {
    // ... other methods ...

    get name(): string {
        return "Sample TS Scorer"; // Return the name of your scorer
    }
}

With these steps, your TypeScript custom scorer is ready to be used with JudgmentClient.runEvaluation(), just like any built-in scorer.

Using a Custom Scorer

Once you've implemented your custom scorer, you can use it in the same way as any other scorer in judgeval. They can be run in conjunction with other scorers in a single evaluation run!

from judgeval import JudgmentClient
from your_custom_scorer import SampleScorer

client = JudgmentClient()
sample_scorer = SampleScorer()

results = client.run_evaluation(
    examples=[example1],
    scorers=[sample_scorer],
    model="gpt-4o"
)

Custom Scorers with Custom Examples

If you want to use a custom scorer with a custom example, you can do so by passing the custom scorer and custom example to the run_evaluation() method.

Make sure to set the custom_example attribute to True in the __init__() method of your custom scorer.

from judgeval import JudgmentClient
from judgeval.data import CustomExample

client = JudgmentClient()
custom_example = CustomExample(
    input={
        "question": "What if these shoes don't fit?",
    },
    actual_output={
        "answer": "We offer a 30-day full refund at no extra cost.",
    },
    expected_output={
        "answer": "We offer a 30-day full refund at no extra cost.",
    },
)

scorer = CustomScorer(threshold=0.5) # Your custom scorer
results = client.run_evaluation(
    examples=[custom_example],
    scorers=[scorer],
    model="gpt-4o-mini",
)

Real World Examples

You can find some real world examples of how our community has used custom JudgevalScorers to evaluate their LLM systems in our cookbook repository! Here are some of our favorites:

Code Style Scorer - Evaluates code quality and style
Cold Email Scorer - Evaluates the effectiveness of cold emails

For more examples and detailed documentation on custom scorers, check out our Custom Scorers Cookbook.

Custom Scorers

On this page