Custom Scorers

Score your agent behavior using code and LLMs.

judgeval provides abstractions to implement custom scorers arbitrarily in code, enabling full flexibility in your scoring logic and use cases. You can use any combination of code, custom LLMs as a judge, or library dependencies.

Your scorers can be automatically versioned and synced with the Judgment Platform to be run in production with zero latency impact.

Currently, Python is the only supported language for creating Custom Scorers.

Implement a CustomScorer

Inherit from the ExampleScorer class

customer_request_scorer.py
from judgeval.scorers.example_scorer import ExampleScorer

class ResolutionScorer(ExampleScorer):
    name: str = "Resolution Scorer"

ExampleScorer has the following attributes that you can access:

name:str

The name of your scorer to be displayed on the Judgment platform.

Default: "Custom"
score:float
The score of the scorer.
threshold:float
The threshold for the scorer.
Default: 0.5
reason:str
A description for why the score was given.
error:str
An error message if the scorer fails.
additional_metadata:dict

Additional metadata to be added to the scorer

Define your Custom Example Class

You can create your own custom Example class by inheriting from the base Example object. This allows you to configure any fields you want to score.

custom_example.py
from judgeval.data import Example

class CustomerRequest(Example):
    request: str
    response: str

example = CustomerRequest(
    request="Where is my package?",
    response="Your package will arrive tomorrow at 10:00 AM.",
)

Implement the a_score_example() method

The a_score_example() method takes an Example object and executes your scorer asynchronously to produce a float (between 0 and 1) score.

The only requirement for a_score_example() is that it:

  • Take an Example as an argument
  • Returns a float between 0 and 1

You can optionally set the self.reason attribute, depending on your preference.

This method is the core of your scorer, and you can implement it in any way you want.

You can optionally include a reason to accompany the score if applicable (e.g. for LLM judge-based scorers).

example_scorer.py
class ResolutionScorer(ExampleScorer):
    name: str = "Resolution Scorer"

    # This is using the CustomerRequest class we defined in the previous step
    async def a_score_example(self, example: CustomerRequest):
        # Replace this logic with your own scoring logic
        score = await scoring_function(example.request, example.response)
        self.reason = justify_score(example.request, example.response, score)
        return score

Implementation Example

Here is a basic implementation of implementing a ExampleScorer.

happiness_scorer.py
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer

client = JudgmentClient()

class CustomerRequest(Example):
    request: str
    response: str

class ResolutionScorer(ExampleScorer):
    name: str = "Resolution Scorer"

    async def a_score_example(self, example: CustomerRequest):
        # Replace this logic with your own scoring logic
        if "package" in example.response:
            self.reason = "The response contains the word 'package'"
            return 1
        else:
            self.reason = "The response does not contain the word 'package'"
            return 0

example = CustomerRequest(
  request="Where is my package?",
  response="Your package will arrive tomorrow at 10:00 AM."
)
res = client.run_evaluation(
    examples=[example],
    scorers=[ResolutionScorer()],
    project_name="default_project",
)