Custom Scorers
If your use case requires specialized evaluation, judgeval
allows you to implement custom scorers arbitrarily in code, using any LLM as a judge or library dependency.
They are automatically versioned and synced with the Judgment Platform.
Implement a CustomScorer
To implement your own scorer with more complex code logic, you must:
Inherit from the ExampleScorer
class
from judgeval.scorers.example_scorer import ExampleScorer
class ResolutionScorer(ExampleScorer):
name: str = "Resolution Scorer"
ExampleScorer has the following attributes that you can access:
Attribute | Type | Description | Default |
---|---|---|---|
name | str | The name of your scorer to be displayed on the Judgment platform. | "Custom" |
score | float | The score of the scorer. | N/A |
threshold | float | The threshold for the scorer. | 0.5 |
reason | str | A description for why the score was given. | N/A |
error | str | An error message if the scorer fails. | N/A |
additional_metadata | dict | Additional metadata to be added to the scorer | N/A |
Define your Custom Example Class
You can create your own custom Example class by inheriting from the base Example object. This allows you to configure any fields you want to score.
from judgeval.data import Example
class CustomerRequest(Example):
request: str
response: str
example = CustomerRequest(
request="Where is my package?",
response="Your package will arrive tomorrow at 10:00 AM.",
)
Implement the a_score_example()
method
The a_score_example()
method takes an Example
object and execute your scorer to produce a float
(between 0 and 1) score.
Optionally, you can include a reason to accompany the score if applicable (e.g. for LLM judge-based scorers).
These methods are the core of your scorer, and you can implement them in any way you want. Be creative!
class ResolutionScorer(ExampleScorer):
name: str = "Resolution Scorer"
# This is using the CustomerRequest class we defined in the previous step
async def a_score_example(self, example: CustomerRequest):
# Replace this logic with your own scoring logic
score = await scoring_function(example.request, example.response)
self.reason = justify_score(example.request, example.response, score)
return score
Implementation Example
Here is a basic implementation of implementing a ExampleScorer.
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer
client = JudgmentClient()
class CustomerRequest(Example):
request: str
response: str
class ResolutionScorer(ExampleScorer):
name: str = "Resolution Scorer"
async def a_score_example(self, example: CustomerRequest):
# Replace this logic with your own scoring logic
if "package" in example.response:
self.reason = "The response contains the word 'package'"
return 1
else:
self.reason = "The response does not contain the word 'package'"
return 0
example = CustomerRequest(request="Where is my package?", response="Your package will arrive tomorrow at 10:00 AM.")
res = client.run_evaluation(
examples=[example],
scorers=[ResolutionScorer()],
project_name="default_project",
)
Next Steps
Ready to use your custom scorers in production? Learn how to monitor agent behavior with online evaluations.
Monitor Agent Behavior in Production
Use Custom Scorers to continuously evaluate your agents in real-time production environments.