Custom Scorers
Score your agent behavior using code and LLMs.
judgeval provides abstractions for implementing custom scorers in arbitrary code, giving you full flexibility over your scoring logic and use cases.
You can use any combination of code, custom LLMs as a judge, or library dependencies.
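For the LLM-as-judge route, only the prompt scaffolding is sketched here; the actual model call (via whatever client you list in requirements.txt, e.g. openai) is up to you. The function name below is illustrative, not part of the judgeval API:

```python
def build_judge_prompt(question: str, answer: str) -> str:
    # Illustrative helper for an LLM-as-judge scorer: build a prompt that
    # asks the model for a 0-1 score. The model call itself is omitted.
    return (
        "Rate how well the answer resolves the question on a scale from 0 to 1.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with only the number."
    )

prompt = build_judge_prompt("Where is my package?", "It arrives tomorrow.")
```

You would parse the model's reply into a float and return it in a CustomScorerResult, as shown in the scorer examples below.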
Implement a Custom Scorer
There are two types of custom scorers you can create:
- ExampleScorer: For scoring individual examples
- TraceScorer: For scoring traces (agent execution flows)
Create your scorer file
Create a new Python file for your scorer. Inherit from ExampleScorer and implement the score method.
```python
from judgeval.v1.data import Example
from judgeval.v1.scorers import ExampleScorer, CustomScorerResult

class ResolutionScorer(ExampleScorer):
    def score(self, example: Example) -> CustomScorerResult:
        actual_output = example.get_property("actual_output")
        if "package" in actual_output:
            return CustomScorerResult(
                score=1.0,
                reason="The response contains package information."
            )
        return CustomScorerResult(
            score=0.0,
            reason="The response does not contain package information."
        )
```

The score method:
- Takes an Example as input
- Returns a CustomScorerResult with a score (a float between 0 and 1) and a reason (a string explanation)
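That contract is easy to unit-test before wrapping the logic in a scorer. The sketch below mirrors the substring check from ResolutionScorer as a plain function (the function name is ours, not part of judgeval):

```python
def resolution_score(actual_output: str) -> tuple[float, str]:
    # Same binary check as ResolutionScorer.score, minus the judgeval types
    if "package" in actual_output:
        return 1.0, "The response contains package information."
    return 0.0, "The response does not contain package information."

score, reason = resolution_score("Your package ships today.")  # score == 1.0
```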
Create a requirements file
Create a requirements.txt file with any dependencies your scorer needs.
```
# Add any dependencies your scorer needs
# openai>=1.0.0
# numpy>=1.24.0
```

Create your scorer file
Create a new Python file for your scorer. Inherit from TraceScorer and implement the score method.
```python
from judgeval.v1.data import Trace
from judgeval.v1.scorers import TraceScorer, CustomScorerResult

class ToolCallScorer(TraceScorer):
    def score(self, trace: Trace) -> CustomScorerResult:
        tool_calls = [span for span in trace if span["span_kind"] == "tool"]
        if len(tool_calls) > 0:
            return CustomScorerResult(
                score=1.0,
                reason=f"Agent made {len(tool_calls)} tool call(s)."
            )
        return CustomScorerResult(
            score=0.0,
            reason="Agent did not make any tool calls."
        )
```

The score method:
- Takes a Trace (a list of TraceSpan objects) as input
- Returns a CustomScorerResult with a score (a float between 0 and 1) and a reason (a string explanation)
See Trace for all available TraceSpan properties.
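The span filter used above can be exercised locally with plain dictionaries. The dicts below are illustrative stand-ins for TraceSpan objects and assume only the span_kind key shown in the example:

```python
# Illustrative stand-ins for TraceSpan objects
sample_trace = [
    {"span_kind": "function", "name": "my_agent"},
    {"span_kind": "tool", "name": "search_database"},
    {"span_kind": "tool", "name": "fetch_inventory"},
]

# Same filter as ToolCallScorer.score
tool_calls = [span for span in sample_trace if span["span_kind"] == "tool"]
print(len(tool_calls))  # 2
```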
Create a requirements file
Create a requirements.txt file with any dependencies your scorer needs.
```
# Add any dependencies your scorer needs
# openai>=1.0.0
# numpy>=1.24.0
```

Upload Your Scorer
Once you've implemented your custom scorer, upload it to Judgment using the CLI.
Set your credentials
Set your Judgment API key and organization ID as environment variables:
```shell
export JUDGMENT_API_KEY="your-api-key"
export JUDGMENT_ORG_ID="your-org-id"
```

Upload the scorer
```shell
judgeval upload-scorer my_scorer.py -r requirements.txt -p my_project
```

You can also provide a custom name for the scorer:

```shell
judgeval upload-scorer my_scorer.py -r requirements.txt -n "Resolution Scorer" -p my_project
```

To overwrite an existing scorer with the same name:

```shell
judgeval upload-scorer my_scorer.py -r requirements.txt -o -p my_project
```

CLI Options
- scorer_file_path (str, required): path to the Python file containing your scorer
- --requirement, -r (str): path to the requirements file
- --name, -n (str): custom name for the scorer
- --project, -p (str): project to upload the scorer to
- --overwrite, -o (flag): overwrite an existing scorer with the same name

Set Environment Variables
You can also set environment variables for your custom scorer on the Judgment platform.
Navigate to the Scorers page within your project and click on the Custom Scorers tab.

Click on the custom scorer you would like to add environment variables to.

Click the "Environment variables" button at the top right of the page.

Enter an environment variable and click the "Add" button. Repeat this for every environment variable your custom scorer needs.
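Inside your scorer code, a platform-configured variable can then be read with the standard library. SCORE_THRESHOLD below is an illustrative variable name, not something judgeval defines:

```python
import os

def get_threshold(default: float = 0.5) -> float:
    # Read an illustrative SCORE_THRESHOLD variable set on the platform,
    # falling back to a default when it is not configured.
    raw = os.environ.get("SCORE_THRESHOLD")
    return float(raw) if raw is not None else default
```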
Use Your Uploaded Scorer
After uploading, you can use your custom scorer in evaluations by retrieving it through the client factory and running it within a traced function.
Use async_evaluate to run an ExampleScorer on a specific example:
```python
from judgeval import Judgeval
from judgeval.v1.data import Example

judgeval = Judgeval(project_name="my_project")

# Retrieve your uploaded scorer by name
scorer = judgeval.scorers.custom_scorer.get(name="Resolution Scorer")

# Create a tracer
tracer = judgeval.tracer.create()

@tracer.observe(span_type="function")
def my_agent(question: str) -> str:
    # Your agent logic here
    response = "Your package will arrive tomorrow at 10:00 AM."

    # Evaluate the response asynchronously
    tracer.async_evaluate(
        scorer=scorer,
        example=Example.create(
            input=question,
            actual_output=response
        )
    )
    return response

# Run your agent
result = my_agent("Where is my package?")
```

Use async_trace_evaluate to run a TraceScorer on the current trace:
```python
from judgeval import Judgeval

judgeval = Judgeval(project_name="my_project")

# Retrieve your uploaded trace scorer by name
scorer = judgeval.scorers.custom_scorer.get(name="ToolCallScorer")

# Create a tracer
tracer = judgeval.tracer.create()

@tracer.observe(span_type="tool")
def search_database(query: str) -> str:
    # Tool logic here
    return f"Results for: {query}"

@tracer.observe(span_type="function")
def my_agent(question: str) -> str:
    # Your agent logic with tool calls
    search_result = search_database(question)
    response = f"Based on my search: {search_result}"

    # Evaluate the entire trace asynchronously
    tracer.async_trace_evaluate(scorer=scorer)
    return response

# Run your agent
result = my_agent("What products are available?")
```

Full Example
Here's a complete example of creating, uploading, and using a custom scorer:
Create the scorer
```python
from judgeval.v1.data import Example
from judgeval.v1.scorers import ExampleScorer, CustomScorerResult

class SentimentScorer(ExampleScorer):
    def score(self, example: Example) -> CustomScorerResult:
        actual_output = example.get_property("actual_output")
        positive_words = ["great", "excellent", "happy", "pleased", "thank"]
        negative_words = ["sorry", "unfortunately", "cannot", "unable", "problem"]

        positive_count = sum(1 for word in positive_words if word in actual_output.lower())
        negative_count = sum(1 for word in negative_words if word in actual_output.lower())

        if positive_count > negative_count:
            return CustomScorerResult(
                score=1.0,
                reason=f"Response has positive sentiment ({positive_count} positive, {negative_count} negative words)."
            )
        elif negative_count > positive_count:
            return CustomScorerResult(
                score=0.0,
                reason=f"Response has negative sentiment ({positive_count} positive, {negative_count} negative words)."
            )
        return CustomScorerResult(
            score=0.5,
            reason="Response has neutral sentiment."
        )
```

Create requirements.txt

```
# No external dependencies needed for this scorer
```

Upload the scorer

```shell
judgeval upload-scorer sentiment_scorer.py -r requirements.txt -n "Sentiment Scorer" -p sentiment_analysis
```

Use the scorer in an evaluation
```python
from judgeval import Judgeval
from judgeval.v1.data import Example

judgeval = Judgeval(project_name="sentiment_analysis")

# Retrieve the uploaded scorer
scorer = judgeval.scorers.custom_scorer.get(name="Sentiment Scorer")

# Create a tracer
tracer = judgeval.tracer.create()

@tracer.observe(span_type="function")
def customer_support_agent(question: str) -> str:
    # Your agent logic here
    response = "I had a great experience! The service was excellent."

    # Evaluate the response
    tracer.async_evaluate(
        scorer=scorer,
        example=Example.create(
            input=question,
            actual_output=response
        )
    )
    return response

# Run your agent
result = customer_support_agent("How was your experience?")
```

Create the scorer
```python
from judgeval.v1.data import Trace
from judgeval.v1.scorers import TraceScorer, CustomScorerResult

class ToolCallScorer(TraceScorer):
    def score(self, trace: Trace) -> CustomScorerResult:
        tool_calls = [span for span in trace if span["span_kind"] == "tool"]
        if len(tool_calls) > 0:
            return CustomScorerResult(
                score=1.0,
                reason=f"Agent made {len(tool_calls)} tool call(s)."
            )
        return CustomScorerResult(
            score=0.0,
            reason="Agent did not make any tool calls."
        )
```

Create requirements.txt

```
# No external dependencies needed for this scorer
```

Upload the scorer

```shell
judgeval upload-scorer tool_call_scorer.py -r requirements.txt -n "Tool Call Scorer" -p agent_monitoring
```

Use the scorer in an evaluation
```python
from judgeval import Judgeval

judgeval = Judgeval(project_name="agent_monitoring")

# Retrieve the uploaded trace scorer
scorer = judgeval.scorers.custom_scorer.get(name="Tool Call Scorer")

# Create a tracer
tracer = judgeval.tracer.create()

@tracer.observe(span_type="tool")
def search_products(query: str) -> str:
    return f"Found 5 products matching: {query}"

@tracer.observe(span_type="function")
def shopping_agent(question: str) -> str:
    # Agent makes tool calls
    products = search_products(question)
    response = f"Here are the results: {products}"

    # Evaluate the entire trace
    tracer.async_trace_evaluate(scorer=scorer)
    return response

# Run your agent
result = shopping_agent("Show me laptops under $1000")
```