Custom Scorers
Score your agent behavior using code and LLMs.
judgeval provides abstractions for implementing custom scorers in code, giving you full flexibility over your scoring logic and use cases.
You can use any combination of code, custom LLMs as a judge, or library dependencies.
Implement a Custom Scorer
There are two types of custom scorers you can create:
- ExampleCustomScorer: For scoring individual examples
- TraceCustomScorer: For scoring traces (agent execution flows)
Create your scorer file
Create a new Python file for your scorer. Inherit from ExampleCustomScorer and implement the score method. You must specify a response type (Binary, Categorical, or Numeric) as a generic parameter.
from judgeval.v1.data import Example
from judgeval.v1.hosted import ExampleCustomScorer, BinaryResponse

class ResolutionScorer(ExampleCustomScorer[BinaryResponse]):
    async def score(self, data: Example) -> BinaryResponse:
        actual_output = data.get_property("actual_output")
        if "package" in actual_output:
            return BinaryResponse(
                value=True,
                reason="The response contains package information."
            )
        return BinaryResponse(
            value=False,
            reason="The response does not contain package information."
        )

The score method:
- Takes an Example as input
- Returns one of three response types:
  - BinaryResponse: For true/false evaluations with a value (bool) and a reason (string)
  - CategoricalResponse: For categorical evaluations with a value (string) and a reason (string)
  - NumericResponse: For numeric evaluations with a value (float) and a reason (string)
- All response types support optional citations to reference specific spans in your trace
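As mentioned in the introduction, a scorer can also use a custom LLM as a judge. The sketch below is a minimal, illustrative example rather than an official one: it assumes the openai package (added to the requirements file in the next step) and an OPENAI_API_KEY environment variable, and the prompt and parsing logic are placeholders.

import os

from openai import AsyncOpenAI
from judgeval.v1.data import Example
from judgeval.v1.hosted import ExampleCustomScorer, NumericResponse

class HelpfulnessJudge(ExampleCustomScorer[NumericResponse]):
    # Hypothetical LLM-as-judge scorer; assumes openai>=1.0.0 and OPENAI_API_KEY are available.
    async def score(self, data: Example) -> NumericResponse:
        actual_output = data.get_property("actual_output")
        client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
        completion = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": (
                    "Rate the helpfulness of this response from 0 to 1. "
                    "Reply with only the number.\n\n" + actual_output
                ),
            }],
        )
        # Parse the judge's rating and return it as the score
        rating = float(completion.choices[0].message.content.strip())
        return NumericResponse(
            value=rating,
            reason=f"LLM judge rated helpfulness at {rating}."
        )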
Create a requirements file
Create a requirements.txt file with any dependencies your scorer needs.
# Add any dependencies your scorer needs
# openai>=1.0.0
# numpy>=1.24.0

Create your scorer file
Create a new Python file for your scorer. Inherit from TraceCustomScorer and implement the score method. You must specify a response type (Binary, Categorical, or Numeric) as a generic parameter.
from judgeval.v1.data import Trace
from judgeval.v1.hosted import TraceCustomScorer, NumericResponse

class ToolCallScorer(TraceCustomScorer[NumericResponse]):
    async def score(self, data: Trace) -> NumericResponse:
        tool_calls = [span for span in data if span["span_kind"] == "tool"]
        return NumericResponse(
            value=float(len(tool_calls)),
            reason=f"Agent made {len(tool_calls)} tool call(s)."
        )

The score method:
- Takes a Trace (a list of TraceSpan objects) as input
- Returns one of three response types:
  - BinaryResponse: For true/false evaluations with a value (bool) and a reason (string)
  - CategoricalResponse: For categorical evaluations with a value (string) and a reason (string)
  - NumericResponse: For numeric evaluations with a value (float) and a reason (string)
- All response types support optional citations to reference specific spans in your trace
See Trace for all available TraceSpan properties.
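NumericResponse is only one option for trace scorers. As a rough sketch of a categorical variant, the scorer below buckets a trace by how heavily it uses tools; it relies only on the span_kind property used above, and the category labels and thresholds are arbitrary.

from judgeval.v1.data import Trace
from judgeval.v1.hosted import TraceCustomScorer, CategoricalResponse

class ToolUsageCategoryScorer(TraceCustomScorer[CategoricalResponse]):
    # Illustrative sketch: classifies a trace by how many tool spans it contains.
    async def score(self, data: Trace) -> CategoricalResponse:
        tool_calls = [span for span in data if span["span_kind"] == "tool"]
        if not tool_calls:
            return CategoricalResponse(
                value="no_tools",
                reason="The agent did not call any tools."
            )
        if len(tool_calls) <= 3:
            return CategoricalResponse(
                value="light_tool_use",
                reason=f"The agent made {len(tool_calls)} tool call(s)."
            )
        return CategoricalResponse(
            value="heavy_tool_use",
            reason=f"The agent made more than 3 tool calls ({len(tool_calls)})."
        )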
Create a requirements file
Create a requirements.txt file with any dependencies your scorer needs.
# Add any dependencies your scorer needs
# openai>=1.0.0
# numpy>=1.24.0

Upload Your Scorer
Once you've implemented your custom scorer, upload it to Judgment using the CLI.
Set your credentials
Set your Judgment API key and organization ID as environment variables:
export JUDGMENT_API_KEY="your-api-key"
export JUDGMENT_ORG_ID="your-org-id"

Upload the scorer
judgeval upload-scorer my_scorer.py -r requirements.txt -p my_project

You can also provide a custom name for the scorer:

judgeval upload-scorer my_scorer.py -r requirements.txt -n "Resolution Scorer" -p my_project

To overwrite an existing scorer with the same name:

judgeval upload-scorer my_scorer.py -r requirements.txt -o -p my_project

CLI Options
- scorer_file_path (str, required)
- --requirement, -r (str)
- --name, -n (str)
- --project, -p (str)
- --overwrite, -o (flag)

Set Environment Variables
You can also set environment variables for your custom scorer on the Judgment platform.
Navigate to the Scorers page within your project and click on the Custom Scorers tab


Click on the custom scorer you would like to add environment variables to


Click on the "Environment variables" button on the top right of the page


Enter the environment variable and click the "Add" button. Repeat this for every environment variable your custom scorer needs.
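Environment variables added here are meant to be read by your scorer code at runtime, for example to hold API keys. The snippet below is a minimal sketch, assuming the platform exposes these variables to the scorer process and that you configured a variable named MY_SCORER_API_KEY (a placeholder name), read with Python's standard os module:

import os

# "MY_SCORER_API_KEY" is a placeholder; use whatever name you configured on the platform.
api_key = os.environ.get("MY_SCORER_API_KEY")
if api_key is None:
    raise RuntimeError("MY_SCORER_API_KEY is not set for this custom scorer.")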
Use Your Uploaded Scorer
After uploading, you can use your custom scorer in evaluations by retrieving it through the client factory and running it within a traced function.
Use async_evaluate to run an ExampleCustomScorer on a specific example:
from judgeval import Judgeval
from judgeval.v1.data import Example

judgeval = Judgeval(project_name="my_project")

# Retrieve your uploaded scorer by name
scorer = judgeval.scorers.custom_scorer.get(name="Resolution Scorer")

# Create a tracer
tracer = judgeval.tracer.create()

@tracer.observe(span_type="function")
def my_agent(question: str) -> str:
    # Your agent logic here
    response = "Your package will arrive tomorrow at 10:00 AM."

    # Evaluate the response asynchronously
    tracer.async_evaluate(
        scorer=scorer,
        example=Example.create(
            input=question,
            actual_output=response
        )
    )
    return response

# Run your agent
result = my_agent("Where is my package?")

Use async_trace_evaluate to run a TraceCustomScorer on the current trace:
from judgeval import Judgeval

judgeval = Judgeval(project_name="my_project")

# Retrieve your uploaded trace scorer by name
scorer = judgeval.scorers.custom_scorer.get(name="ToolCallScorer")

# Create a tracer
tracer = judgeval.tracer.create()

@tracer.observe(span_type="tool")
def search_database(query: str) -> str:
    # Tool logic here
    return f"Results for: {query}"

@tracer.observe(span_type="function")
def my_agent(question: str) -> str:
    # Your agent logic with tool calls
    search_result = search_database(question)
    response = f"Based on my search: {search_result}"

    # Evaluate the entire trace asynchronously
    tracer.async_trace_evaluate(scorer=scorer)
    return response

# Run your agent
result = my_agent("What products are available?")

Full Example
Here's a complete example of creating, uploading, and using a custom scorer:
Create the scorer
from judgeval.v1.data import Example
from judgeval.v1.hosted import ExampleCustomScorer, CategoricalResponse

class SentimentScorer(ExampleCustomScorer[CategoricalResponse]):
    async def score(self, data: Example) -> CategoricalResponse:
        actual_output = data.get_property("actual_output")

        positive_words = ["great", "excellent", "happy", "pleased", "thank"]
        negative_words = ["sorry", "unfortunately", "cannot", "unable", "problem"]

        positive_count = sum(1 for word in positive_words if word in actual_output.lower())
        negative_count = sum(1 for word in negative_words if word in actual_output.lower())

        if positive_count > negative_count:
            return CategoricalResponse(
                value="positive",
                reason=f"Response has positive sentiment ({positive_count} positive, {negative_count} negative words)."
            )
        elif negative_count > positive_count:
            return CategoricalResponse(
                value="negative",
                reason=f"Response has negative sentiment ({positive_count} positive, {negative_count} negative words)."
            )
        return CategoricalResponse(
            value="neutral",
            reason="Response has neutral sentiment."
        )

Create requirements.txt
# No external dependencies needed for this scorer

Upload the scorer
judgeval upload-scorer sentiment_scorer.py -r requirements.txt -n "Sentiment Scorer" -p sentiment_analysis

Use the scorer in an evaluation
from judgeval import Judgeval
from judgeval.v1.data import Example

judgeval = Judgeval(project_name="sentiment_analysis")

# Retrieve the uploaded scorer
scorer = judgeval.scorers.custom_scorer.get(name="Sentiment Scorer")

# Create a tracer
tracer = judgeval.tracer.create()

@tracer.observe(span_type="function")
def customer_support_agent(question: str) -> str:
    # Your agent logic here
    response = "I had a great experience! The service was excellent."

    # Evaluate the response
    tracer.async_evaluate(
        scorer=scorer,
        example=Example.create(
            input=question,
            actual_output=response
        )
    )
    return response

# Run your agent
result = customer_support_agent("How was your experience?")

Create the scorer
from judgeval.v1.data import Trace
from judgeval.v1.hosted import TraceCustomScorer, NumericResponse

class ToolCallScorer(TraceCustomScorer[NumericResponse]):
    async def score(self, data: Trace) -> NumericResponse:
        tool_calls = [span for span in data if span["span_kind"] == "tool"]
        return NumericResponse(
            value=float(len(tool_calls)),
            reason=f"Agent made {len(tool_calls)} tool call(s)."
        )

Create requirements.txt
# No external dependencies needed for this scorer

Upload the scorer
judgeval upload-scorer tool_call_scorer.py -r requirements.txt -n "Tool Call Scorer" -p agent_monitoring

Use the scorer in an evaluation
from judgeval import Judgeval

judgeval = Judgeval(project_name="agent_monitoring")

# Retrieve the uploaded trace scorer
scorer = judgeval.scorers.custom_scorer.get(name="Tool Call Scorer")

# Create a tracer
tracer = judgeval.tracer.create()

@tracer.observe(span_type="tool")
def search_products(query: str) -> str:
    return f"Found 5 products matching: {query}"

@tracer.observe(span_type="function")
def shopping_agent(question: str) -> str:
    # Agent makes tool calls
    products = search_products(question)
    response = f"Here are the results: {products}"

    # Evaluate the entire trace
    tracer.async_trace_evaluate(scorer=scorer)
    return response

# Run your agent
result = shopping_agent("Show me laptops under $1000")