Custom Scorers

Score your agent behavior using code and LLMs.

judgeval provides abstractions for implementing custom scorers in arbitrary code, giving you full flexibility over your scoring logic and use cases. You can use any combination of code, a custom LLM as a judge, or library dependencies.

Currently, Python is the only supported language for creating Custom Scorers.

Implement a Custom Scorer

There are two types of custom scorers you can create: ExampleCustomScorer, which scores a single Example, and TraceCustomScorer, which scores an entire Trace. The steps below cover each in turn.

Create your scorer file

Create a new Python file for your scorer. Inherit from ExampleCustomScorer and implement the score method. You must specify a response type (Binary, Categorical, or Numeric) as a generic parameter.

my_scorer.py
from judgeval.v1.data import Example
from judgeval.v1.hosted import ExampleCustomScorer, BinaryResponse

class ResolutionScorer(ExampleCustomScorer[BinaryResponse]):
    async def score(self, data: Example) -> BinaryResponse:
        actual_output = data.get_property("actual_output")

        if "package" in actual_output:
            return BinaryResponse(
                value=True,
                reason="The response contains package information."
            )

        return BinaryResponse(
            value=False,
            reason="The response does not contain package information."
        )

The score method:

  • Takes an Example as input
  • Returns one of three response types:
    • BinaryResponse: For true/false evaluations with a value (bool) and reason (string)
    • CategoricalResponse: For categorical evaluations with a value (string) and reason (string)
    • NumericResponse: For numeric evaluations with a value (float) and reason (string)
  • All response types support optional citations to reference specific spans in your trace
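Before uploading, it can be useful to exercise the scorer locally. The sketch below is only a sanity check, not a required step; it assumes ResolutionScorer takes no constructor arguments and that Example.create accepts the fields shown later in this guide.

import asyncio

from judgeval.v1.data import Example
from my_scorer import ResolutionScorer

# Build an example the same way you would when evaluating in a trace.
example = Example.create(
    input="Where is my package?",
    actual_output="Your package will arrive tomorrow at 10:00 AM.",
)

# score() is async, so drive it with asyncio for a one-off check.
result = asyncio.run(ResolutionScorer().score(example))
print(result.value, result.reason)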

Create a requirements file

Create a requirements.txt file with any dependencies your scorer needs.

requirements.txt
# Add any dependencies your scorer needs
# openai>=1.0.0
# numpy>=1.24.0

Create your scorer file

Create a new Python file for your scorer. Inherit from TraceCustomScorer and implement the score method. You must specify a response type (Binary, Categorical, or Numeric) as a generic parameter.

my_trace_scorer.py
from judgeval.v1.data import Trace
from judgeval.v1.hosted import TraceCustomScorer, NumericResponse

class ToolCallScorer(TraceCustomScorer[NumericResponse]):
    async def score(self, data: Trace) -> NumericResponse:
        tool_calls = [span for span in data if span["span_kind"] == "tool"]

        return NumericResponse(
            value=float(len(tool_calls)),
            reason=f"Agent made {len(tool_calls)} tool call(s)."
        )

The score method:

  • Takes a Trace (list of TraceSpan objects) as input
  • Returns one of three response types:
    • BinaryResponse: For true/false evaluations with a value (bool) and reason (string)
    • CategoricalResponse: For categorical evaluations with a value (string) and reason (string)
    • NumericResponse: For numeric evaluations with a value (float) and reason (string)
  • All response types support optional citations to reference specific spans in your trace

See Trace for all available TraceSpan properties.
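The same trace data can also back other response types. As a hypothetical variant, the sketch below returns a BinaryResponse and relies only on the span_kind property shown above.

from judgeval.v1.data import Trace
from judgeval.v1.hosted import TraceCustomScorer, BinaryResponse

class UsedToolScorer(TraceCustomScorer[BinaryResponse]):
    async def score(self, data: Trace) -> BinaryResponse:
        # Flag traces in which the agent never invoked a tool.
        used_tool = any(span["span_kind"] == "tool" for span in data)

        if used_tool:
            return BinaryResponse(
                value=True,
                reason="At least one tool span was found in the trace."
            )

        return BinaryResponse(
            value=False,
            reason="No tool spans were found in the trace."
        )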

Create a requirements file

Create a requirements.txt file with any dependencies your scorer needs.

requirements.txt
# Add any dependencies your scorer needs
# openai>=1.0.0
# numpy>=1.24.0

If your custom scorer reads environment variables, you can set them on the Judgment platform once you have uploaded your scorer. Refer to the Set Environment Variables section below for more information.
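For instance, a scorer that uses an LLM as a judge would typically read its API key from an environment variable. The sketch below is illustrative only: the PolitenessJudge class, the OPENAI_API_KEY variable, and the model choice are assumptions, and it expects openai to be listed in your requirements.txt.

import os

from openai import AsyncOpenAI
from judgeval.v1.data import Example
from judgeval.v1.hosted import ExampleCustomScorer, BinaryResponse

class PolitenessJudge(ExampleCustomScorer[BinaryResponse]):
    async def score(self, data: Example) -> BinaryResponse:
        # OPENAI_API_KEY is set via the Judgment platform (see below).
        client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
        actual_output = data.get_property("actual_output")

        completion = await client.chat.completions.create(
            model="gpt-4o-mini",  # assumed model; use whichever judge model you prefer
            messages=[{
                "role": "user",
                "content": f"Answer YES or NO: is the following response polite?\n\n{actual_output}",
            }],
        )
        verdict = (completion.choices[0].message.content or "").strip().upper()

        return BinaryResponse(
            value=verdict.startswith("YES"),
            reason=f"LLM judge verdict: {verdict}"
        )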

Upload Your Scorer

Once you've implemented your custom scorer, upload it to Judgment using the CLI.

Install the CLI

The CLI is included with the judgeval package:

pip install judgeval

Set your credentials

Set your Judgment API key and organization ID as environment variables:

export JUDGMENT_API_KEY="your-api-key"
export JUDGMENT_ORG_ID="your-org-id"

Upload the scorer

judgeval upload-scorer my_scorer.py -r requirements.txt -p my_project

You can also provide a custom name for the scorer:

judgeval upload-scorer my_scorer.py -r requirements.txt -n "Resolution Scorer" -p my_project

To overwrite an existing scorer with the same name:

judgeval upload-scorer my_scorer.py -r requirements.txt -o -p my_project

CLI Options

scorer_file_path (str, required)
Path to the Python file containing your scorer class.

--requirement, -r (str)
Path to the requirements.txt file with dependencies.

--name, -n (str)
Custom name for the scorer. Auto-detected from the class name if not provided.

--project, -p (str)
The project to upload the scorer to.

--overwrite, -o (flag)
Overwrite the scorer if it already exists.
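These options can be combined. For example, to upload with an explicit name and overwrite any existing scorer of that name:

judgeval upload-scorer my_scorer.py -r requirements.txt -n "Resolution Scorer" -p my_project -o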

Set Environment Variables

You can also set environment variables for your custom scorer on the Judgment platform.

Navigate to the Scorers page within your project and click on the Custom Scorers tab

Custom Scorers Page

Click on the custom scorer you would like to add environment variables to

Custom Scorer Page

Click on the "Environment variables" button on the top right of the page

Custom Scorer Env Vars Dialog

Enter the environment variable and click the "Add" button. Repeat for every environment variable your custom scorer needs.

Use Your Uploaded Scorer

After uploading, you can use your custom scorer in evaluations by retrieving it through the client factory and running it within a traced function.

Use async_evaluate to run an ExampleCustomScorer on a specific example:

from judgeval import Judgeval
from judgeval.v1.data import Example

judgeval = Judgeval(project_name="my_project")

# Retrieve your uploaded scorer by name
scorer = judgeval.scorers.custom_scorer.get(name="Resolution Scorer")

# Create a tracer
tracer = judgeval.tracer.create()

@tracer.observe(span_type="function")
def my_agent(question: str) -> str:
    # Your agent logic here
    response = "Your package will arrive tomorrow at 10:00 AM."
    
    # Evaluate the response asynchronously
    tracer.async_evaluate(
        scorer=scorer,
        example=Example.create(
            input=question,
            actual_output=response
        )
    )
    
    return response

# Run your agent
result = my_agent("Where is my package?")

Use async_trace_evaluate to run a TraceCustomScorer on the current trace:

from judgeval import Judgeval

judgeval = Judgeval(project_name="my_project")

# Retrieve your uploaded trace scorer by name
scorer = judgeval.scorers.custom_scorer.get(name="ToolCallScorer")

# Create a tracer
tracer = judgeval.tracer.create()

@tracer.observe(span_type="tool")
def search_database(query: str) -> str:
    # Tool logic here
    return f"Results for: {query}"

@tracer.observe(span_type="function")
def my_agent(question: str) -> str:
    # Your agent logic with tool calls
    search_result = search_database(question)
    response = f"Based on my search: {search_result}"
    
    # Evaluate the entire trace asynchronously
    tracer.async_trace_evaluate(scorer=scorer)
    
    return response

# Run your agent
result = my_agent("What products are available?")

Full Example

Here's a complete example of creating, uploading, and using a custom scorer, first with an ExampleCustomScorer and then with a TraceCustomScorer:

Create the scorer

sentiment_scorer.py
from judgeval.v1.data import Example
from judgeval.v1.hosted import ExampleCustomScorer, CategoricalResponse

class SentimentScorer(ExampleCustomScorer[CategoricalResponse]):
    async def score(self, data: Example) -> CategoricalResponse:
        actual_output = data.get_property("actual_output")

        positive_words = ["great", "excellent", "happy", "pleased", "thank"]
        negative_words = ["sorry", "unfortunately", "cannot", "unable", "problem"]

        positive_count = sum(1 for word in positive_words if word in actual_output.lower())
        negative_count = sum(1 for word in negative_words if word in actual_output.lower())

        if positive_count > negative_count:
            return CategoricalResponse(
                value="positive",
                reason=f"Response has positive sentiment ({positive_count} positive, {negative_count} negative words)."
            )
        elif negative_count > positive_count:
            return CategoricalResponse(
                value="negative",
                reason=f"Response has negative sentiment ({positive_count} positive, {negative_count} negative words)."
            )

        return CategoricalResponse(
            value="neutral",
            reason="Response has neutral sentiment."
        )

Create requirements.txt

requirements.txt
# No external dependencies needed for this scorer

Upload the scorer

judgeval upload-scorer sentiment_scorer.py -r requirements.txt -n "Sentiment Scorer" -p sentiment_analysis

Use the scorer in an evaluation

from judgeval import Judgeval
from judgeval.v1.data import Example

judgeval = Judgeval(project_name="sentiment_analysis")

# Retrieve the uploaded scorer
scorer = judgeval.scorers.custom_scorer.get(name="Sentiment Scorer")

# Create a tracer
tracer = judgeval.tracer.create()

@tracer.observe(span_type="function")
def customer_support_agent(question: str) -> str:
    # Your agent logic here
    response = "I had a great experience! The service was excellent."
    
    # Evaluate the response
    tracer.async_evaluate(
        scorer=scorer,
        example=Example.create(
            input=question,
            actual_output=response
        )
    )
    
    return response

# Run your agent
result = customer_support_agent("How was your experience?")

Create the scorer

tool_call_scorer.py
from judgeval.v1.data import Trace
from judgeval.v1.hosted import TraceCustomScorer, NumericResponse

class ToolCallScorer(TraceCustomScorer[NumericResponse]):
    async def score(self, data: Trace) -> NumericResponse:
        tool_calls = [span for span in data if span["span_kind"] == "tool"]

        return NumericResponse(
            value=float(len(tool_calls)),
            reason=f"Agent made {len(tool_calls)} tool call(s)."
        )

Create requirements.txt

requirements.txt
# No external dependencies needed for this scorer

Upload the scorer

judgeval upload-scorer tool_call_scorer.py -r requirements.txt -n "Tool Call Scorer" -p agent_monitoring

Use the scorer in an evaluation

from judgeval import Judgeval

judgeval = Judgeval(project_name="agent_monitoring")

# Retrieve the uploaded trace scorer
scorer = judgeval.scorers.custom_scorer.get(name="Tool Call Scorer")

# Create a tracer
tracer = judgeval.tracer.create()

@tracer.observe(span_type="tool")
def search_products(query: str) -> str:
    return f"Found 5 products matching: {query}"

@tracer.observe(span_type="function")
def shopping_agent(question: str) -> str:
    # Agent makes tool calls
    products = search_products(question)
    response = f"Here are the results: {products}"
    
    # Evaluate the entire trace
    tracer.async_trace_evaluate(scorer=scorer)
    
    return response

# Run your agent
result = shopping_agent("Show me laptops under $1000")