Judge
A custom scorer class for creating specialized evaluation logic. The Judge class is generic and must be parameterized with a response type (BinaryResponse, CategoricalResponse, or NumericResponse).
The score method always receives an Example object. For trace-level scoring, access the trace data through data.trace.
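As a rough sketch of this contract, a stand-in Example (not the library's class, for illustration only) shows how the same score() entry point serves both cases:

```python
# Minimal stand-in for the library's Example type, for illustration only;
# real Examples come from judgeval and carry richer data.
class Example:
    def __init__(self, trace=None, **props):
        self._props = props
        self.trace = trace  # None unless the example wraps a trace

    def get_property(self, name):
        return self._props.get(name)

# Example-level scoring reads properties; trace-level scoring reads .trace.
ex = Example(actual_output="4", expected_output="4")
print(ex.get_property("actual_output"), ex.trace)  # 4 None
```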
Generic Parameters
R required
: BinaryResponse | CategoricalResponse | NumericResponse
The response type that the scorer returns. Determines the structure of the scoring result.
Methods
score required
: async def
Measures the score on an example. Must be implemented by subclasses. Returns a typed response based on the generic parameter.
async def score(self, data: Example) -> BinaryResponse:
    # Custom scoring logic here
    return BinaryResponse(value=True, reason="...")

Response Types
All response types include a value, reason, and optional citations field:
- BinaryResponse: For true/false evaluations
  - value (bool): The boolean result
  - reason (str): Explanation for the result
  - citations (Optional[List[Citation]]): References to specific trace spans
- CategoricalResponse: For categorical evaluations
  - value (str): The category name
  - reason (str): Explanation for the result
  - citations (Optional[List[Citation]]): References to specific trace spans
- NumericResponse: For numeric evaluations
  - value (float): The numeric score
  - reason (str): Explanation for the result
  - citations (Optional[List[Citation]]): References to specific trace spans
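The shared shape of the three response types can be sketched with plain dataclasses (these are stand-ins, not the library classes; the Citation field shown is a hypothetical placeholder):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Citation:
    span_id: str  # hypothetical field; the real Citation schema is not shown here

@dataclass
class BinaryResponse:
    value: bool  # only the value type differs across the three response types
    reason: str
    citations: Optional[List[Citation]] = None

@dataclass
class NumericResponse:
    value: float  # same shape, numeric value
    reason: str
    citations: Optional[List[Citation]] = None

r = BinaryResponse(value=True, reason="Matched expected output.")
print(r.value, r.citations)  # True None
```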
Usage
Example-Level Scoring
from judgeval import Judgeval
from judgeval.v1.judges import Judge, BinaryResponse
from judgeval.v1.data import Example

class CorrectnessScorer(Judge[BinaryResponse]):
    async def score(self, data: Example) -> BinaryResponse:
        actual_output = data.get_property("actual_output")
        expected_output = data.get_property("expected_output")
        if expected_output in actual_output:
            return BinaryResponse(
                value=True,
                reason="The answer contains the expected output."
            )
        return BinaryResponse(
            value=False,
            reason="The answer does not contain the expected output."
        )

client = Judgeval(project_name="my_project")

examples = [
    Example.create(input="What is 2+2?", actual_output="4", expected_output="4"),
    Example.create(input="What is 2+2?", actual_output="5", expected_output="4"),
]

results = client.evaluation.create().run(
    examples=examples,
    scorers=[CorrectnessScorer()],
    eval_run_name="correctness_run"
)

Trace-Level Scoring
For trace-level scoring, access data.trace on the Example object. See Trace for all available TraceSpan properties.
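Before the full example, the core filtering step can be sketched with plain dictionaries (the "span_kind" key and "tool" value mirror the assumptions in the example that follows):

```python
# Toy span data shaped like the dicts a trace-level scorer iterates over.
spans = [
    {"span_kind": "tool", "function": "search"},
    {"span_kind": "llm"},
    {"span_kind": "tool", "function": "calculator"},
]

# Keep only tool-call spans, the way ToolCallScorer does.
tool_calls = [s for s in spans if s.get("span_kind") == "tool"]
print(len(tool_calls))  # 2
```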
from judgeval.v1.judges import Judge, NumericResponse
from judgeval.v1.data import Example

class ToolCallScorer(Judge[NumericResponse]):
    async def score(self, data: Example) -> NumericResponse:
        if not data.trace or not data.trace.spans:
            return NumericResponse(
                value=0.0,
                reason="No trace data available."
            )
        tool_calls = [span for span in data.trace.spans if span.get("span_kind") == "tool"]
        return NumericResponse(
            value=float(len(tool_calls)),
            reason=f"Agent made {len(tool_calls)} tool call(s)."
        )

Implement and run locally
Pass an Example-level scorer directly to evaluation.create().run() to score examples locally without uploading to the platform:
from judgeval import Judgeval
from judgeval.v1.data import Example
from judgeval.v1.judges import Judge, BinaryResponse

class CorrectnessScorer(Judge[BinaryResponse]):
    async def score(self, data: Example) -> BinaryResponse:
        actual_output = data["actual_output"]
        expected_output = data["expected_output"]
        if expected_output in actual_output:
            return BinaryResponse(
                value=True,
                reason="The answer contains the expected output."
            )
        return BinaryResponse(
            value=False,
            reason="The answer does not contain the expected output."
        )

client = Judgeval(project_name="my_project")

examples = [
    Example.create(input="What is 2+2?", actual_output="4", expected_output="4"),
    Example.create(input="What is 2+2?", actual_output="5", expected_output="4"),
]

results = client.evaluation.create().run(
    examples=examples,
    scorers=[CorrectnessScorer()],
    eval_run_name="correctness_run"
)