Judge
A custom scorer class for creating specialized evaluation logic. The Judge class is generic and must be parameterized with a response type (BinaryResponse, CategoricalResponse, or NumericResponse).
The score method always receives an Example object. For trace-level scoring, access the trace data through data.trace.
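As a rough sketch of this contract, a stand-in Example (not the library's class, for illustration only) shows how the same score() entry point serves both cases:

```python
# Minimal stand-in for the library's Example type, for illustration only;
# real Examples come from judgeval and carry richer data.
class Example:
    def __init__(self, trace=None, **props):
        self._props = props
        self.trace = trace  # None unless the example wraps a trace

    def get_property(self, name):
        return self._props.get(name)

# Example-level scoring reads properties; trace-level scoring reads .trace.
ex = Example(actual_output="4", expected_output="4")
print(ex.get_property("actual_output"), ex.trace)  # 4 None
```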
Generic Parameters
R required
: BinaryResponse | CategoricalResponse | NumericResponse
The response type that the scorer returns. Determines the structure of the scoring result.
Methods
score required
: async def
Measures the score on an example. Must be implemented by subclasses. Returns a typed response based on the generic parameter.
async def score(self, data: Example) -> BinaryResponse:
    # Custom scoring logic here
    return BinaryResponse(value=True, reason="...")

Response Types
All response types include a value, reason, and optional citations field:
- BinaryResponse: For true/false evaluations
  - value (bool): The boolean result
  - reason (str): Explanation for the result
  - citations (Optional[List[Citation]]): References to specific trace spans
- CategoricalResponse: For categorical evaluations
  - value (str): The category name
  - reason (str): Explanation for the result
  - citations (Optional[List[Citation]]): References to specific trace spans
- NumericResponse: For numeric evaluations
  - value (float): The numeric score
  - reason (str): Explanation for the result
  - citations (Optional[List[Citation]]): References to specific trace spans
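The shared shape of the three response types can be sketched with plain dataclasses (these are stand-ins, not the library classes; the Citation field shown is a hypothetical placeholder):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Citation:
    span_id: str  # hypothetical field; the real Citation schema is not shown here

@dataclass
class BinaryResponse:
    value: bool  # only the value type differs across the three response types
    reason: str
    citations: Optional[List[Citation]] = None

@dataclass
class NumericResponse:
    value: float  # same shape, numeric value
    reason: str
    citations: Optional[List[Citation]] = None

r = BinaryResponse(value=True, reason="Matched expected output.")
print(r.value, r.citations)  # True None
```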
Usage
Example-Level Scoring
from judgeval import Judgeval
from judgeval.v1.judges import Judge, BinaryResponse
from judgeval.v1.data import Example

class CorrectnessScorer(Judge[BinaryResponse]):
    async def score(self, data: Example) -> BinaryResponse:
        actual_output = data.get_property("actual_output")
        expected_output = data.get_property("expected_output")
        if expected_output in actual_output:
            return BinaryResponse(
                value=True,
                reason="The answer contains the expected output."
            )
        return BinaryResponse(
            value=False,
            reason="The answer does not contain the expected output."
        )

client = Judgeval(project_name="my_project")

examples = [
    Example.create(input="What is 2+2?", actual_output="4", expected_output="4"),
    Example.create(input="What is 2+2?", actual_output="5", expected_output="4"),
]

results = client.evaluation.create().run(
    examples=examples,
    scorers=[CorrectnessScorer()],
    eval_run_name="correctness_run"
)

Trace-Level Scoring
For trace-level scoring, access data.trace on the Example object. See Trace for all available TraceSpan properties.
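Before the full example, the core filtering step can be sketched with plain dictionaries (the "span_kind" key and "tool" value mirror the assumptions in the example that follows):

```python
# Toy span data shaped like the dicts a trace-level scorer iterates over.
spans = [
    {"span_kind": "tool", "function": "search"},
    {"span_kind": "llm"},
    {"span_kind": "tool", "function": "calculator"},
]

# Keep only tool-call spans, the way ToolCallScorer does.
tool_calls = [s for s in spans if s.get("span_kind") == "tool"]
print(len(tool_calls))  # 2
```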
from judgeval.v1.judges import Judge, NumericResponse
from judgeval.v1.data import Example

class ToolCallScorer(Judge[NumericResponse]):
    async def score(self, data: Example) -> NumericResponse:
        if not data.trace or not data.trace.spans:
            return NumericResponse(
                value=0.0,
                reason="No trace data available."
            )
        tool_calls = [span for span in data.trace.spans if span.get("span_kind") == "tool"]
        return NumericResponse(
            value=float(len(tool_calls)),
            reason=f"Agent made {len(tool_calls)} tool call(s)."
        )

Implement and run locally
Pass an Example-level scorer directly to evaluation.create().run() to score examples locally without uploading to the platform:
from judgeval import Judgeval
from judgeval.v1.data import Example
from judgeval.v1.judges import Judge, BinaryResponse

class CorrectnessScorer(Judge[BinaryResponse]):
    async def score(self, data: Example) -> BinaryResponse:
        actual_output = data["actual_output"]
        expected_output = data["expected_output"]
        if expected_output in actual_output:
            return BinaryResponse(
                value=True,
                reason="The answer contains the expected output."
            )
        return BinaryResponse(
            value=False,
            reason="The answer does not contain the expected output."
        )

client = Judgeval(project_name="my_project")

examples = [
    Example.create(input="What is 2+2?", actual_output="4", expected_output="4"),
    Example.create(input="What is 2+2?", actual_output="5", expected_output="4"),
]

results = client.evaluation.create().run(
    examples=examples,
    scorers=[CorrectnessScorer()],
    eval_run_name="correctness_run"
)