Code Judges

Score your agent behavior using code and LLMs.

judgeval provides abstractions for implementing judges entirely in code, giving you full flexibility in your scoring logic and use cases. You can use any combination of code, custom LLMs as a judge, and library dependencies.

Currently, Python is the only supported language for creating Code Judges.

Implement a Code Judge

Create a Code Judge by inheriting from Judge and implementing the score method. You must specify a response type (BinaryResponse, CategoricalResponse, or NumericResponse) as a generic parameter.

The score method receives an Example object, which can optionally carry a .trace for accessing trace data.

Additionally, judgeval allows you to upload scorers consisting of multiple files. Ensure that your entrypoint file contains the scorer class definition.

Create your judge file

You can use the CLI to generate skeleton code for your judge:

judgeval scorer init --response-type binary --name ResolutionScorer

This creates a resolution_scorer.py file with the appropriate boilerplate. Use --response-type to select binary, categorical, or numeric.

CLI Options

--response-type, -t (required)

:str

Response type for the judge: binary, categorical, or numeric.

--name, -n (required)

:str

Class name for the generated judge (must be a valid Python identifier).

--init-path, -p

:str

Directory to write the generated file to. Defaults to the current directory.

--include-requirements, -r

:flag

Also create an empty requirements.txt file in the same directory.

--yes, -y

:flag

Skip the confirmation prompt.

If you prefer to write it manually, here's an example:

my_scorer.py
from judgeval.v1.judges import Judge, BinaryResponse
from judgeval.v1.data import Example

class ResolutionScorer(Judge[BinaryResponse]):
    async def score(self, data: Example) -> BinaryResponse:
        actual_output = data.get_property("actual_output")

        if "package" in actual_output:
            return BinaryResponse(
                value=True,
                reason="The response contains package information."
            )

        return BinaryResponse(
            value=False,
            reason="The response does not contain package information."
        )

The score method:

  • Takes an Example as input
  • Can access trace data via data.trace and data.trace.spans when available
  • Returns one of three response types:
    • BinaryResponse: For true/false evaluations with a value (bool) and reason (string)
    • CategoricalResponse: For categorical evaluations with a value (string) and reason (string)
    • NumericResponse: For numeric evaluations with a value (float) and reason (string)
  • All response types support optional citations to reference specific spans in your trace
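The examples in this guide cover the binary and numeric cases; a categorical judge follows the same shape, with score returning a CategoricalResponse whose value is a string. The selection logic can be sketched standalone (the category names and keyword checks below are illustrative assumptions, not part of judgeval):

```python
def classify_resolution(actual_output: str) -> tuple[str, str]:
    """Pick a (value, reason) pair for a CategoricalResponse.

    Inside a Judge[CategoricalResponse] subclass, score() would wrap
    the returned pair as CategoricalResponse(value=..., reason=...).
    """
    lowered = actual_output.lower()
    if "package" in lowered and "arrive" in lowered:
        return "resolved", "The response gives delivery information."
    if "package" in lowered:
        return "partial", "The response mentions the package without a resolution."
    return "unresolved", "The response does not address the package."
```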

Here's an example that uses trace data to score based on span information:

my_trace_scorer.py
from judgeval.v1.judges import Judge, NumericResponse
from judgeval.v1.data import Example

class ToolCallScorer(Judge[NumericResponse]):
    async def score(self, data: Example) -> NumericResponse:
        if not data.trace or not data.trace.spans:
            return NumericResponse(
                value=0.0,
                reason="No trace data available."
            )

        tool_calls = [span for span in data.trace.spans if span.get("span_kind") == "tool"]

        return NumericResponse(
            value=float(len(tool_calls)),
            reason=f"Agent made {len(tool_calls)} tool call(s)."
        )
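The span filter above uses dict-style .get() access, so the counting logic can be exercised against plain dictionaries without a live trace:

```python
def count_tool_calls(spans: list[dict]) -> int:
    # Mirrors the filter in ToolCallScorer: keep spans whose
    # span_kind marks them as tool invocations, then count them.
    return sum(1 for span in spans if span.get("span_kind") == "tool")
```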

Create a requirements file

Create a requirements.txt file with any dependencies your scorer needs. If you used judgeval scorer init with the --include-requirements flag to generate your judge, this file will already be created for you.

requirements.txt
# Add any dependencies your scorer needs
# openai>=1.0.0
# numpy>=1.24.0

If your Code Judge reads environment variables, you can set them on the Judgment platform once you have uploaded your judge. Refer to the Set Environment Variables section below for more information.
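Inside the judge, reading such a variable is ordinary os.environ access; a minimal sketch, where the MODEL_NAME variable and its default are hypothetical:

```python
import os

def resolve_model_name() -> str:
    # Read a platform-configured environment variable, falling back
    # to a default when it is not set (e.g. during local runs).
    return os.environ.get("MODEL_NAME", "gpt-4o-mini")
```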

Run Locally

Judge instances can be passed directly to evaluation.create().run() without uploading to the platform. Judges execute locally on your machine in parallel, and results are saved to your project on the Judgment platform.

local_scoring.py
from judgeval import Judgeval
from judgeval.v1.data import Example
from judgeval.v1.judges import Judge, BinaryResponse

class ResolutionScorer(Judge[BinaryResponse]):
    async def score(self, data: Example) -> BinaryResponse:
        actual_output = data.get_property("actual_output")

        if "package" in actual_output:
            return BinaryResponse(
                value=True,
                reason="The response contains package information."
            )

        return BinaryResponse(
            value=False,
            reason="The response does not contain package information."
        )

client = Judgeval(project_name="default_project")

examples = [
    Example.create(
        input="Where is my package?",
        actual_output="Your package will arrive tomorrow at 10:00 AM."
    ),
    Example.create(
        input="Where is my package?",
        actual_output="I don't know."
    )
]

results = client.evaluation.create().run(
    examples=examples,
    scorers=[ResolutionScorer()],
    eval_run_name="local_scoring_run"
)

You cannot mix local and hosted scorers in the same evaluation.create().run() call. Use separate calls if you need both.
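The substring check in ResolutionScorer is deterministic, so you can predict how the two examples will score before running. Shown standalone:

```python
def contains_package_info(actual_output: str) -> bool:
    # The same check ResolutionScorer.score applies to actual_output.
    return "package" in actual_output
```

Given the examples above, the first should score True and the second False.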


Upload Your Judge

Once you've implemented your Code Judge, upload it to Judgment using the CLI.

Install the CLI

The CLI is included with the judgeval package:

pip install judgeval

Set your credentials

Set your Judgment API key and organization ID as environment variables:

export JUDGMENT_API_KEY="your-api-key"
export JUDGMENT_ORG_ID="your-org-id"

Upload the judge

The entrypoint file is the file that contains the scorer class definition. It should be passed as an argument to the judgeval scorer upload command:

judgeval scorer upload my_scorer.py -p my_project

You can also include additional files or directories needed by your judge:

judgeval scorer upload my_scorer.py -i utils/ -i shared_helpers.py -p my_project

To provide a requirements file and a custom name:

judgeval scorer upload my_scorer.py -r requirements.txt -n "Resolution Scorer" -p my_project

To bump the major version of an existing judge:

judgeval scorer upload my_scorer.py -m -p my_project

CLI Options

entrypoint_path (required)

:str

Path to the entrypoint Python file containing your judge class.

--requirements, -r

:str

Path to the requirements.txt file with dependencies.

--included-files, -i

:str (repeatable)

Path to additional files or directories to bundle with your judge. If a directory is provided, all non-ignored files in that directory are included. Can be specified multiple times.

--name, -n

:str

Custom name for the judge. Auto-detected from the class name if not provided.

--project, -p

:str

The project to upload the judge to.

--bump-major, -m

:flag

Bump the major version of the judge. If unspecified, the minor version will be bumped.

--yes, -y

:flag

Skip the upload confirmation prompt.

Set Environment Variables

You can also set environment variables for your Code Judge on the Judgment platform.

Navigate to the Judges page within your project.

Code Judges Page

Click the Code Judge you want to add environment variables to, then click the "Environment variables" button at the top right of the page.

Code Judge Env Vars Dialog

Enter the environment variable and click the "Add" button. Repeat for every environment variable your Code Judge needs.

Use Your Uploaded Judge

After uploading, you can use your Code Judge in online evaluations by calling Tracer.async_evaluate() within a traced function. Pass the judge name as a string.

from judgeval import Tracer

Tracer.init(project_name="my_project")

@Tracer.observe(span_type="function")
def my_agent(question: str) -> str:
    response = "Your package will arrive tomorrow at 10:00 AM."

    Tracer.async_evaluate(
        judge="Resolution Scorer",
        example={
            "input": question,
            "actual_output": response,
        },
    )

    return response

result = my_agent("Where is my package?")