Evaluation

Score a batch of examples using hosted scorers or custom judges.

Create an Evaluation via client.evaluation.create(), then call .run() to execute scorers against your examples.

Two modes are supported:

  • Hosted scorers -- pass scorer names as strings (e.g. "faithfulness", "answer_relevancy"). Evaluation runs server-side on the Judgment platform.
  • Custom judges -- pass Judge subclass instances for in-process evaluation with your own scoring logic.

Using hosted scorers:

evaluation = client.evaluation.create()
results = evaluation.run(
    examples=examples,
    scorers=["faithfulness", "answer_relevancy"],
    eval_run_name="nightly-eval",
)
for result in results:
    print(result.success, result.scorers_data)

Using a custom judge:

evaluation = client.evaluation.create()
results = evaluation.run(
    examples=examples,
    scorers=[ToxicityJudge()],
    eval_run_name="toxicity-check",
)

__init__()

def __init__(client, project_id, project_name):

Parameters

client (required): JudgmentSyncClient

project_id (required): Optional[str]

project_name (required): str

run()

Run scorers against your examples and return results.

Pass either hosted scorer names (strings) or custom Judge instances. Mixing both in one call is not supported.

results = evaluation.run(
    examples=[
        Example.create(
            input="What is Python?",
            actual_output="A programming language.",
            expected_output="A high-level programming language.",
        ),
    ],
    scorers=["answer_relevancy"],
    eval_run_name="quick-test",
)
print(results[0].success)  # True/False

def run(examples, scorers, eval_run_name, assert_test=False, timeout_seconds=300) -> typing.List:

Parameters

examples (required): List[Example]

The Example objects to evaluate.

scorers (required): Union[List[str], List[Judge]]

Hosted scorer names (e.g. ["faithfulness"]) or Judge instances (e.g. [ToxicityJudge()]).
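
Because a single call accepts either scorer names or Judge instances but not both, a small pre-flight check can catch a mixed list before run() is called. The following is only a sketch: scorer_mode is a hypothetical helper, not part of the SDK, and it assumes that any non-string entry is a Judge instance.

```python
from typing import Sequence


def scorer_mode(scorers: Sequence[object]) -> str:
    """Classify a scorers list as "hosted" or "custom".

    Hypothetical helper (not an SDK function). Assumes every
    non-string entry is a custom Judge instance.
    """
    if not scorers:
        raise ValueError("scorers must not be empty")

    # Count how many entries are hosted scorer names (plain strings).
    name_count = sum(isinstance(s, str) for s in scorers)

    if name_count == len(scorers):
        return "hosted"
    if name_count == 0:
        return "custom"
    # Some strings and some non-strings: a mixed list.
    raise ValueError("scorers mixes hosted names and Judge instances")
```

Calling scorer_mode(scorers) just before evaluation.run(...) gives an early, local error for a mixed list instead of relying on whatever the library does in that case.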

eval_run_name (required): str

A name for this run, visible in the dashboard.

assert_test: bool

If True, raises an exception when any scorer fails its threshold. Useful in CI/CD pipelines.

Default: False
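
One way to use assert_test in a CI job is to turn the raised exception into a non-zero exit code. The sketch below assumes nothing about the exception type, which this page does not name, so it catches Exception broadly. run_gate is a hypothetical wrapper, not an SDK function.

```python
import sys
from typing import Callable


def run_gate(run_eval: Callable[[], object]) -> int:
    """Return 0 if run_eval completes, 1 if it raises.

    Hypothetical CI wrapper. The exception raised by
    assert_test=True is not documented here, so this
    catches Exception broadly (an assumption).
    """
    try:
        run_eval()
    except Exception as exc:
        print(f"evaluation failed: {exc}", file=sys.stderr)
        return 1
    return 0
```

In a CI script this could be invoked as sys.exit(run_gate(lambda: evaluation.run(examples=examples, scorers=scorers, eval_run_name="ci", assert_test=True))), where examples and scorers are defined as in the examples above.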

timeout_seconds: int

Maximum seconds to wait for hosted scorer results before timing out.

Default: 300

Returns

typing.List - A list of ScoringResult objects, one per example.
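
Since the returned list has one entry per example, it can be summarised with ordinary list processing. The sketch below relies only on the success attribute used in the examples on this page. pass_rate is a hypothetical helper, and SimpleNamespace objects stand in for real ScoringResult instances.

```python
from types import SimpleNamespace
from typing import Iterable


def pass_rate(results: Iterable[object]) -> float:
    """Fraction of results whose .success is truthy (0.0 for no results).

    Hypothetical helper; only the .success attribute is assumed.
    """
    items = list(results)
    if not items:
        return 0.0
    passed = sum(1 for r in items if r.success)
    return passed / len(items)


# Stand-in objects for illustration; real code would pass the
# list returned by evaluation.run(...).
fake_results = [
    SimpleNamespace(success=True),
    SimpleNamespace(success=False),
]
print(pass_rate(fake_results))  # 0.5
```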