Evaluation
Score a batch of examples using hosted scorers or custom judges.
Create an Evaluation via client.evaluation.create(), then call
.run() to execute scorers against your examples.
Two modes are supported:
- Hosted scorers -- pass scorer names as strings (e.g. "faithfulness", "answer_relevancy"). Evaluation runs server-side on the Judgment platform.
- Custom judges -- pass Judge subclass instances for in-process evaluation with your own scoring logic.
Using hosted scorers:
evaluation = client.evaluation.create()
results = evaluation.run(
    examples=examples,
    scorers=["faithfulness", "answer_relevancy"],
    eval_run_name="nightly-eval",
)
for result in results:
    print(result.success, result.scorers_data)
Using a custom judge:
evaluation = client.evaluation.create()
results = evaluation.run(
    examples=examples,
    scorers=[ToxicityJudge()],
    eval_run_name="toxicity-check",
)
__init__()
def __init__(client, project_id, project_name):
Parameters
client
required:JudgmentSyncClient
project_id
required:Optional[str]
project_name
required:str
run()
Run scorers against your examples and return results.
Pass either hosted scorer names (strings) or custom Judge
instances. Mixing both in one call is not supported.
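Because a mixed list is not supported, it can help to check the scorers list before calling run(). The following is a minimal sketch, not part of the SDK: the Judge class here is a local stand-in for the real base class, scorer_mode is a hypothetical helper name, and rejecting an empty list is an assumption made for illustration.

```python
from typing import Sequence, Union


class Judge:
    """Local stand-in for the SDK's Judge base class (assumption, illustration only)."""


def scorer_mode(scorers: Sequence[Union[str, Judge]]) -> str:
    """Return "hosted" or "custom"; raise if the list is empty or mixed.

    Hypothetical helper: the SDK may perform its own validation.
    """
    if not scorers:
        # Assumption: an empty list is treated as a caller error here.
        raise ValueError("scorers must not be empty")
    if all(isinstance(s, str) for s in scorers):
        return "hosted"
    if all(isinstance(s, Judge) for s in scorers):
        return "custom"
    raise TypeError("pass either scorer names or Judge instances, not both")
```

With the real SDK, the isinstance check would target the imported Judge class instead of the stand-in.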
results = evaluation.run(
    examples=[
        Example.create(
            input="What is Python?",
            actual_output="A programming language.",
            expected_output="A high-level programming language.",
        ),
    ],
    scorers=["answer_relevancy"],
    eval_run_name="quick-test",
)
print(results[0].success)  # True/False
def run(examples, scorers, eval_run_name, assert_test=False, timeout_seconds=300) -> typing.List:
Parameters
examples
required:List[Example]
The Example objects to evaluate.
scorers
required:Union[List[str], List[Judge]]
Hosted scorer names (e.g. ["faithfulness"]) or
Judge instances (e.g. [ToxicityJudge()]).
eval_run_name
required:str
A name for this run, visible in the dashboard.
assert_test
optional:bool
If True, raises an exception when any scorer fails its threshold. Useful in CI/CD pipelines.
Default: False
timeout_seconds
optional:int
Maximum seconds to wait for hosted scorer results before timing out.
Default: 300
Returns
typing.List - A list of ScoringResult objects, one per example.
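Since the return value is a plain list, it can be summarized with ordinary Python. The sketch below is illustrative only: FakeScoringResult is a local stand-in that models just the success and scorers_data attributes used in the examples on this page, and failed_indices and pass_rate are hypothetical helper names. It also assumes results come back in the same order as the examples passed in, which this page does not state.

```python
from dataclasses import dataclass, field
from typing import Any, List, Sequence


@dataclass
class FakeScoringResult:
    """Stand-in for ScoringResult (assumption: only these two attributes are modeled)."""
    success: bool
    scorers_data: List[Any] = field(default_factory=list)


def failed_indices(results: Sequence[Any]) -> List[int]:
    """Positions of results whose success flag is False."""
    return [i for i, r in enumerate(results) if not r.success]


def pass_rate(results: Sequence[Any]) -> float:
    """Fraction of results that succeeded; 0.0 for an empty list."""
    if not results:
        return 0.0
    return sum(1 for r in results if r.success) / len(results)
```

Both helpers rely only on the success attribute, so they should work unchanged on the list returned by run() if that attribute behaves as shown above.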