OfflineTestsFactory
Create test configs and execute offline test runs.
Access this via client.offline_tests -- you don't instantiate it
directly. A test config pairs a dataset with a set of platform
judges; a test run evaluates one dataset version with pinned judge
versions and stores per-example results.
Create a config and run it:
config = client.offline_tests.create_config(
    name="nightly-regression",
    dataset="golden-set",
    judges=["helpfulness", "faithfulness"],
)
result = client.offline_tests.run(test_config="nightly-regression")
print(result.ui_results_url)
Agent testing with a pass condition:
def my_agent(input: str) -> str:
    return agent.invoke(input)
result = client.offline_tests.run(
    test_config="nightly-regression",
    agent_function=my_agent,
    pass_condition_fn=lambda fields, scorers: all(
        s.value != "No" for s in scorers
    ),
    assert_test=True,
)
__init__()
def __init__(client, project_id, project_name):
Parameters
client
required:JudgmentSyncClient
project_id
required:Optional[str]
project_name
required:str
create_config()
Create a test config binding a dataset to a set of judges.
Raises
JudgmentValidationError: If a judge does not exist or is incompatible with the dataset schema.
def create_config(name, dataset, judges, description=None) -> typing.Optional:
Parameters
name
required:str
Name for the test config.
dataset
required:str
Dataset name or dataset ID.
judges
required:List[Union[str, Dict[str, str]]]
Judges to attach -- judge names (strings) or dicts
with a judge_id or name key.
description
:Optional[str]
Optional human-readable description.
None
Returns
typing.Optional - The created TestConfig, or None if the project is not
resolved.
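The judges argument accepts bare names and dicts in the same list. The sketch below only illustrates those accepted shapes: judge_key is a hypothetical helper, not part of the SDK, and the judge ID is a made-up placeholder.

```python
from typing import Dict, List, Union

JudgeSpec = Union[str, Dict[str, str]]


def judge_key(spec: JudgeSpec) -> str:
    """Return the name or ID a judges entry refers to (hypothetical helper)."""
    if isinstance(spec, str):
        return spec  # a bare string is a judge name
    if "judge_id" in spec:
        return spec["judge_id"]
    if "name" in spec:
        return spec["name"]
    raise ValueError(f"judge dict needs a 'judge_id' or 'name' key: {spec!r}")


# One judge by name, one by dict name, one by dict ID (placeholder value).
judges: List[JudgeSpec] = [
    "helpfulness",
    {"name": "faithfulness"},
    {"judge_id": "judge-123"},
]
keys = [judge_key(j) for j in judges]
```

A list shaped like judges above is what would be passed as the judges argument of create_config.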
get_config()
Fetch a test config by ID or name.
def get_config(test_config) -> typing.Optional:
Parameters
test_config
required:str
Test config ID (UUID) or name.
Returns
typing.Optional - The TestConfig, or None if the project is not resolved or
no config matches.
list_configs()
List test configs in the project.
def list_configs(dataset_id=None) -> typing.Optional:
Parameters
dataset_id
:Optional[str]
Optionally filter to configs for one dataset.
None
Returns
typing.Optional - A list of TestConfig objects, or None if the project is
not resolved.
delete_config()
Delete a test config by ID.
def delete_config(test_config_id) -> bool:
Parameters
test_config_id
required:str
Returns
bool - True if the config was deleted, False if the project is not
resolved.
run()
Run an offline test for a test config.
Fetches the dataset version's examples, optionally calls your
agent entrypoint once per example (each call producing an offline
trace), then creates the test run with the agent traces attached
so server-side judges evaluate with the agent's trace in context.
Waits for results and, when pass_condition_fn is given, PATCHes
each row's pass/fail outcome onto the run's stored results.
Without agent_function, no agent is run: the judges score each
example's existing trace -- the dataset's trace-typed column (or the
example's offline_trace_id). With agent_function, the agent is
run once per example and the judges score the agent's trace instead.
Raises
ValueError: If assert_test is set without pass_condition_fn,
or if dataset_version does not match any version of the
config's dataset.
TypeError: If the agent entrypoint cannot accept an example's
fields.
JudgmentValidationError: If the server rejects the run (e.g.
unknown judge version, empty dataset).
JudgmentTestError: If assert_test is set and any row fails.
TimeoutError: If results are not ready within timeout_seconds.
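To make the TimeoutError behavior concrete, here is a generic sketch of waiting on results with a deadline. It is not the SDK's implementation: wait_for_results, poll_interval, and the fake fetcher are all hypothetical names used only for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_for_results(
    fetch: Callable[[], Optional[T]],
    timeout_seconds: float = 600,
    poll_interval: float = 1.0,
) -> T:
    """Poll fetch() until it returns a non-None value or the deadline passes."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        result = fetch()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"results not ready within {timeout_seconds} seconds")
        time.sleep(poll_interval)


# Stand-in fetcher that becomes ready on the third poll.
calls = {"n": 0}


def fake_fetch() -> Optional[str]:
    calls["n"] += 1
    return "done" if calls["n"] >= 3 else None


outcome = wait_for_results(fake_fetch, timeout_seconds=5, poll_interval=0.01)
```

For long-running judges, raising timeout_seconds on run() is the documented knob; the polling interval here is purely an assumption of the sketch.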
def run(test_config, agent_function=None, judge_versions=None, dataset_version=None, pass_condition_fn=None, assert_test=False, timeout_seconds=600, run_name=None, field_mapping=None) -> typing.Optional:
Parameters
test_config
required:Union[str, TestConfig]
Test config name, ID, or TestConfig object.
agent_function
:Optional[AgentFunction]
Optional agent entrypoint. Called once per
dataset example with the example's data fields as
same-named keyword arguments; a signature mismatch raises
TypeError. Each call is wrapped in an OfflineTracer
and its offline trace is attributed to the result row.
The agent runs before the test run is created; the
collected traces are attached at creation so judges see
them in context. When omitted, the judges score each
example's existing trace instead.
None
judge_versions
:Optional[List[JudgeVersionPin]]
Optional version pins, e.g.
[{"name": "helpfulness", "tag": "prod"}] or
[{"name": "helpfulness", "version": "1.2"}]. Judges not
listed default to their prod tag (else latest). Pinning
two versions of the same judge runs both (results are
attributed per version).
None
dataset_version
:Optional[int | str]
Dataset version to evaluate -- a version number (int) or version ID (str). Defaults to the latest version.
None
pass_condition_fn
:Optional[PassConditionFn]
Optional callable
(data_fields, scorer_data_list) -> bool evaluated per
example row; the outcome is stored as the row's success.
Each scorer's reason and string value arrive truncated
to 128 characters server-side, so conditions should key
off scores/prefixes rather than full reason text.
None
assert_test
:bool
Raise JudgmentTestError unless every row passes
its pass condition. Requires pass_condition_fn.
False
timeout_seconds
:int
Maximum seconds to wait for judge results.
600
run_name
:Optional[str]
Optional display name for this run. Defaults to an auto-generated name server-side when omitted.
None
field_mapping
:Optional[Dict[str, str]]
Optional map from an agent parameter name to the
dataset field it should read, for when a parameter is named
differently from its field (e.g. parameter input reading
field question). Unmapped parameters fall back to the field
of the same name. Example fields the agent does not declare are
ignored.
None
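The field_mapping rules described above (mapped parameters read the mapped field, unmapped parameters read the same-named field, undeclared fields are ignored) can be sketched as follows. build_agent_kwargs is a hypothetical helper written for this illustration and may differ from how the SDK resolves arguments internally.

```python
import inspect
from typing import Any, Callable, Dict, Optional


def build_agent_kwargs(
    agent_function: Callable[..., Any],
    example_fields: Dict[str, Any],
    field_mapping: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Pick the keyword arguments for one example (hypothetical helper)."""
    mapping = field_mapping or {}
    kwargs: Dict[str, Any] = {}
    for param in inspect.signature(agent_function).parameters:
        field = mapping.get(param, param)  # unmapped: same-named field
        if field in example_fields:
            kwargs[param] = example_fields[field]
    return kwargs


def my_agent(input: str) -> str:
    return f"answer to: {input}"


# The dataset field is "question", but the agent parameter is "input".
example = {"question": "What is 2 + 2?", "expected": "4"}
kwargs = build_agent_kwargs(my_agent, example, {"input": "question"})
answer = my_agent(**kwargs)
```

Without the mapping, no field named input exists, so calling my_agent with the resulting empty kwargs raises TypeError, which matches the signature-mismatch behavior documented for agent_function.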
Returns
typing.Optional - An OfflineTestResult, or None if the project is not
resolved or the config cannot be found.
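Because scorer values and reasons arrive truncated to 128 characters, a pass condition is safer when it checks a short prefix instead of full text. The sketch below assumes judges emit values beginning with "Yes" or "No", as in the example at the top of this page; FakeScorerData is a stand-in class for local testing, not an SDK type.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class FakeScorerData:
    """Stand-in for one scorer result; only the `value` attribute is modeled."""
    value: str


def no_judge_said_no(data_fields: Dict[str, Any], scorer_data_list: List[Any]) -> bool:
    """Pass the row unless some scorer's value starts with "No"."""
    return all(not str(s.value).startswith("No") for s in scorer_data_list)


passing = no_judge_said_no({}, [FakeScorerData("Yes"), FakeScorerData("Yes, mostly")])
failing = no_judge_said_no({}, [FakeScorerData("Yes"), FakeScorerData("No: unsupported claim")])
```

A function like no_judge_said_no can be passed as pass_condition_fn; combined with assert_test=True, a single failing row would raise JudgmentTestError.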
_is_uuid()
def _is_uuid(value) -> bool:
Parameters
value
required:str
Returns
bool
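This reference gives no description for _is_uuid. Since get_config accepts either a UUID or a name, a check of this kind is presumably what tells the two apart. The sketch below shows one way to write such a check with the standard library; looks_like_uuid is a hypothetical name and this is not the SDK's implementation.

```python
import uuid


def looks_like_uuid(value: str) -> bool:
    """Return True if value parses as a UUID, False otherwise."""
    try:
        uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    return True


is_id = looks_like_uuid("12345678-1234-5678-1234-567812345678")
is_name = looks_like_uuid("nightly-regression")
```

Note that uuid.UUID also accepts the 32-hex-digit form without hyphens, so a config whose name happens to be 32 hex characters would be treated as an ID by this particular check.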