OfflineTestsFactory

Create test configs and execute offline test runs.

Access this via client.offline_tests -- you don't instantiate it directly. A test config pairs a dataset with a set of platform judges; a test run evaluates one dataset version with pinned judge versions and stores per-example results.

Create a config and run it:

config = client.offline_tests.create_config(
    name="nightly-regression",
    dataset="golden-set",
    judges=["helpfulness", "faithfulness"],
)

result = client.offline_tests.run(test_config="nightly-regression")
print(result.ui_results_url)

Agent testing with a pass condition:

def my_agent(input: str) -> str:
    # `agent` stands for your own application object; it is not provided by the SDK.
    return agent.invoke(input)

result = client.offline_tests.run(
    test_config="nightly-regression",
    agent_function=my_agent,
    pass_condition_fn=lambda fields, scorers: all(
        s.value != "No" for s in scorers
    ),
    assert_test=True,
)

__init__()

def __init__(client, project_id, project_name):

Parameters

client
required

JudgmentSyncClient

project_id
required

Optional[str]

project_name
required

str

create_config()

Create a test config binding a dataset to a set of judges.

Raises

JudgmentValidationError: If a judge does not exist or is incompatible with the dataset schema.

def create_config(name, dataset, judges, description=None) -> Optional[TestConfig]:

Parameters

name
required

str

Name for the test config.

dataset
required

str

Dataset name or dataset ID.

judges
required

List[Union[str, Dict[str, str]]]

Judges to attach -- judge names (strings) or dicts with a judge_id or name key.

description

Optional[str]

Optional human-readable description.

Default:

None

Returns

Optional[TestConfig] - The created TestConfig, or None if the project is not resolved.

get_config()

Fetch a test config by ID or name.

def get_config(test_config) -> Optional[TestConfig]:

Parameters

test_config
required

str

Test config ID (UUID) or name.

Returns

Optional[TestConfig] - The TestConfig, or None if the project is not resolved or no config matches.

list_configs()

List test configs in the project.

def list_configs(dataset_id=None) -> Optional[List[TestConfig]]:

Parameters

dataset_id

Optional[str]

Optionally filter to configs for one dataset.

Default:

None

Returns

Optional[List[TestConfig]] - A list of TestConfig objects, or None if the project is not resolved.

delete_config()

Delete a test config by ID.

def delete_config(test_config_id) -> bool:

Parameters

test_config_id
required

str

Returns

bool - True if the config was deleted, False if the project is not resolved.

run()

Run an offline test for a test config.

Fetches the dataset version's examples, optionally calls your agent entrypoint once per example (each call producing an offline trace), then creates the test run with the agent traces attached so server-side judges evaluate with the agent's trace in context. Waits for results and, when pass_condition_fn is given, PATCHes each row's pass/fail outcome onto the run's stored results.

Without agent_function, no agent is run: the judges score each example's existing trace -- the dataset's trace-typed column (or the example's offline_trace_id). With agent_function, the agent is run once per example and the judges score the agent's trace instead.

Raises

ValueError: If assert_test is set without pass_condition_fn, or if dataset_version does not match any version of the config's dataset.

TypeError: If the agent entrypoint cannot accept an example's fields.

JudgmentValidationError: If the server rejects the run (e.g. unknown judge version, empty dataset).

JudgmentTestError: If assert_test is set and any row fails.

TimeoutError: If results are not ready within timeout_seconds.

def run(test_config, agent_function=None, judge_versions=None, dataset_version=None, pass_condition_fn=None, assert_test=False, timeout_seconds=600, run_name=None, field_mapping=None) -> Optional[OfflineTestResult]:

Parameters

test_config
required

Union[str, TestConfig]

Test config name, ID, or TestConfig object.

agent_function

Optional[AgentFunction]

Optional agent entrypoint. Called once per dataset example with the example's data fields as same-named keyword arguments; a signature mismatch raises TypeError. Each call is wrapped in an OfflineTracer and its offline trace is attributed to the result row. The agent runs before the test run is created; the collected traces are attached at creation so judges see them in context. When omitted, the judges score each example's existing trace instead.

Default:

None

judge_versions

Optional[List[JudgeVersionPin]]

Optional version pins, e.g. [{"name": "helpfulness", "tag": "prod"}] or [{"name": "helpfulness", "version": "1.2"}]. Judges not listed default to their prod tag (else latest). Pinning two versions of the same judge runs both (results are attributed per version).
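The pinning rules above can be pictured with a small sketch. It is only an illustration of the description, not the SDK's or the server's resolution logic. The helper name describe_pins and the selector strings it produces are hypothetical.

```python
from typing import Dict, List


def describe_pins(
    judges: List[str], judge_versions: List[Dict[str, str]]
) -> Dict[str, List[str]]:
    """Hypothetical helper: list which version selectors apply to each judge."""
    plan: Dict[str, List[str]] = {name: [] for name in judges}
    for pin in judge_versions:
        # A pin carries either a "tag" or a "version" key, as in the examples above.
        selector = f"tag={pin['tag']}" if "tag" in pin else f"version={pin['version']}"
        # A pin for a judge outside the config is simply listed here; how the
        # server treats that case is not covered by the text.
        plan.setdefault(pin["name"], []).append(selector)
    for selectors in plan.values():
        if not selectors:
            # Unlisted judges fall back to their prod tag, else latest.
            selectors.append("default: prod tag, else latest")
    return plan


plan = describe_pins(
    ["helpfulness", "faithfulness"],
    [
        {"name": "helpfulness", "tag": "prod"},
        {"name": "helpfulness", "version": "1.2"},
    ],
)
for judge, selectors in plan.items():
    print(judge, selectors)
```

Here helpfulness is pinned twice, so both versions would run, while faithfulness is left on its default.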

Default:

None

dataset_version

Optional[Union[int, str]]

Dataset version to evaluate -- a version number (int) or version ID (str). Defaults to the latest version.

Default:

None

pass_condition_fn

Optional[PassConditionFn]

Optional callable (data_fields, scorer_data_list) -> bool evaluated per example row; the outcome is stored as the row's success. Each scorer's reason and string value arrive truncated to 128 characters server-side, so conditions should key off scores/prefixes rather than full reason text.
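Because reason and string value are truncated, a pass condition is safer when it inspects only the leading token or a numeric score. The sketch below shows one way to write such a function. ScorerStandIn is a hypothetical stand-in used so the example runs on its own. The real scorer objects are whatever the SDK passes in. The score attribute and the 0.5 threshold are assumptions; only value appears in the examples on this page.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ScorerStandIn:
    """Hypothetical stand-in for one scorer result; not the SDK's type."""
    name: str
    value: Optional[str] = None
    score: Optional[float] = None  # assumed attribute
    reason: str = ""


def first_token(text: str) -> str:
    """Return the first word, lowercased and stripped of trailing punctuation."""
    parts = text.split()
    return parts[0].strip(".,:;").lower() if parts else ""


def pass_condition(
    data_fields: Dict[str, Any],
    scorers: List[ScorerStandIn],
    min_score: float = 0.5,  # assumed threshold
) -> bool:
    """Fail the row if any scorer says "No" or scores below min_score."""
    for s in scorers:
        # Only the leading token is checked, so truncation cannot change the verdict.
        if s.value is not None and first_token(str(s.value)) == "no":
            return False
        if s.score is not None and s.score < min_score:
            return False
    return True


good = [ScorerStandIn("helpfulness", value="Yes"), ScorerStandIn("faithfulness", score=0.9)]
bad = [ScorerStandIn("helpfulness", value="No, the answer ignores the question")]
print(pass_condition({}, good), pass_condition({}, bad))
```

A function like this could be passed as pass_condition_fn=pass_condition; the third parameter has a default, so it can still be called with two arguments.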

Default:

None

assert_test

bool

Raise JudgmentTestError unless every row passes its pass condition. Requires pass_condition_fn.

Default:

False

timeout_seconds

int

Maximum seconds to wait for judge results.

Default:

600

run_name

Optional[str]

Optional display name for this run. Defaults to an auto-generated name server-side when omitted.

Default:

None

field_mapping

Optional[Dict[str, str]]

Optional map from an agent parameter name to the dataset field it should read, for when a parameter is named differently from its field (e.g. parameter input reading field question). Unmapped parameters fall back to the field of the same name. Example fields the agent does not declare are ignored.
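The mapping rules can be illustrated with a short, self-contained sketch. It mimics the described behavior using inspect.signature and is not the SDK's implementation. The helper name resolve_agent_kwargs and the sample fields are hypothetical.

```python
import inspect
from typing import Any, Callable, Dict, Optional


def resolve_agent_kwargs(
    agent_function: Callable[..., Any],
    example_fields: Dict[str, Any],
    field_mapping: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: build the keyword arguments for one example."""
    mapping = field_mapping or {}
    kwargs: Dict[str, Any] = {}
    for param_name in inspect.signature(agent_function).parameters:
        # A mapped parameter reads its mapped field; otherwise the same-named field.
        field_name = mapping.get(param_name, param_name)
        if field_name in example_fields:
            kwargs[param_name] = example_fields[field_name]
    # Fields the agent does not declare never enter kwargs.
    return kwargs


def my_agent(input: str, context: str = "") -> str:
    return f"{input} | {context}"


example = {"question": "What is 2 + 2?", "context": "arithmetic", "expected": "4"}
kwargs = resolve_agent_kwargs(my_agent, example, {"input": "question"})
print(kwargs)
print(my_agent(**kwargs))
```

In this sketch the parameter input reads the field question through the mapping, context falls back to the same-named field, and expected is ignored because the agent does not declare it.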

Default:

None

Returns

Optional[OfflineTestResult] - An OfflineTestResult, or None if the project is not resolved or the config cannot be found.
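Since run() and the config methods return None rather than raising when the project is not resolved, callers may want to turn a None into an explicit failure. A minimal sketch, assuming a hypothetical helper named require that is not part of the SDK:

```python
from typing import Optional, TypeVar

T = TypeVar("T")


def require(value: Optional[T], what: str) -> T:
    """Hypothetical helper: raise instead of silently passing None along."""
    if value is None:
        raise RuntimeError(
            f"{what} returned None; check that the project is resolved "
            "and that the config name or ID exists"
        )
    return value


# Stand-in value so the sketch runs without the SDK. In real use this would be
# something like: require(client.offline_tests.run(test_config="nightly-regression"), "run")
config = require({"name": "nightly-regression"}, "get_config")
print(config["name"])
```

Failing early this way keeps a missing project or config from surfacing later as an AttributeError on None.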

_is_uuid()

def _is_uuid(value) -> bool:

Parameters

value
required

str

Returns

bool
