
OfflineTestsFactory

Create test configs and execute offline test runs.

Access this via client.offline_tests -- you don't instantiate it directly. A test config pairs a dataset with a set of platform judges; a test run evaluates one dataset version with pinned judge versions and stores per-example results.

Create a config and run it:

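# `client` is an already-initialized Judgment client.
config = client.offline_tests.create_config(
    name="nightly-regression",
    dataset="golden-set",
    judges=["helpfulness", "faithfulness"],
)

# Run the config by name; run() returns None if the project or config cannot be resolved.
result = client.offline_tests.run(test_config="nightly-regression")
if result is not None:
    print(result.ui_results_url)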

Agent testing with a pass condition:

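# `agent` is your own agent object. The parameter name must match a dataset
# field name, because each example's fields are passed as keyword arguments.
def my_agent(input: str) -> str:
    return agent.invoke(input)

result = client.offline_tests.run(
    test_config="nightly-regression",
    agent_function=my_agent,
    # A row passes only if no judge returned "No".
    pass_condition_fn=lambda fields, scorers: all(
        s.value != "No" for s in scorers
    ),
    # Raise JudgmentTestError if any row fails the pass condition.
    assert_test=True,
)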

__init__()

def __init__(client, project_id, project_name):

Parameters

client (required): JudgmentSyncClient

project_id (required): Optional[str]

project_name (required): str


create_config()

Create a test config binding a dataset to a set of judges.

Raises

JudgmentValidationError: If a judge does not exist or is incompatible with the dataset schema.

def create_config(name, dataset, judges, description=None) -> Optional[TestConfig]:

Parameters

name (required): str

Name for the test config.

dataset (required): str

Dataset name or dataset ID.

judges (required): List[Union[str, Dict[str, str]]]

Judges to attach -- judge names (strings) or dicts with a judge_id or name key.

description: Optional[str]

Optional human-readable description.

Default: None

Returns

Optional[TestConfig] - The created TestConfig, or None if the project is not resolved.


get_config()

Fetch a test config by ID or name.

def get_config(test_config) -> Optional[TestConfig]:

Parameters

test_config (required): str

Test config ID (UUID) or name.

Returns

Optional[TestConfig] - The TestConfig, or None if the project is not resolved or no config matches.
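
Because get_config() returns None when nothing matches, it can be paired with create_config() for an idempotent setup step. The following is a minimal sketch, not part of the SDK: ensure_config is a hypothetical helper name, and client is assumed to be an already-initialized client exposing offline_tests. It only calls the two methods as documented on this page. Note that None is also returned when the project is not resolved, in which case create_config() returns None as well.

```python
def ensure_config(client, name, dataset, judges, description=None):
    """Return the test config called `name`, creating it if it is absent.

    Hypothetical helper: it relies only on get_config() and create_config()
    behaving as described in this reference.
    """
    existing = client.offline_tests.get_config(name)
    if existing is not None:
        return existing
    # No match (or the project is unresolved): try to create the config.
    return client.offline_tests.create_config(
        name=name,
        dataset=dataset,
        judges=judges,
        description=description,
    )
```

Calling ensure_config twice with the same name should create the config once and return the stored config on the second call.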


list_configs()

List test configs in the project.

def list_configs(dataset_id=None) -> Optional[List[TestConfig]]:

Parameters

dataset_id: Optional[str]

Optionally filter to configs for one dataset.

Default: None

Returns

Optional[List[TestConfig]] - A list of TestConfig objects, or None if the project is not resolved.


delete_config()

Delete a test config by ID.

def delete_config(test_config_id) -> bool:

Parameters

test_config_id (required): str

Returns

bool - True if the config was deleted, False if the project is not resolved.


run()

Run an offline test for a test config.

Fetches the dataset version's examples, optionally calls your agent entrypoint once per example (each call producing an offline trace), then creates the test run with the agent traces attached so server-side judges evaluate with the agent's trace in context. Waits for results and, when pass_condition_fn is given, PATCHes each row's pass/fail outcome onto the run's stored results.

Without agent_function, no agent is run: the judges score each example's existing trace -- the dataset's trace-typed column (or the example's offline_trace_id). With agent_function, the agent is run once per example and the judges score the agent's trace instead.

Raises

ValueError: If assert_test is set without pass_condition_fn, or if dataset_version does not match any version of the config's dataset.

TypeError: If the agent entrypoint cannot accept an example's fields.

JudgmentValidationError: If the server rejects the run (e.g. unknown judge version, empty dataset).

JudgmentTestError: If assert_test is set and any row fails.

TimeoutError: If results are not ready within timeout_seconds.

def run(
    test_config,
    agent_function=None,
    judge_versions=None,
    dataset_version=None,
    pass_condition_fn=None,
    assert_test=False,
    timeout_seconds=600,
    run_name=None,
    field_mapping=None,
) -> Optional[OfflineTestResult]:

Parameters

test_config (required): Union[str, TestConfig]

Test config name, ID, or TestConfig object.

agent_function: Optional[AgentFunction]

Optional agent entrypoint. Called once per dataset example with the example's data fields as same-named keyword arguments; a signature mismatch raises TypeError. Each call is wrapped in an OfflineTracer and its offline trace is attributed to the result row. The agent runs before the test run is created; the collected traces are attached at creation so judges see them in context. When omitted, the judges score each example's existing trace instead.

Default: None

judge_versions: Optional[List[JudgeVersionPin]]

Optional version pins, e.g. [{"name": "helpfulness", "tag": "prod"}] or [{"name": "helpfulness", "version": "1.2"}]. Judges not listed default to their prod tag (else latest). Pinning two versions of the same judge runs both (results are attributed per version).
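
To make the pin format concrete, here is a small sketch that builds the list for comparing several versions of one judge in a single run. pin_versions is a hypothetical helper, not an SDK function, and the version strings are illustrative; the dict shape follows the examples in the description above.

```python
from typing import Dict, List


def pin_versions(judge_name: str, *versions: str) -> List[Dict[str, str]]:
    """Build one version pin per requested version of a single judge."""
    return [{"name": judge_name, "version": v} for v in versions]


# Pin two versions of the same judge; judges that are not listed stay
# unpinned and fall back to their defaults.
pins = pin_versions("helpfulness", "1.2", "1.3")

# Hypothetical call site, commented out because it needs a live client:
# result = client.offline_tests.run(
#     test_config="nightly-regression",
#     judge_versions=pins,
# )
```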

Default: None

dataset_version: Optional[Union[int, str]]

Dataset version to evaluate -- a version number (int) or version ID (str). Defaults to the latest version.

Default: None

pass_condition_fn: Optional[PassConditionFn]

Optional callable (data_fields, scorer_data_list) -> bool evaluated per example row; the outcome is stored as the row's success. Each scorer's reason and string value arrive truncated to 128 characters server-side, so conditions should key off scores/prefixes rather than full reason text.
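
A pass condition that tolerates the 128-character truncation might look like the sketch below. FakeScorerData is a stand-in defined here only so the example runs on its own; it models just the value and reason attributes and is not the SDK's scorer type. The assumption that judges answer with values beginning "Yes" or "No" is also illustrative.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class FakeScorerData:
    """Stand-in for the SDK's scorer data; only value and reason are modeled."""

    value: str
    reason: str = ""


def passes(data_fields: Dict[str, Any], scorers: List[FakeScorerData]) -> bool:
    """Pass a row only when every judge's value starts with "yes".

    Matching on a short prefix still works after truncation, whereas an
    equality check against a long reason string would not.
    """
    # An empty scorer list is treated as a failure rather than a vacuous pass.
    return bool(scorers) and all(
        str(s.value).strip().lower().startswith("yes") for s in scorers
    )
```

Treating an empty scorer list as a failure is a design choice for this sketch; drop the bool(scorers) guard if rows without judge output should pass.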

Default: None

assert_test: bool

Raise JudgmentTestError unless every row passes its pass condition. Requires pass_condition_fn.

Default: False

timeout_seconds: int

Maximum seconds to wait for judge results.

Default: 600

run_name: Optional[str]

Optional display name for this run. Defaults to an auto-generated name server-side when omitted.

Default: None

field_mapping: Optional[Dict[str, str]]

Optional map from an agent parameter name to the dataset field it should read, for when a parameter is named differently from its field (e.g. parameter input reading field question). Unmapped parameters fall back to the field of the same name. Example fields the agent does not declare are ignored.
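
The mapping rules above can be modeled with inspect.signature. This is an illustrative sketch of the documented behavior, not the SDK's implementation; build_agent_kwargs, my_agent, and the example fields are hypothetical names.

```python
import inspect
from typing import Any, Callable, Dict, Optional


def build_agent_kwargs(
    agent_function: Callable[..., Any],
    example_fields: Dict[str, Any],
    field_mapping: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Pick the keyword arguments for one example, per the rules above."""
    mapping = field_mapping or {}
    kwargs: Dict[str, Any] = {}
    for param in inspect.signature(agent_function).parameters.values():
        if param.kind in (
            inspect.Parameter.VAR_POSITIONAL,
            inspect.Parameter.VAR_KEYWORD,
        ):
            continue  # *args / **kwargs do not correspond to a single field
        # A mapped parameter reads its mapped field; others read the same name.
        field = mapping.get(param.name, param.name)
        if field in example_fields:
            kwargs[param.name] = example_fields[field]
        elif param.default is inspect.Parameter.empty:
            raise TypeError(
                f"agent parameter {param.name!r} has no field {field!r} in the example"
            )
    # Fields the agent does not declare are never added, so they are ignored.
    return kwargs


def my_agent(input: str, style: str = "brief") -> str:
    return f"[{style}] {input}"


example = {"question": "What is 2 + 2?", "source": "unit-test"}
kwargs = build_agent_kwargs(my_agent, example, {"input": "question"})
print(my_agent(**kwargs))  # prints "[brief] What is 2 + 2?"
```

Here the parameter input reads the field question, the undeclared field source is dropped, and style keeps its default because no matching field exists.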

Default: None

Returns

Optional[OfflineTestResult] - An OfflineTestResult, or None if the project is not resolved or the config cannot be found.
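
Since run() can return None or time out, a CI wrapper may want to turn those outcomes into exit codes. A minimal sketch, assuming the TimeoutError listed for run() is Python's built-in exception and that client is an initialized client; run_for_ci and the exit-code values are hypothetical choices, not SDK behavior.

```python
def run_for_ci(client, test_config: str, timeout_seconds: int = 600) -> int:
    """Run an offline test and translate the outcome into an exit code."""
    try:
        result = client.offline_tests.run(
            test_config=test_config,
            timeout_seconds=timeout_seconds,
        )
    except TimeoutError:
        # Judge results were not ready within the allowed time.
        print(f"Judge results not ready within {timeout_seconds}s")
        return 2
    if result is None:
        # The project was not resolved or the config could not be found.
        print(f"Project not resolved or config {test_config!r} not found")
        return 1
    print(result.ui_results_url)
    return 0
```

Other exceptions raised by run(), such as JudgmentTestError when assert_test is set, are deliberately left to propagate so the failure is visible in the CI log.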


_is_uuid()

def _is_uuid(value) -> bool:

Parameters

value (required): str

Returns

bool