
OfflineTestRunner

Executes the offline-test lifecycle for a test config.

Used by client.offline_tests.run() -- you don't instantiate it directly. The lifecycle is:

  1. Resolve the dataset version under test and fetch its examples.
  2. Optionally call the agent entrypoint once per dataset example, wrapped in an OfflineTracer so each call produces an offline trace. All traces are flushed before the run is created.
  3. POST test-runs -- creates the run pinned to the dataset version fetched in step 1, attaching any agent traces (agent_traces) so server-side judge evaluation is queued with the agent's trace in judge context.
  4. Wait for the run to reach a terminal status and fetch per-example results.
  5. Evaluate the optional pass_condition_fn per row and PATCH the per-evaluation-run success outcomes onto the test run. Skipped entirely when no pass condition is given.
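
The ordering of the five steps can be sketched with an in-memory stand-in. This is a minimal illustration, not the SDK's implementation: `FakeRunner`, `lifecycle`, the `wait_and_fetch` helper, and the `test_run_id` key are all hypothetical names chosen for the example.

```python
from typing import Any, Callable, Dict, List, Optional


class FakeRunner:
    """Hypothetical stand-in that only records which lifecycle steps ran."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def resolve_dataset_version(self) -> Dict[str, Any]:
        self.calls.append("resolve")
        return {"version_id": "v-1", "version_number": 1}

    def fetch_examples(self, version_number: int) -> List[Dict[str, Any]]:
        self.calls.append("fetch_examples")
        return [{"example_id": "ex-1", "data": {"question": "hi"}}]

    def run_agent(self, agent: Callable, examples: List[Dict[str, Any]]) -> Dict[str, str]:
        self.calls.append("run_agent")
        return {ex["example_id"]: "trace-" + ex["example_id"] for ex in examples}

    def create_test_run(self, version: Dict[str, Any], agent_traces: Dict[str, str]) -> Dict[str, Any]:
        self.calls.append("create_test_run")
        return {"test_run_id": "run-1"}  # assumed key name

    def wait_and_fetch(self, test_run_id: str) -> List[Dict[str, Any]]:
        self.calls.append("wait_and_fetch")
        return [{"example_id": "ex-1"}]

    def report_success(self, test_run_id: str, items: List[Dict[str, Any]]) -> None:
        self.calls.append("report_success")


def lifecycle(runner: FakeRunner,
              agent: Optional[Callable] = None,
              pass_condition_fn: Optional[Callable] = None) -> List[str]:
    version = runner.resolve_dataset_version()                   # step 1
    examples = runner.fetch_examples(version["version_number"])  # step 1
    traces = runner.run_agent(agent, examples) if agent else {}  # step 2 (optional)
    prepared = runner.create_test_run(version, traces)           # step 3
    items = runner.wait_and_fetch(prepared["test_run_id"])       # step 4
    if pass_condition_fn is not None:                            # step 5 (optional)
        runner.report_success(prepared["test_run_id"], items)
    return runner.calls
```

Steps 2 and 5 are the only conditional ones: without an agent no traces are attached, and without a pass condition nothing is patched back onto the run.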

__init__()

def __init__(client, project_id, project_name):

Parameters

client (required): JudgmentSyncClient

project_id (required): str

project_name (required): str


resolve_dataset_version()

Resolve the dataset version a run will evaluate.

Returns the raw version entry (version_id, version_number, ...) for the requested version. dataset_version may be a version number (int) or a version ID (str); when it is None, the latest version is used. The resolved version is the one whose examples are fetched and the one pinned at run creation, so the two always match.

Raises ValueError if the dataset has no versions or no version matches dataset_version.

def resolve_dataset_version(test_config, dataset_version=None) -> Dict[str, Any]:

Parameters

test_config (required): TestConfig

dataset_version: Optional[int | str]. Default: None

Returns

Dict[str, Any]
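
The selection rule described for resolve_dataset_version() can be sketched over a plain list of version entries. `pick_version` is a hypothetical helper, and treating "latest" as the highest version_number is an assumption; the source does not say how the latest version is determined.

```python
from typing import Any, Dict, List, Optional, Union


def pick_version(versions: List[Dict[str, Any]],
                 dataset_version: Optional[Union[int, str]] = None) -> Dict[str, Any]:
    """Pick one version entry by number (int), ID (str), or latest (None)."""
    if not versions:
        raise ValueError("dataset has no versions")
    if dataset_version is None:
        # Assumption: the latest version is the one with the highest number.
        return max(versions, key=lambda v: v["version_number"])
    # An int selects by version_number; a str selects by version_id.
    key = "version_number" if isinstance(dataset_version, int) else "version_id"
    for version in versions:
        if version[key] == dataset_version:
            return version
    raise ValueError(f"no version matches {dataset_version!r}")
```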


fetch_examples()

Fetch every example of one dataset version.

Pages through the dataset's /page endpoint, which returns:

{
  "dataset": {...},
  "entries": [
    {
      "item":    {"id": "...", "version_added": 1, "example_id": "...", "created_at": "<item ts>", ...},
      "example": {"example_id": "...", "data": {...}, "offline_trace_id": "...|null", "metadata": {...}, "created_at": "<example ts>"}
    }
  ],
  "metadata": {"hasMore": true|false, "nextCursor": {"created_at": "...", "example_id": "..."} | null}
}

Returns a list of dicts with keys example_id, data (always a dict), offline_trace_id, and created_at drawn from the nested example object. Pagination uses metadata.nextCursor verbatim; the cursor is never derived from individual entries.

def fetch_examples(test_config, version_number) -> List[Dict[str, Any]]:

Parameters

test_config (required): TestConfig

version_number (required): int

Returns

List[Dict[str, Any]]
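
The cursor handling described for fetch_examples() can be sketched as a loop over a page-fetching callable. `fetch_all_examples` and `get_page` are hypothetical names; `get_page(cursor)` is assumed to return one decoded /page response in the shape shown above.

```python
from typing import Any, Callable, Dict, List, Optional


def fetch_all_examples(
    get_page: Callable[[Optional[Dict[str, Any]]], Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Collect every example by following metadata.nextCursor verbatim."""
    examples: List[Dict[str, Any]] = []
    cursor: Optional[Dict[str, Any]] = None
    while True:
        page = get_page(cursor)
        for entry in page.get("entries", []):
            example = entry["example"]  # fields come from the nested example object
            data = example.get("data")
            examples.append({
                "example_id": example["example_id"],
                # Assumption: anything that is not a dict is replaced by {}.
                "data": data if isinstance(data, dict) else {},
                "offline_trace_id": example.get("offline_trace_id"),
                "created_at": example.get("created_at"),
            })
        metadata = page.get("metadata") or {}
        cursor = metadata.get("nextCursor")  # passed back unchanged on the next request
        if not metadata.get("hasMore") or cursor is None:
            return examples
```

Passing the server's cursor back unchanged avoids rebuilding it from the last entry, which matters here because each entry carries two created_at values (item and example).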


create_test_run()

Create a test run and return the prepared payload.

Creation queues server-side judge evaluation immediately. When agent_traces is provided, each {example_id, agent_offline_trace_id} pair is attached so judges evaluate with the agent's trace in context; the server validates the example IDs against the dataset version (422 on unknown or duplicate IDs). Callers must therefore flush agent traces before calling this.

def create_test_run(test_config, dataset_version=None, judge_versions=None, source='sdk', agent_traces=None, name=None) -> Dict[str, Any]:

Parameters

test_config (required): TestConfig

dataset_version: Optional[int | str]. Default: None

judge_versions: Optional[List[JudgeVersionPin]]. Default: None

source: str. Default: 'sdk'

agent_traces: Optional[Dict[str, str]]. Default: None

name: Optional[str]. Default: None

Returns

Dict[str, Any]
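
As a rough illustration of how an agent_traces mapping might become the {example_id, agent_offline_trace_id} pairs mentioned above, the sketch below also checks the example IDs locally before anything is sent. `build_agent_trace_pairs` is a hypothetical helper; the real validation happens on the server, and the exact request body is not shown in the source.

```python
from typing import Dict, Iterable, List


def build_agent_trace_pairs(agent_traces: Dict[str, str],
                            known_example_ids: Iterable[str]) -> List[Dict[str, str]]:
    """Turn {example_id: trace_id} into a list of pair objects."""
    unknown = sorted(set(agent_traces) - set(known_example_ids))
    if unknown:
        # Mirrors the server's rejection of unknown IDs, but fails before the request.
        raise ValueError(f"unknown example IDs: {unknown}")
    # Dict keys are unique, so duplicate example IDs cannot occur here.
    return [
        {"example_id": example_id, "agent_offline_trace_id": trace_id}
        for example_id, trace_id in agent_traces.items()
    ]
```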


run_agent()

Call the agent entrypoint once per dataset example.

Each call is wrapped in the OfflineTracer machinery so it produces a dedicated offline trace; the resulting trace IDs are returned keyed by example ID. Entrypoint/example field mismatches raise immediately; runtime errors inside the agent are recorded on the trace and logged, and the loop continues. The previously active tracer (if any) is restored once the loop finishes, so subsequent @observe spans do not route to the offline endpoint.

Before returning, the offline tracer is force-flushed and its provider shut down, so every agent trace is exported by the time the test run is created with these trace IDs attached.

def run_agent(agent_function, examples, progress=None, field_mapping=None) -> Dict[str, str]:

Parameters

agent_function (required): AgentFunction

examples (required): List[Dict[str, Any]]

progress: Optional[Progress]. Default: None

field_mapping: Optional[Dict[str, str]]. Default: None

Returns

Dict[str, str]
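
The error-handling split described for run_agent() (mismatches raise, runtime failures are logged and the loop continues) can be sketched without any tracing machinery. This is an illustration under stated assumptions: `run_agent_sketch` is a hypothetical name, the trace IDs are random placeholders rather than real offline trace IDs, example fields are assumed to be passed as keyword arguments, and field_mapping is left out.

```python
import inspect
import logging
import uuid
from typing import Any, Callable, Dict, List

logger = logging.getLogger("offline_agent_sketch")


def run_agent_sketch(agent_function: Callable[..., Any],
                     examples: List[Dict[str, Any]]) -> Dict[str, str]:
    """Call the agent once per example and return placeholder trace IDs by example ID."""
    signature = inspect.signature(agent_function)
    trace_ids: Dict[str, str] = {}
    for example in examples:
        # Binding happens outside the try block, so a mismatch between the
        # example's fields and the entrypoint's parameters raises TypeError at once.
        bound = signature.bind(**example["data"])
        trace_id = uuid.uuid4().hex  # placeholder for the offline trace ID
        try:
            agent_function(*bound.args, **bound.kwargs)
        except Exception:
            # Runtime failures are logged and the loop moves on to the next example.
            logger.exception("agent failed on example %s", example["example_id"])
        trace_ids[example["example_id"]] = trace_id
    return trace_ids
```

A failing example still gets a trace ID in this sketch, matching the description that the error is recorded on the trace rather than dropping it.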


wait_for_completion()

Poll the test run until it reaches a terminal status.

def wait_for_completion(test_run_id, timeout_seconds, progress=None) -> str:

Parameters

test_run_id (required): str

timeout_seconds (required): int

progress: Optional[Progress]. Default: None

Returns

str
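
A polling loop of this kind might look like the sketch below. The status names in `TERMINAL_STATUSES`, the poll interval, and raising TimeoutError on expiry are all assumptions; the source only says the run is polled until it reaches a terminal status. `get_status` is a hypothetical callable that returns the run's current status.

```python
import time
from typing import Callable

# Assumed terminal status names; the real set is not listed in the source.
TERMINAL_STATUSES = frozenset({"completed", "failed"})


def wait_until_terminal(get_status: Callable[[], str],
                        timeout_seconds: float,
                        poll_interval: float = 2.0,
                        sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll get_status() until it returns a terminal status or the deadline passes."""
    deadline = time.monotonic() + timeout_seconds  # monotonic clock ignores wall-clock jumps
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still {status!r} after {timeout_seconds}s")
        sleep(poll_interval)
```

Injecting `sleep` keeps the loop testable without real delays.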


fetch_items()

Fetch every per-example scorer row for a test run.

Pages through GET .../test-runs/{id}/items with limit / cursor query params (keyset-paginated by example_id; the server caps pages at 200 rows) until has_more is false or next_cursor is null. Older servers that do not paginate omit both fields and are treated as a single full page.

The server truncates each scorer's str_value, reason, and error to 128 characters in these rows; follow ui_results_url for the full values.

Uses the raw JudgmentSyncClient._request helper (same httpx machinery, headers, and error handling) because the generated client does not expose the limit/cursor query params yet. A future client regen will pick them up.

def fetch_items(test_run_id) -> Tuple[List[Dict[str, Any]], str]:

Parameters

test_run_id (required): str

Returns

Tuple[List[Dict[str, Any]], str]
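
The stop conditions described for fetch_items() can be sketched as follows. `fetch_all_items` and `get_page` are hypothetical names, and the "items" key for the rows of a page is an assumption; only limit, cursor, has_more, and next_cursor come from the description above.

```python
from typing import Any, Callable, Dict, List


def fetch_all_items(get_page: Callable[..., Dict[str, Any]],
                    limit: int = 200) -> List[Dict[str, Any]]:
    """Follow next_cursor until has_more is false or next_cursor is null."""
    rows: List[Dict[str, Any]] = []
    cursor = None
    while True:
        page = get_page(limit=limit, cursor=cursor)
        rows.extend(page.get("items", []))  # "items" is an assumed key name
        cursor = page.get("next_cursor")
        # A response that omits both fields (older server) ends the loop
        # after the first page, which is then treated as the full result.
        if not page.get("has_more") or cursor is None:
            return rows
```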


build_results()

Convert raw test-run items into ScoringResult objects.

When pass_condition_fn is provided it is called once per row with (data_fields, scorer_data_list) and its boolean outcome is recorded on every ScorerData of the row.

def build_results(items, agent_traces, pass_condition_fn=None) -> List[ScoringResult]:

Parameters

items (required): List[Dict[str, Any]]

agent_traces (required): Dict[str, str]

pass_condition_fn: Optional[PassConditionFn]. Default: None

Returns

List[ScoringResult]
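
To make the pass_condition_fn contract concrete, the sketch below calls a condition once per row with (data_fields, scorer_data_list) and copies its boolean outcome onto every scorer entry of that row. `FakeScorerData` is a hypothetical stand-in for ScorerData, with assumed `score` and `success` fields, and `all_scores_at_least` is an example condition, not part of the SDK.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple


@dataclass
class FakeScorerData:
    """Hypothetical stand-in for ScorerData; field names are assumptions."""
    name: str
    score: float
    success: Optional[bool] = None


PassCondition = Callable[[Dict[str, Any], List[FakeScorerData]], bool]


def all_scores_at_least(threshold: float) -> PassCondition:
    """Build a pass condition that requires every scorer to meet the threshold."""
    def pass_condition(data_fields: Dict[str, Any],
                       scorer_data_list: List[FakeScorerData]) -> bool:
        return all(s.score >= threshold for s in scorer_data_list)
    return pass_condition


def apply_pass_condition(rows: List[Tuple[Dict[str, Any], List[FakeScorerData]]],
                         pass_condition_fn: PassCondition) -> None:
    """Call the condition once per row and record the outcome on each scorer."""
    for data_fields, scorers in rows:
        outcome = bool(pass_condition_fn(data_fields, scorers))
        for scorer in scorers:
            scorer.success = outcome
```

Because the outcome is per row, one low score marks every scorer in that row as failed, including scorers that individually met the threshold.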


report_success()

PATCH per-row pass-condition outcomes onto the test run.

Sends one {evaluation_run_id, success} entry per scorer row to PATCH .../test-runs/{id}/success. Each row's evaluation_run_id comes from the prepare response refs (matched by judge version, falling back to judge name); the server validates the IDs and re-inserts its own stored rows with the new success -- nothing else is echoed back.

def report_success(test_run_id, prepared, items, results) -> None:

Parameters

test_run_id (required): str

prepared (required): Dict[str, Any]

items (required): List[Dict[str, Any]]

results (required): List[ScoringResult]

Returns

None
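
The matching rule (judge version first, judge name as fallback) can be sketched as below. Every field name other than evaluation_run_id and success is an assumption, as is skipping rows that match no ref; the source does not show the shape of the prepare response refs or of the scorer rows. `build_success_entries` is a hypothetical helper.

```python
from typing import Any, Dict, List


def build_success_entries(refs: List[Dict[str, Any]],
                          scorer_rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Build one {evaluation_run_id, success} entry per matched scorer row."""
    # Assumed ref fields: judge_version_id, judge_name, evaluation_run_id.
    by_version = {r["judge_version_id"]: r["evaluation_run_id"]
                  for r in refs if r.get("judge_version_id")}
    by_name = {r["judge_name"]: r["evaluation_run_id"]
               for r in refs if r.get("judge_name")}
    entries: List[Dict[str, Any]] = []
    for row in scorer_rows:
        # Prefer the judge-version match; fall back to the judge name.
        run_id = by_version.get(row.get("judge_version_id")) or by_name.get(row.get("judge_name"))
        if run_id is None:
            continue  # assumption: rows with no matching ref are skipped
        entries.append({"evaluation_run_id": run_id, "success": bool(row["success"])})
    return entries
```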


_patch_test_run_success()

Call PATCH .../test-runs/{id}/success via the raw client.

The route is not in the generated client yet, so this reuses JudgmentSyncClient._request (same httpx machinery, headers, and error handling). A future client regen will pick up the route.

def _patch_test_run_success(test_run_id, successes) -> Dict[str, Any]:

Parameters

test_run_id (required): str

successes (required): List[Dict[str, Any]]

Returns

Dict[str, Any]


run()

Execute the full offline-test lifecycle for a test config.

When agent_function is omitted, no agent is invoked: the judges score each example's existing trace (the dataset's trace-typed column / offline_trace_id). When provided, the agent is run once per example first and the judges score the resulting agent trace.

def run(test_config, agent_function=None, judge_versions=None, dataset_version=None, pass_condition_fn=None, assert_test=False, timeout_seconds=600, run_name=None, field_mapping=None) -> OfflineTestResult:

Parameters

test_config (required): TestConfig

agent_function: Optional[AgentFunction]. Default: None

judge_versions: Optional[List[JudgeVersionPin]]. Default: None

dataset_version: Optional[int | str]. Default: None

pass_condition_fn: Optional[PassConditionFn]. Default: None

assert_test: bool. Default: False

timeout_seconds: int. Default: 600

run_name: Optional[str]. Default: None

field_mapping: Optional[Dict[str, str]]. Default: None

Returns

OfflineTestResult


_assert_all_passed()

def _assert_all_passed(outcome) -> None:

Parameters

outcome (required): OfflineTestResult

Returns

None


_display_results()

def _display_results(console, status, results, ui_results_url) -> None:

Parameters

console (required): Console

status (required): str

results (required): List[ScoringResult]

ui_results_url (required): str

Returns

None