OfflineTestRunner
Executes the offline-test lifecycle for a test config.
Executes the offline-test lifecycle for a test config.
Used by client.offline_tests.run() -- you don't instantiate it
directly. The lifecycle is:
- Resolve the dataset version under test and fetch its examples.
- Optionally call the agent entrypoint once per dataset example,
wrapped in an
OfflineTracerso each call produces an offline trace. All traces are flushed before the run is created. POST test-runs-- creates the run pinned to the dataset version fetched in step 1, attaching any agent traces (agent_traces) so server-side judge evaluation is queued with the agent's trace in judge context.- Wait for the run to reach a terminal status and fetch per-example results.
- Evaluate the optional
pass_condition_fnper row and PATCH the per-evaluation-runsuccessoutcomes onto the test run. Skipped entirely when no pass condition is given.
__init__()
def __init__(client, project_id, project_name):Parameters
client
required:JudgmentSyncClient
project_id
required:str
project_name
required:str
resolve_dataset_version()
Resolve the dataset version a run will evaluate.
Returns the raw version entry (version_id, version_number,
...) for the requested version -- a version number (int), a
version ID (str), or the latest version when dataset_version
is None. The resolved version is the one whose examples are
fetched and the one pinned at run creation, so the two always
match.
ValueError: If the dataset has no versions or no version
matches dataset_version.
def resolve_dataset_version(test_config, dataset_version=None) -> Dict[str, Any]:Parameters
test_config
required:TestConfig
dataset_version
:Optional[int | str]
None
Returns
Dict[str, Any]
fetch_examples()
Fetch every example of one dataset version.
Pages through the dataset's /page endpoint, which returns:
{
"dataset": {...},
"entries": [
{
"item": {"id": "...", "version_added": 1, "example_id": "...", "created_at": "<item ts>", ...},
"example": {"example_id": "...", "data": {...}, "offline_trace_id": "...|null", "metadata": {...}, "created_at": "<example ts>"}
}
],
"metadata": {"hasMore": true|false, "nextCursor": {"created_at": "...", "example_id": "..."} | null}
}Returns a list of dicts with keys example_id, data (always a
dict), offline_trace_id, and created_at drawn from the
nested example object. Pagination uses metadata.nextCursor
verbatim; the cursor is never derived from individual entries.
def fetch_examples(test_config, version_number) -> List[Dict[str, Any]]:Parameters
test_config
required:TestConfig
version_number
required:int
Returns
List[Dict[str, Any]]
create_test_run()
Create a test run and return the prepared payload.
Creation queues server-side judge evaluation immediately. When
agent_traces is provided, each
{example_id, agent_offline_trace_id} pair is attached so judges
evaluate with the agent's trace in context; the server validates
the example IDs against the dataset version (422 on unknown or
duplicate IDs). Callers must therefore flush agent traces before
calling this.
def create_test_run(test_config, dataset_version=None, judge_versions=None, source='sdk', agent_traces=None, name=None) -> Dict[str, Any]:Parameters
test_config
required:TestConfig
dataset_version
:Optional[int | str]
None
judge_versions
:Optional[List[JudgeVersionPin]]
None
source
:str
'sdk'
agent_traces
:Optional[Dict[str, str]]
None
name
:Optional[str]
None
Returns
Dict[str, Any]
run_agent()
Call the agent entrypoint once per dataset example.
Each call is wrapped in the OfflineTracer machinery so it
produces a dedicated offline trace; the resulting trace IDs are
returned keyed by example ID. Entrypoint/example field mismatches
raise immediately; runtime errors inside the agent are recorded on
the trace and logged, and the loop continues. The previously
active tracer (if any) is restored once the loop finishes, so
subsequent @observe spans do not route to the offline endpoint.
Before returning, the offline tracer is force-flushed and its provider shut down, so every agent trace is exported by the time the test run is created with these trace IDs attached.
def run_agent(agent_function, examples, progress=None, field_mapping=None) -> Dict[str, str]:Parameters
agent_function
required:AgentFunction
examples
required:List[Dict[str, Any]]
progress
:Optional[Progress]
None
field_mapping
:Optional[Dict[str, str]]
None
Returns
Dict[str, str]
wait_for_completion()
Poll the test run until it reaches a terminal status.
def wait_for_completion(test_run_id, timeout_seconds, progress=None) -> str:Parameters
test_run_id
required:str
timeout_seconds
required:int
progress
:Optional[Progress]
None
Returns
str
fetch_items()
Fetch every per-example scorer row for a test run.
Pages through GET .../test-runs/{id}/items with limit /
cursor query params (keyset-paginated by example_id; the
server caps pages at 200 rows) until has_more is false or
next_cursor is null. Older servers that do not paginate omit
both fields and are treated as a single full page.
The server truncates each scorer's str_value, reason, and
error to 128 characters in these rows; follow
ui_results_url for the full values.
Uses the raw JudgmentSyncClient._request helper (same httpx
machinery, headers, and error handling) because the generated
client does not expose the limit/cursor query params yet.
A future client regen will pick them up.
def fetch_items(test_run_id) -> Tuple[List[Dict[str, Any]], str]:Parameters
test_run_id
required:str
Returns
Tuple[List[Dict[str, Any]], str]
build_results()
Convert raw test-run items into ScoringResult objects.
When pass_condition_fn is provided it is called once per row
with (data_fields, scorer_data_list) and its boolean outcome is
recorded on every ScorerData of the row.
def build_results(items, agent_traces, pass_condition_fn=None) -> List[ScoringResult]:Parameters
items
required:List[Dict[str, Any]]
agent_traces
required:Dict[str, str]
pass_condition_fn
:Optional[PassConditionFn]
None
Returns
List[ScoringResult]
report_success()
PATCH per-row pass-condition outcomes onto the test run.
Sends one {evaluation_run_id, success} entry per scorer row to
PATCH .../test-runs/{id}/success. Each row's evaluation_run_id
comes from the prepare response refs (matched by judge version,
falling back to judge name); the server validates the IDs and
re-inserts its own stored rows with the new success -- nothing
else is echoed back.
def report_success(test_run_id, prepared, items, results) -> None:Parameters
test_run_id
required:str
prepared
required:Dict[str, Any]
items
required:List[Dict[str, Any]]
results
required:List[ScoringResult]
Returns
None
_patch_test_run_success()
Call PATCH .../test-runs/{id}/success via the raw client.
The route is not in the generated client yet, so this reuses
JudgmentSyncClient._request (same httpx machinery, headers, and
error handling). A future client regen will pick up the route.
def _patch_test_run_success(test_run_id, successes) -> Dict[str, Any]:Parameters
test_run_id
required:str
successes
required:List[Dict[str, Any]]
Returns
Dict[str, Any]
run()
Execute the full offline-test lifecycle for a test config.
When agent_function is omitted, no agent is invoked: the judges
score each example's existing trace (the dataset's trace-typed
column / offline_trace_id). When provided, the agent is run once
per example first and the judges score the resulting agent trace.
def run(test_config, agent_function=None, judge_versions=None, dataset_version=None, pass_condition_fn=None, assert_test=False, timeout_seconds=600, run_name=None, field_mapping=None) -> OfflineTestResult:Parameters
test_config
required:TestConfig
agent_function
:Optional[AgentFunction]
None
judge_versions
:Optional[List[JudgeVersionPin]]
None
dataset_version
:Optional[int | str]
None
pass_condition_fn
:Optional[PassConditionFn]
None
assert_test
:bool
False
timeout_seconds
:int
600
run_name
:Optional[str]
None
field_mapping
:Optional[Dict[str, str]]
None
Returns
OfflineTestResult
_assert_all_passed()
def _assert_all_passed(outcome) -> None:Parameters
outcome
required:OfflineTestResult
Returns
None
_display_results()
def _display_results(console, status, results, ui_results_url) -> None:Parameters
console
required:Console
status
required:str
results
required:List[ScoringResult]
ui_results_url
required:str
Returns
None