Judgment Agent

An agent for investigating traces, sharpening judges, debugging behaviors, and reasoning over evals.

The Judgment Agent is an AI agent built into the Judgment platform. It reads the current page, accesses traces, sessions, judges, behaviors, and automations directly, and runs multi-step investigations against project data.

Judgment Agent answering a question over a trace

Context and mentions

Every answer draws on the page context, which is captured automatically, and on any mentions added with @. Selected text can also be attached as a quote.

Page context. A snapshot of the current page attaches to every message, so there is no need to re-paste trace IDs, judge prompts, or applied filters. Supported surfaces include project home, traces, sessions, dashboards, monitoring views, judge editors, behavior pages, test runs, and test comparisons. When the conversation is empty, the agent shows suggested prompts specific to the current page.

@ mentions. Type @ to pin a judge, behavior, or automation into the conversation. Mentions persist across turns and the agent reuses them in tool calls. Inside a trace, the focused span attaches automatically through the page snapshot.

Mention picker open showing judges, behaviors, and automations

Text quotes. Select text anywhere in the product and trigger Ask Judgment to attach the selection as a quoted reference.
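
To make the difference between per-message and persistent context concrete, here is a minimal sketch in Python. It is an illustrative model only, not the Judgment platform's API: the `Mention` and `Conversation` names, the `build_message` method, and the dictionary keys are all hypothetical. It assumes the page snapshot is captured anew for each message while pinned mentions carry over between turns, as described above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Mention:
    """Hypothetical record for an @ mention."""
    kind: str  # "judge", "behavior", or "automation"
    name: str


@dataclass
class Conversation:
    """Hypothetical conversation state; pinned mentions persist across turns."""
    mentions: list = field(default_factory=list)

    def build_message(self, text, page_snapshot, new_mentions=(), quote=None):
        """Assemble one message from a fresh snapshot plus every pinned mention."""
        for mention in new_mentions:
            if mention not in self.mentions:
                self.mentions.append(mention)
        return {
            "text": text,
            "page": dict(page_snapshot),      # captured anew on every message
            "mentions": list(self.mentions),  # carried over from earlier turns
            "quote": quote,                   # optional selected text
        }


conv = Conversation()
first = conv.build_message(
    "Why did this fail?",
    {"surface": "trace", "focused_span": "span-1"},
    new_mentions=[Mention("judge", "helpfulness")],
)
second = conv.build_message("Summarize this view.", {"surface": "dashboard"})
```

In this sketch the second message has no trace data in its page snapshot, but it still carries the judge mention pinned in the first turn.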

Modes and permissions

Deep Research is the default. It plans a multi-step investigation: it searches across traces, scores examples, cross-references behaviors, and stitches evidence together before answering. It suits root-cause analysis, judge prompt iteration, and test run comparisons.

Fast is the lighter alternative. It gives single-pass answers grounded in the current page snapshot plus a small number of tool calls. It suits triage, summaries, and quick questions.

Tool permissions control whether write tools require approval:

  • Ask for writes (default) — confirm before tools create, update, or delete data.
  • Auto-allow writes — write tools run without an approval card.

Even with auto-allow, the agent never silently mutates judges, behaviors, or rubrics. Edits land as reviewable drafts in the relevant UI. Nothing persists until accepted.
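
The interaction between the two permission settings and the draft rule can be sketched as a small decision function. This is a simplified model in Python, not the platform's implementation: `WritePolicy`, `DRAFT_ONLY`, `resolve_write`, and the returned strings are hypothetical names. The sketch also assumes that edits to judges, behaviors, and rubrics become drafts in both modes; the text above confirms this for auto-allow only.

```python
from enum import Enum


class WritePolicy(Enum):
    """Hypothetical names for the two tool-permission settings."""
    ASK = "ask_for_writes"            # default
    AUTO_ALLOW = "auto_allow_writes"


# Entity types whose edits are never applied silently (per the text above).
DRAFT_ONLY = {"judge", "behavior", "rubric"}


def resolve_write(entity_type: str, policy: WritePolicy = WritePolicy.ASK) -> str:
    """Return what happens to a proposed write under the given policy."""
    if entity_type in DRAFT_ONLY:
        # Assumption: the draft handoff applies regardless of policy.
        return "draft_for_review"
    if policy is WritePolicy.AUTO_ALLOW:
        return "run_without_approval"
    return "show_approval_card"
```

Checking the draft rule before the policy makes the guarantee explicit: no permission setting can bypass review for those entity types.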

Common workflows

Start broad from the home page

Top-down questions work from the project home, even without knowing where to look.

  • What changed in the project this week?
  • What should I investigate first?
  • Surface the top behaviors and traces worth attention.

The agent fans out across recent traces, behaviors, automations, and test runs, then surfaces the specific entities to drill into. Cited entities are clickable and carry the conversation forward with new page context.

Investigate a trace

Skip the manual span-by-span scan.

  • What looks suspicious in this trace and why does it matter?
  • Explain the focused span in plain language.
  • Investigate this behavior. Find supporting evidence or describe what would flip it to true.

Answers cite specific spans by ID. Clicking a citation jumps to that span in the trace tree.

Judgment Agent investigating a trace with span citations

Improve a judge rubric

Sharpen a rubric against real trace evidence, with no manual sampling.

  • Help me improve my judge prompt. Score relevant traces, ask a clarifying question if needed, then suggest a stronger updated prompt.

The agent searches recent traces, picks representative examples, scores them with the current rubric, and proposes a rewrite with citations. The draft opens pre-filled in the prompt editor, and Verdict Review shows how the new rubric would re-score recent traces before anything is saved.

Judgment Agent proposing a rubric change on a judge detail page

Debug a noisy behavior

Diagnose behaviors that fire too often, too rarely, or inconsistently.

  • Why is the detection rate for this behavior low?
  • Suggest concrete ways to make this behavior more reliable.

The agent samples recent firings, classifies failure modes, and explains the gap between what the judge looks for and what the traces contain. Rubric fixes go through the draft handoff: they land as reviewable drafts rather than being applied directly.

Compare test runs

  • Compare these two test runs. Summarize the overall difference and cite the strongest example pairs.
  • Which examples regressed?

Output: a structured diff with overall delta, example-level regressions with clickable IDs, and scorer-level disagreements.
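
For intuition about what such a diff contains, the overall delta and the example-level regressions can be computed as in the Python sketch below. This is not the agent's actual method or output schema. The `compare_runs` function and its field names are hypothetical, and the sketch assumes each run maps example IDs to a single numeric score where higher is better. Scorer-level disagreements are left out for brevity.

```python
from statistics import mean


def compare_runs(baseline: dict, candidate: dict) -> dict:
    """Diff two runs given as {example_id: score}; higher scores are better."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("runs share no example IDs")
    # Per-example change, candidate minus baseline.
    deltas = {ex: candidate[ex] - baseline[ex] for ex in shared}
    # Examples whose score dropped, largest drop first.
    regressions = sorted(
        (ex for ex in shared if deltas[ex] < 0),
        key=lambda ex: deltas[ex],
    )
    return {
        "overall_delta": mean(candidate[ex] for ex in shared)
        - mean(baseline[ex] for ex in shared),
        "regressions": regressions,
        "deltas": deltas,
    }


baseline = {"ex-1": 1.0, "ex-2": 1.0, "ex-3": 0.0, "ex-4": 0.0}
candidate = {"ex-1": 1.0, "ex-2": 0.0, "ex-3": 0.0, "ex-4": 0.0}
diff = compare_runs(baseline, candidate)
```

Only examples present in both runs are compared, so an example added or removed between runs does not skew the overall delta.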

Triage a monitoring dashboard

  • Summarize this dashboard and highlight the most important signals.
  • Which chart should I drill into first?

The agent reads active filters, time range, and visible panels. Citations link straight into the trace or session view.