Judgment Agent
An agent for investigating traces, sharpening judges, debugging behaviors, and reasoning over evals.
The Judgment Agent is an AI agent built into the Judgment platform. It reads the current page, accesses traces, sessions, judges, behaviors, and automations directly, and runs multi-step investigations against project data.

Context and mentions
Up to three sources feed an answer: the page context captured automatically, any mentions added with @, and any text quoted with Ask Judgment.
Page context. A snapshot of the current page attaches to every message, so there is no need to re-paste trace IDs, judge prompts, or applied filters. Supported surfaces include project home, traces, sessions, dashboards, monitoring views, judge editors, behavior pages, test runs, and test comparisons. Before the first message is sent, the empty conversation shows suggested prompts specific to the current page.
@ mentions. Type @ to pin a judge, behavior, or automation into the conversation. Mentions persist across turns and the agent reuses them in tool calls. Inside a trace, the focused span attaches automatically through the page snapshot.

Text quotes. Select text anywhere in the product and trigger Ask Judgment to attach the selection as a quoted reference.
Modes and permissions
Deep Research is the default. It plans a multi-step investigation, searching across traces, scoring examples, cross-referencing behaviors, and stitching evidence together before answering. Use it for root-cause analysis, judge prompt iteration, and test run comparisons.
Fast is the lighter alternative. It gives single-pass answers grounded in the current page snapshot plus a small number of tool calls. Use it for triage, summaries, and quick questions.
Tool permissions control whether write tools require approval:
- Ask for writes (default) — confirm before tools create, update, or delete data.
- Auto-allow writes — write tools run without an approval card.
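The two settings amount to a gate in front of write tools. The sketch below is illustrative only and is not the platform's implementation: `PermissionMode`, `ToolCall`, `should_run`, and the `approve` callback are hypothetical names, and it assumes that read-only tools never need approval.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class PermissionMode(Enum):
    """Hypothetical names for the two settings described above."""
    ASK_FOR_WRITES = "ask_for_writes"        # the documented default
    AUTO_ALLOW_WRITES = "auto_allow_writes"


@dataclass(frozen=True)
class ToolCall:
    name: str        # illustrative tool name, not a real platform identifier
    is_write: bool   # True if the tool creates, updates, or deletes data


def should_run(
    call: ToolCall,
    mode: PermissionMode,
    approve: Callable[[ToolCall], bool],
) -> bool:
    """Decide whether a tool call may run under the given permission mode."""
    if not call.is_write:
        # Assumption: reads are never gated.
        return True
    if mode is PermissionMode.AUTO_ALLOW_WRITES:
        # Writes run without an approval step.
        return True
    # Ask for writes: defer to the user's decision on the approval card.
    return approve(call)
```

Under this model, switching modes changes only whether `approve` is consulted for writes; reads behave the same in both.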
Common workflows
Start broad from the home page
Top-down questions work from the project home, even when it is unclear where to look.
- What changed in the project this week?
- What should I investigate first?
- Surface the top behaviors and traces worth attention.
The agent fans out across recent traces, behaviors, automations, and test runs, then surfaces the specific entities to drill into. Cited entities are clickable and carry the conversation forward with new page context.
Investigate a trace
Skip the manual span-by-span scan.
- What looks suspicious in this trace and why does it matter?
- Explain the focused span in plain language.
- Investigate this behavior. Find supporting evidence or describe what would flip it to true.
Answers cite specific spans by ID, and clicking a citation jumps to that span in the trace tree.

Improve a judge rubric
Sharpen a rubric against real trace evidence — no manual sampling.
- Help me improve my judge prompt. Score relevant traces, ask a clarifying question if needed, then suggest a stronger updated prompt.
The agent searches recent traces, picks representative examples, scores them with the current rubric, and proposes a rewrite with citations. The draft opens pre-filled in the prompt editor, and Verdict Review shows how the new rubric would re-score recent traces before anything is saved.

Debug a noisy behavior
Diagnose behaviors that fire too much, not enough, or inconsistently.
- Why is the detection rate for this behavior low?
- Suggest concrete ways to make this behavior more reliable.
The agent samples recent firings, classifies failure modes, and explains the gap between what the judge looks for and what the traces contain. Rubric fixes are handed off as a pre-filled draft in the prompt editor, as in the judge rubric workflow.
Compare test runs
- Compare these two test runs. Summarize the overall difference and cite the strongest example pairs.
- Which examples regressed?
Output: a structured diff with overall delta, example-level regressions with clickable IDs, and scorer-level disagreements.
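To make the shape of that output concrete, here is a minimal sketch of how an overall delta and example-level regressions could be computed from two runs. It is an assumption-laden illustration, not the agent's actual logic: `RunDiff` and `diff_runs` are hypothetical names, each run is modeled as a plain mapping from example ID to a single numeric score, and scorer-level disagreements are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunDiff:
    overall_delta: float                          # candidate mean minus baseline mean
    regressions: list[tuple[str, float, float]]   # (example_id, baseline, candidate)


def diff_runs(baseline: dict[str, float], candidate: dict[str, float]) -> RunDiff:
    """Compare two runs over the examples they share."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        return RunDiff(0.0, [])

    base_mean = sum(baseline[k] for k in shared) / len(shared)
    cand_mean = sum(candidate[k] for k in shared) / len(shared)

    # An example regressed if its score dropped in the candidate run.
    regressions = [
        (k, baseline[k], candidate[k])
        for k in shared
        if candidate[k] < baseline[k]
    ]
    # Largest drop first, so the strongest regressions lead the list.
    regressions.sort(key=lambda r: r[2] - r[1])

    return RunDiff(cand_mean - base_mean, regressions)
```

A positive `overall_delta` can coexist with individual regressions, which is why the example-level list is reported separately from the summary number.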
Triage a monitoring dashboard
- Summarize this dashboard and highlight the most important signals.
- Which chart should I drill into first?
The agent reads active filters, time range, and visible panels. Citations link straight into the trace or session view.