Use Cases
Common questions Judgment Agent is best at answering: top-down investigations, trace debugging, judge iteration, behavior tuning, and eval comparison. Each one ends with a concrete output ready to ship.
Start broad from the home page
Top-down questions work from the project home, even without knowing where to look.
- What changed in the project this week?
- Where are the agents struggling the most right now?
- What should I investigate first?
- Surface the top behaviors and traces worth my attention.
The agent fans out across recent traces, behaviors, automations, and test runs, then surfaces the specific entities to drill into. Any cited entity is clickable and carries the conversation forward with that page's context.
Investigate a trace
Skip the manual span-by-span scan.
- What looks suspicious in this trace and why does it matter?
- Explain the focused span in plain language.
- Investigate this behavior. Find supporting evidence or describe what would flip it to true.
Answers cite specific spans by ID, and each citation is clickable to jump to that span in the trace tree. When a binary behavior evaluates to false, the agent treats the result as missing evidence and describes exactly what would flip the verdict to true.

Improve a judge prompt
Sharpen a rubric against real trace evidence. No manual sampling.
- Help me improve my judge prompt. Score relevant traces, ask a clarifying question if needed, then suggest a stronger updated prompt.
The agent searches recent traces, picks representative examples, scores them with the current rubric, and proposes a rewrite with citations to the traces that drove each change. The draft opens in the prompt editor pre-filled, and Verdict Review shows how the new rubric would re-score recent traces against the old one before saving.

Debug a noisy behavior
Diagnose behaviors that fire too much, not enough, or inconsistently.
- Why is the detection rate for this behavior low?
- Show common failure patterns in traces where this behavior fires.
- Suggest concrete ways to make this behavior more reliable.
The agent samples recent firings, classifies failure modes, and explains the gap between what the judge looks for and what the traces actually contain. Rubric fixes apply via the draft handoff.
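The diagnosis above boils down to two numbers: how often the behavior fires, and which failure modes dominate when it does. A minimal sketch of that tally, using illustrative trace records and labels (not the Judgment SDK):

```python
from collections import Counter

# Hypothetical trace records: each carries the behavior's verdict and,
# where it fired, a short failure label assigned during review.
traces = [
    {"id": "t1", "fired": True,  "failure": "vague rubric wording"},
    {"id": "t2", "fired": False, "failure": None},
    {"id": "t3", "fired": True,  "failure": "evidence outside sampled spans"},
    {"id": "t4", "fired": False, "failure": None},
    {"id": "t5", "fired": True,  "failure": "vague rubric wording"},
]

# Detection rate: fraction of sampled traces where the behavior fired.
detection_rate = sum(t["fired"] for t in traces) / len(traces)

# Failure-mode histogram over the firings, most common first.
failure_modes = Counter(t["failure"] for t in traces if t["fired"])

print(f"detection rate: {detection_rate:.0%}")
for mode, count in failure_modes.most_common():
    print(f"{count}x {mode}")
```

The most common label points at the gap between what the rubric asks for and what the traces contain, which is where a rubric fix starts.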
Compare two test runs
Surface regressions between runs without reading every example.
- Compare these two test runs. Summarize the overall difference and cite the strongest example pairs.
- Which examples regressed?
- Where do scorers disagree most?
Output: a structured diff with overall delta, example-level regressions with clickable IDs, and scorer-level disagreements.
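That structured diff can be sketched in a few lines. The run shape and scorer names below are illustrative assumptions, not the product's API: each run maps example IDs to per-scorer scores.

```python
# Hypothetical run results: example id -> per-scorer scores.
run_a = {
    "ex1": {"accuracy": 0.9, "tone": 0.8},
    "ex2": {"accuracy": 0.7, "tone": 0.9},
    "ex3": {"accuracy": 0.6, "tone": 0.5},
}
run_b = {
    "ex1": {"accuracy": 0.9, "tone": 0.6},
    "ex2": {"accuracy": 0.4, "tone": 0.9},
    "ex3": {"accuracy": 0.8, "tone": 0.5},
}

def mean(scores):
    return sum(scores.values()) / len(scores)

# Example-level deltas; negative means the example regressed in run_b.
deltas = {ex: mean(run_b[ex]) - mean(run_a[ex]) for ex in run_a}
regressions = sorted((ex for ex, d in deltas.items() if d < 0), key=deltas.get)

# Scorer-level disagreement: mean absolute score change per scorer.
scorers = next(iter(run_a.values())).keys()
disagreement = {
    s: sum(abs(run_b[ex][s] - run_a[ex][s]) for ex in run_a) / len(run_a)
    for s in scorers
}

overall_delta = sum(deltas.values()) / len(deltas)
print(f"overall delta: {overall_delta:+.2f}")
print("regressed examples (worst first):", regressions)
print("scorer disagreement:", disagreement)
```

Here `ex2` regressed hardest (accuracy dropped 0.3) and `accuracy` is the scorer that moved most, which is exactly the shape of answer the agent cites example IDs for.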
Triage a monitoring dashboard
Find the signal in a noisy view.
- Summarize this dashboard and highlight the most important signals.
- What changed most recently?
- Which chart should I drill into first?
The agent reads active filters, time range, and visible panels. Citations link straight into the trace or session view, where the conversation continues with new context.
Next steps
- Context & Mentions. How the agent reads the page and what @ can pull in.
- Modes. When to escalate from Fast to Deep Research.
- MCP Server. The same toolset exposed to external AI editors.