
Judgment Agent

In-product AI agent for investigating traces, sharpening judges, debugging behaviors, and reasoning over evals.

The Judgment Agent is an AI agent built into the Judgment platform. It reads the current page, accesses traces, sessions, judges, behaviors, and automations directly, and runs multi-step investigations against project data.

Example questions:

  • Why did this trace fail?
  • What should I investigate first today?
  • Improve this judge prompt against recent traces.
  • What regressed between these two test runs?
  • Which chart on this dashboard matters most right now?

[Figure: Judgment Agent answering a question over a trace]

Knows the data

The Judgment Agent uses the same tools as the MCP Server. It searches traces, scores examples, and reads judges, behaviors, and automations.

  • Cites real spans, traces, and behaviors in every answer
  • Reuses the page snapshot already on screen
  • Reasons over many traces, not just the focused one

Works top-down or bottom-up

Ask broad questions from the project home, or do focused work from a specific trace, judge, or behavior. Both flows work.

  • Asking "What should I investigate first?" from the project home fans out across recent activity
  • Cited entities are clickable; the conversation carries over with the new page context
  • The same thread works whether the question stays general or drills into specifics

Hands work back as drafts

Rubric changes and behavior tweaks land as reviewable drafts, never silent mutations.

  • Rubric drafts open in the judge editor pre-filled
  • Verdict review shows old vs. new scoring side by side
  • Nothing persists until accepted
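
The draft-then-accept flow described above can be pictured with a small sketch. This is an illustration only, not the Judgment platform's API: the `Judge` and `RubricDraft` classes and their methods are hypothetical names chosen for this example. The point it shows is that a proposed rubric lives in a separate object, and the judge changes only when the draft is explicitly accepted.

```python
from dataclasses import dataclass


@dataclass
class Judge:
    """Hypothetical stand-in for a judge with a scoring rubric."""
    name: str
    rubric: str


@dataclass
class RubricDraft:
    """Hypothetical draft: holds a proposed rubric without touching the judge."""
    judge: Judge
    proposed_rubric: str
    accepted: bool = False

    def diff(self) -> tuple[str, str]:
        # Old and new rubric, for side-by-side review.
        return (self.judge.rubric, self.proposed_rubric)

    def accept(self) -> None:
        # Only an explicit accept writes the change to the judge.
        self.judge.rubric = self.proposed_rubric
        self.accepted = True


judge = Judge(name="helpfulness", rubric="Score 1 if the answer is helpful.")
draft = RubricDraft(judge, "Score 1 if the answer is helpful and cites a source.")

old, new = draft.diff()   # review first; judge.rubric is still the old text
draft.accept()            # now judge.rubric holds the proposed text
```

Keeping the proposal in its own object is one way to guarantee that discarding a draft leaves the original judge untouched.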

Goes deep on demand

Deep Research mode plans multi-step investigations, scores representative examples, and stitches evidence together.

  • Fast mode for triage and one-shot answers
  • Deep Research for root-cause analysis across many traces
  • Tool permissions configurable per session
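
As a rough mental model of per-session modes and tool permissions, the sketch below pairs a mode with an allowlist of tools. Everything here is an assumption made for illustration: `Mode`, `SessionConfig`, `can_use`, and the tool names are hypothetical and do not come from the Judgment platform.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    """Hypothetical labels for the two modes named above."""
    FAST = "fast"
    DEEP_RESEARCH = "deep_research"


@dataclass(frozen=True)
class SessionConfig:
    """Hypothetical per-session settings: a mode plus an allowlist of tools."""
    mode: Mode
    allowed_tools: frozenset[str]

    def can_use(self, tool: str) -> bool:
        # A tool is usable only if this session explicitly allows it.
        return tool in self.allowed_tools


# Tool names are illustrative placeholders, not real identifiers.
triage = SessionConfig(Mode.FAST, frozenset({"search_traces"}))
root_cause = SessionConfig(
    Mode.DEEP_RESEARCH,
    frozenset({"search_traces", "score_examples"}),
)
```

In this sketch, a triage session cannot score examples, while a root-cause session can, because each session carries its own allowlist.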

Next steps

  • Use Cases. Concrete flows: trace investigation, judge iteration, behavior debugging, eval comparison, dashboard triage.
  • Context & Mentions. How the agent reads the page and what @ can pull in.
  • Modes. Fast vs. Deep Research and tool-permission control.
