MCP Server

The Judgment MCP server exposes your production data — traces, sessions, behaviors, judges, projects, and automations — directly to AI-powered code editors via the Model Context Protocol. This lets your AI assistant query real production data, analyze agent performance, create and manage behaviors and automations, and use those insights to optimize your code — all without leaving your editor.

The same toolset is available inside the Judgment platform itself. See the Judgment Agent, the in-product AI agent that already has these tools wired up plus full page context.

Setup

Connect the MCP Server

The Judgment MCP server supports two authentication methods:

OAuth 2.1 + PKCE (recommended) — Your editor opens a browser window where you sign in and authorize access. After completing the flow, your editor holds an access token automatically — no API keys or headers needed.
API Key — Pass your Judgment API key as a Bearer token.

Add the following to your ~/.cursor/mcp.json (global) or .cursor/mcp.json (project-level):

OAuth (recommended)

{
  "mcpServers": {
    "judgment-mcp": {
      "url": "https://mcp.judgmentlabs.ai"
    }
  }
}

Cursor will detect the OAuth server and prompt you to authorize in the browser on first use.

API Key

{
  "mcpServers": {
    "judgment-mcp": {
      "url": "https://mcp.judgmentlabs.ai",
      "headers": {
        "Authorization": "Bearer <YOUR_JUDGMENT_API_KEY>"
      }
    }
  }
}
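Writing the key into the config by hand is easy to get wrong when you manage several machines or projects. As one possible alternative, the sketch below builds the API-key entry shown above from a value you pass in and merges it into an existing mcp.json without touching other servers. The helper names are hypothetical and not part of any Judgment tooling.

```python
import json
from pathlib import Path


def build_cursor_config(api_key: str) -> dict:
    """Return the API-key variant of the Cursor MCP config shown above."""
    return {
        "mcpServers": {
            "judgment-mcp": {
                "url": "https://mcp.judgmentlabs.ai",
                "headers": {"Authorization": f"Bearer {api_key}"},
            }
        }
    }


def write_cursor_config(path: Path, api_key: str) -> None:
    """Merge the Judgment entry into mcp.json, keeping any other servers."""
    existing = json.loads(path.read_text()) if path.exists() else {}
    servers = existing.setdefault("mcpServers", {})
    servers.update(build_cursor_config(api_key)["mcpServers"])
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(existing, indent=2) + "\n")
```

A typical call would be `write_cursor_config(Path(".cursor/mcp.json"), os.environ["JUDGMENT_API_KEY"])`, where the environment variable name is an assumption. The generated file still contains the key, so keep it out of version control.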

In Claude Code, run the following command to add the Judgment MCP server:

OAuth (recommended)

claude mcp add judgment-mcp \
  --transport http \
  https://mcp.judgmentlabs.ai

Claude Code will open a browser window to complete authorization the first time a tool is called.

API Key

claude mcp add judgment-mcp \
  --transport http \
  https://mcp.judgmentlabs.ai \
  --header "Authorization: Bearer <YOUR_JUDGMENT_API_KEY>"

Add the following to your ~/.codeium/windsurf/mcp_config.json:

OAuth (recommended)

{
  "mcpServers": {
    "judgment-mcp": {
      "serverUrl": "https://mcp.judgmentlabs.ai"
    }
  }
}

Windsurf will detect the OAuth server and prompt you to authorize in the browser on first use.

API Key

{
  "mcpServers": {
    "judgment-mcp": {
      "serverUrl": "https://mcp.judgmentlabs.ai",
      "headers": {
        "Authorization": "Bearer <YOUR_JUDGMENT_API_KEY>"
      }
    }
  }
}

Add the following to your ~/.codex/config.toml:

OAuth (recommended)

[mcp_servers.judgment-mcp]
url = "https://mcp.judgmentlabs.ai"

Or add an MCP server with the Codex CLI:

codex mcp add judgment-mcp \
  -- npx -y mcp-remote https://mcp.judgmentlabs.ai

Codex will detect the OAuth server and prompt you to authorize in the browser on first use.

API Key

[mcp_servers.judgment-mcp]
url = "https://mcp.judgmentlabs.ai"
bearer_token_env_var = "JUDGMENT_API_KEY"

Or via the CLI:

codex mcp add judgment-mcp \
  -- npx -y mcp-remote https://mcp.judgmentlabs.ai \
  --header "Authorization: Bearer <YOUR_JUDGMENT_API_KEY>"

Add the Best Practices Skill

To help your AI assistant use the MCP server effectively, add the Judgment MCP best practices skill. This teaches your assistant optimal patterns like batching queries, using full-text search first, and deduplicating results.

Cursor

mkdir -p .cursor/skills/judgment-mcp
curl -fLo .cursor/skills/judgment-mcp/SKILL.md \
  https://docs.judgmentlabs.ai/skills/mcp-server-best-practices.md

Claude Code

mkdir -p .claude/skills/mcp-server-best-practices
curl -fLo .claude/skills/mcp-server-best-practices/SKILL.md \
  https://docs.judgmentlabs.ai/skills/mcp-server-best-practices.md

Windsurf

mkdir -p .windsurf/skills/judgment-mcp
curl -fLo .windsurf/skills/judgment-mcp/SKILL.md \
  https://docs.judgmentlabs.ai/skills/mcp-server-best-practices.md

Codex

mkdir -p .codex/skills/judgment-mcp
curl -fLo .codex/skills/judgment-mcp/SKILL.md \
  https://docs.judgmentlabs.ai/skills/mcp-server-best-practices.md

What You Can Do

Once connected, your AI assistant can query and manage Judgment data through natural language. Here are some examples:

"Show me the slowest traces from the last 24 hours"
"Find traces where users asked about billing"
"What behaviors were detected in session X?"
"List all automations configured for this project"
"Show me traces with errors that cost more than $0.10"
"Create a binary behavior that checks whether responses are on-topic"
"Re-evaluate all traces in the project with the Relevance judge"
"Add a 'reviewed' tag to trace XYZ"
"Create an automation that alerts when error rate exceeds 10%"
"Write a memory file with key facts the agent should remember across sessions"
"List all agent memory entries for this project"
"Run a test on the last 50 traces with my Accuracy judge"
"Show me the test results for my latest experiment run"
"Add these traces to the golden-path dataset"
"Create a new dataset called 'edge-cases' and add these examples to it"

Use Production Data to Optimize Your Code

Beyond querying data, the MCP server enables a powerful workflow: use real production insights to improve your agent code. Your AI assistant can:

Find failing patterns — search traces for errors, high latency, or unexpected behaviors, then fix the underlying code
Analyze behavior trends — check which behaviors are firing most often and adjust your prompts or logic accordingly
Optimize costs — identify expensive traces and refactor the agent flows that produce them
Debug sessions — walk through an entire user session's traces to understand where things went wrong

Available Tools

The MCP server provides 72 tools organized into fourteen categories. Every tool except list_organizations takes an organization_id argument that selects the organization the call operates on.
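Your editor's MCP client normally builds tool calls for you, but it can help to see the shape of one. Under the Model Context Protocol, a tool is invoked with a JSON-RPC `tools/call` request. The sketch below builds such a request body and enforces the rule that every tool except `list_organizations` needs an `organization_id`. The helper name `tool_call_request` and the ID `org_123` are hypothetical.

```python
from typing import Any, Optional


def tool_call_request(
    request_id: int,
    tool: str,
    organization_id: Optional[str] = None,
    **arguments: Any,
) -> dict:
    """Build an MCP `tools/call` JSON-RPC request body for one tool."""
    if tool != "list_organizations":
        # Every other tool operates on a specific organization.
        if organization_id is None:
            raise ValueError(f"{tool} requires organization_id")
        arguments = {"organization_id": organization_id, **arguments}
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


# First discover the organization, then pass its ID to every later call.
discover = tool_call_request(1, "list_organizations")
projects = tool_call_request(2, "list_projects", organization_id="org_123")
```

The two example calls mirror the intended flow: call `list_organizations` once, then reuse the returned `organization_id` everywhere else.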

Organizations

Tool	Description
`list_organizations`	List the organizations the authenticated user is a member of. Use the returned `organization_id` as input to all other MCP tools.

Projects

Tool	Description
`list_projects`	List all projects in your organization with summary stats (datasets, experiment runs, traces, behaviors).
`create_project`	Create a new project in your organization. Requires the developer role.
`add_project_favorite`	Mark a project as a favorite so it appears pinned in the UI.
`remove_project_favorite`	Remove a project from your favorites.

Traces

Tool	Description
`search_traces`	Batch search up to 10 queries per call. Filter by duration, error, span name, customer ID, session ID, tags, LLM cost, behaviors, scores, or full-text search. Sort by `created_at` (default), `span_name`, `duration`, or `llm_cost`. Any sort other than `created_at` descending requires `time_range.start_time` and a time window of at most 7 days.
`get_trace_detail`	Get duration, cost, and session info for a single trace.
`get_trace_spans`	Get all spans for a trace.
`get_trace_span`	Batch get span details (including scores and annotations) for up to 20 trace/span pairs.
`get_trace_tags`	Get tags for a trace.
`get_trace_behaviors`	Get behavior results (binary/categorical scores) for a trace.
`add_trace_tags`	Attach one or more string tags to an existing trace. Tags are additive — existing tags are preserved. Requires the developer role.
`evaluate_traces`	Trigger online evaluation for specific traces (up to 100). Optionally restrict to named judges. Requires the developer role.
`evaluate_all_traces`	Trigger online evaluation for all recent traces (up to a configurable limit, default 1000). Optionally restrict to named judges. Requires the developer role.
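Several trace tools cap how much one call can carry: 10 queries for `search_traces`, 20 trace/span pairs for `get_trace_span`, and 100 traces for `evaluate_traces`. `search_traces` also restricts sorting. A client-side sketch of both rules follows. The function names, the `descending` flag, and the choice to treat a missing end time as "now" are assumptions for illustration, since the server's actual argument names are not listed here.

```python
from datetime import datetime, timedelta
from typing import Iterator, Optional, Sequence, TypeVar

T = TypeVar("T")

# Per-call limits quoted in the Traces table.
MAX_SEARCH_QUERIES = 10
MAX_SPAN_PAIRS = 20
MAX_EVALUATE_TRACES = 100
MAX_SORT_WINDOW = timedelta(days=7)


def chunked(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def sort_is_allowed(
    sort_by: str,
    descending: bool,
    start_time: Optional[datetime],
    end_time: Optional[datetime],
) -> bool:
    """Check the search_traces sort constraint before sending a query."""
    if sort_by == "created_at" and descending:
        return True  # the default sort has no time-range requirement
    if start_time is None:
        return False  # every other sort needs time_range.start_time
    # Assumption: an open-ended range is measured up to the current time.
    end = end_time or datetime.now(start_time.tzinfo)
    return timedelta(0) <= end - start_time <= MAX_SORT_WINDOW
```

For example, 25 search queries would be sent as three calls via `chunked(queries, MAX_SEARCH_QUERIES)`, and a `duration` sort over an eight-day range would be rejected locally before it reaches the server.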

Sessions

Tool	Description
`search_sessions`	Search and filter sessions by session ID, trace count, latency, total cost, or behaviors. Supports time ranges, sorting, and pagination.
`get_session_detail`	Get session timestamps, trace count, latency, cost, and token usage.
`get_session_trace_ids`	Get all trace IDs in a session.
`get_session_trace_behaviors`	Get behaviors detected across traces in a session, grouped by behavior.

Behaviors

Tool	Description
`list_behaviors`	List all behaviors for the project with stats.
`get_behavior_detail`	Get full details for a behavior including scorer prompt, configuration, and stats.
`create_binary_behavior`	Create a binary (yes/no) behavior. The judge LLM uses your prompt to decide true/false on each qualifying span. Requires the developer role.
`create_classifier_behavior`	Create a classifier (multi-class) behavior. The judge LLM picks one of the supplied options for each qualifying span. Requires the developer role.
`update_behavior`	Update a behavior's description. Requires the developer role.
`delete_behavior`	Delete a behavior. Optionally also deletes the underlying scorer. Requires the admin role.

Judges

Tool	Description
`list_judges`	List every judge in a project, including prompt, code, and custom (uploaded) judges with their current configuration and online-evaluation settings.
`get_judge`	Get full detail for a single judge, including all versions, prompts, categories, and online-evaluation settings.
`list_judge_models`	List the models available for use as the LLM backing a judge.
`create_judge`	Create a new prompt judge in a project. Specify a name, model, prompt, and score type (`binary`, `numeric`, or `categorical`). Requires the developer role.
`update_judge`	Update a judge — model, prompt, description, score type, categories, score bounds, agent prompts, or version metadata. Pass `target_major_version`/`target_minor_version` to update a specific version; otherwise the latest version is updated. Requires the developer role.
`set_judge_tag`	Add or remove a version tag (e.g. `prod`) on a specific version of a judge. Requires the developer role.
`delete_judges`	Delete one or more judges by ID. Behaviors that reference these judges are also removed. Requires the developer role.
`get_judge_settings`	Get advanced evaluation settings for a judge.
`update_judge_settings`	Update how often and on which spans a judge runs online. Set `evaluation_mode: continuous` with a sampling rate for automatic evaluation, or `on_demand` for manual invocation. Requires the developer role.

Prompts

Tool	Description
`list_prompts`	List all prompts in a project with their version count and last updated timestamp.
`get_prompt`	Get the content and metadata for a specific prompt version. Returns the latest version by default; optionally pass `commit_id` or `tag` to fetch a specific version.
`get_prompt_versions`	List all committed versions of a prompt, ordered newest first, including tags and author info.
`commit_prompt`	Commit a new version of a prompt. Creates the prompt if it does not exist yet. Optionally apply tags (e.g. `prod`) to the new version. Requires the developer role.
`tag_prompt`	Add one or more tags to a specific version (commit) of a prompt. Tags like `prod` or `staging` let you pin a version for retrieval. Requires the developer role.
`untag_prompt`	Remove one or more tags from a prompt. The underlying versions are not deleted. Requires the developer role.

Agents

Tool	Description
`create_agent`	Create a custom agent config for a project. Accepts a name, optional description, optional instructions, an optional scheduled trigger (daily or weekly at a given hour/minute, with optional `dayOfWeek` or `daysOfWeek` and timezone), and an optional `slack` delivery config (`{ enabled, channels }`) that posts the final output of each scheduled run to the listed Slack channel IDs. Slack delivery requires an enabled scheduled trigger and the organization to have Slack connected. Requires the developer role.

Agent Threads

Tool	Description
`list_agent_threads`	List agent thread conversations for the authenticated user in a project, including title, type, message count, run status, and timestamps. Optionally filter by agent kind (`global_copilot`, `custom_agent`) and limit results (max 100).
`get_agent_thread`	Get a single agent thread conversation including its full message transcript, metadata, active run status, and timestamps.
`set_agent_thread_project`	Assign a project to an agent thread that was created without one (for example, from a Slack mention). The target project must belong to the same organization. Requires the developer role.
`ask_judgment_agent`	Ask Judgment Agent a question by starting a durable agent thread run. Creates a new thread or continues an existing one (`thread_id`). Defaults to the `global_copilot` agent; pass `agent_type: custom_agent` with an `agent_name` to target a custom agent. Returns `thread_id` and `run_id` immediately; poll with `get_judgment_agent_run` for the answer.
`get_judgment_agent_run`	Get the status and completed answer for a Judgment Agent run started by `ask_judgment_agent`. Returns the assistant answer scoped to the requested `run_id` once the run is `completed`, along with any `last_error`.
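Because `ask_judgment_agent` returns immediately, the caller has to poll `get_judgment_agent_run` until the run finishes. A minimal sketch of that loop is below. It takes the tool-calling function as a parameter so it does not depend on any particular MCP client. The `question` argument name and the `status` response field are assumptions; only `thread_id`, `run_id`, `last_error`, and the `completed` state come from the table above.

```python
import time
from typing import Any, Callable, Dict

# Any function that invokes an MCP tool by name and returns its result.
ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def ask_and_wait(
    call_tool: ToolCaller,
    organization_id: str,
    question: str,
    poll_seconds: float = 2.0,
    timeout_seconds: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Start a Judgment Agent run and poll until it completes or errors."""
    started = call_tool(
        "ask_judgment_agent",
        {"organization_id": organization_id, "question": question},
    )
    deadline = time.monotonic() + timeout_seconds
    while True:
        run = call_tool(
            "get_judgment_agent_run",
            {
                "organization_id": organization_id,
                "thread_id": started["thread_id"],
                "run_id": started["run_id"],
            },
        )
        # Stop on success, or as soon as the run reports an error.
        if run.get("status") == "completed" or run.get("last_error"):
            return run
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run {started['run_id']} did not finish in time")
        sleep(poll_seconds)
```

Injecting `sleep` keeps the loop easy to test, and the timeout guards against a run that never reaches `completed`.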

Datasets

Tool	Description
`list_datasets`	List all datasets in a project with entry counts and version info. Use the returned `dataset_id` to filter traces by dataset via `search_traces`.
`get_dataset`	Get dataset details including paginated examples with their trace IDs and data. Optionally specify a `version` number to view a historical snapshot.
`get_dataset_versions`	Get version history for a dataset including item counts per version.
`get_dataset_item_ids`	Get all example IDs in a dataset, optionally filtered by version.
`create_dataset`	Create a new dataset in a project. Returns the new dataset ID. Requires the developer role.
`add_traces_to_dataset`	Add traces to one or more datasets. Creates example entries from the traces and increments the dataset version. Requires the developer role.
`add_examples_to_dataset`	Add existing examples to one or more datasets, incrementing the dataset version. Requires the developer role.
`delete_dataset`	Delete a dataset. Fails if the dataset is still referenced by other resources. Requires the admin role.
`bulk_delete_datasets`	Delete multiple datasets at once. Fails if any dataset is still referenced by other resources. Requires the admin role.

Tests

Tool	Description
`list_tests`	List judgment test runs for a project with aggregate judge score summaries. Supports `limit` and `offset` pagination.
`get_test`	Get metadata for a single judgment test run by its experiment run ID.
`get_test_example_items`	Get the per-example table for a judgment test run, including example data and judge scores.
`get_test_live_results`	Get live per-example evaluation status and partial results for a queued judgment test run. Use this for streaming progress before the final results are written.
`get_test_graph`	Get aggregate score graph data for a judgment test run, keyed by judge.
`run_test`	Queue a judgment test run over existing dataset/example IDs. Examples are loaded from storage before evaluation. Use `run_test_on_traces` when starting from monitoring trace IDs. Requires the developer role.
`run_test_on_traces`	Queue a judgment test run over existing monitoring traces, scored by a fully specified ephemeral draft judge. Traces are copied to offline storage before evaluation. Returns the experiment run ID plus evaluation run mappings for live polling. Requires the developer role.
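`list_tests` pages its results with `limit` and `offset`, so collecting every test run means walking the offsets until a short page comes back. The sketch below shows that loop in a client-agnostic way. `iter_pages` and the `fetch_page` callback are hypothetical, and it assumes each call returns a plain list of runs, which may not match the real response shape.

```python
from typing import Any, Callable, Dict, Iterator, List

# Callback that fetches one page given (limit, offset).
PageFetcher = Callable[[int, int], List[Dict[str, Any]]]


def iter_pages(fetch_page: PageFetcher, page_size: int = 50) -> Iterator[Dict[str, Any]]:
    """Yield every item from a limit/offset paginated listing."""
    if page_size < 1:
        raise ValueError("page_size must be positive")
    offset = 0
    while True:
        page = fetch_page(page_size, offset)
        yield from page
        # A page shorter than the limit means there is nothing left.
        if len(page) < page_size:
            return
        offset += page_size
```

In practice `fetch_page` would wrap a `list_tests` call with the given `limit` and `offset` and return the test runs from its response.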

Automations

Tool	Description
`list_automations`	List all automations with their conditions, actions, and active status.
`get_automation`	Get a single automation by ID.
`create_automation`	Create an automation that watches behavior/latency/cost metrics and fires actions when conditions match. Requires the developer role.
`update_automation`	Update an existing automation. All fields other than the IDs are optional. Use `active: true/false` to enable or disable. Requires the developer role.
`delete_automation`	Delete an automation. Requires the admin role.

Agent Memory

Tool	Description
`list_agent_memory_entries`	List all Agent Memory entries (folders and files) for a project.
`fetch_agent_memory_files`	Fetch Agent Memory files by ID and/or path. Folders are ignored and files are returned in request order. Accepts up to 200 IDs or paths per call.
`search_agent_memory_files`	Search Agent Memory files by path and body. Returns concise file references and snippets — call `fetch_agent_memory_files` with the result IDs or paths for full bodies. Optional `limit` (max 20).
`write_memory`	Create or update an Agent Memory file by path. Use this when an agent learns durable project context that should be available in future sessions. Requires the developer role.

Documentation

Tool	Description
`search_docs`	Hybrid semantic + keyword search over Judgment documentation. Returns matching doc sections with titles, headings, paths, and content snippets.
`read_doc_page`	Read the full markdown content of a Judgment documentation page by path.
