MCP Server
Use Judgment's MCP server to query traces, behaviors, sessions, and more directly from your AI code editor
The Judgment MCP server exposes your production data — traces, sessions, behaviors, judges, projects, and automations — directly to AI-powered code editors via the Model Context Protocol. This lets your AI assistant query real production data, analyze agent performance, create and manage behaviors and automations, and use those insights to optimize your code — all without leaving your editor.
Setup
Connect the MCP Server
The Judgment MCP server supports two authentication methods:
- OAuth 2.1 + PKCE (recommended) — Your editor opens a browser window where you sign in and authorize access. After completing the flow, your editor holds an access token automatically — no API keys or headers needed.
- API Key — Pass your Judgment API key as a Bearer token.
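Both methods point at the same server entry; the API key variant only adds an Authorization header. As a rough illustration, here is a minimal Python sketch that builds that entry for either method. The helper name `judgment_server_entry` is hypothetical and not part of any Judgment tooling; the URL, the `judgment-mcp` name, and the `url` / `serverUrl` keys are taken from the editor configs on this page.

```python
from typing import Optional

MCP_URL = "https://mcp.judgmentlabs.ai"


def judgment_server_entry(api_key: Optional[str] = None, url_key: str = "url") -> dict:
    """Build an `mcpServers` block for the Judgment MCP server.

    url_key is "url" for Cursor and "serverUrl" for Windsurf (per the configs
    on this page). With no api_key, no headers are written and the editor is
    expected to run the OAuth flow on first use.
    """
    entry = {url_key: MCP_URL}
    if api_key:
        # API key auth: the key is sent as a standard Bearer token.
        entry["headers"] = {"Authorization": f"Bearer {api_key}"}
    return {"mcpServers": {"judgment-mcp": entry}}
```

Passing the result to `json.dumps(..., indent=2)` yields JSON in the same shape as the editor configs. Reading the key from an environment variable rather than hard-coding it keeps it out of version control.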
Add the following to your ~/.cursor/mcp.json (global) or .cursor/mcp.json (project-level):
OAuth (recommended)
{
"mcpServers": {
"judgment-mcp": {
"url": "https://mcp.judgmentlabs.ai"
}
}
}
Cursor will detect the OAuth server and prompt you to authorize in the browser on first use.
API Key
{
"mcpServers": {
"judgment-mcp": {
"url": "https://mcp.judgmentlabs.ai",
"headers": {
"Authorization": "Bearer <YOUR_JUDGMENT_API_KEY>"
}
}
}
}
Run the following command to add the Judgment MCP server:
OAuth (recommended)
claude mcp add judgment-mcp \
--transport http \
https://mcp.judgmentlabs.ai
Claude Code will open a browser window to complete authorization the first time a tool is called.
API Key
claude mcp add judgment-mcp \
--transport http \
https://mcp.judgmentlabs.ai \
--header "Authorization: Bearer <YOUR_JUDGMENT_API_KEY>"
Add the following to your ~/.codeium/windsurf/mcp_config.json:
OAuth (recommended)
{
"mcpServers": {
"judgment-mcp": {
"serverUrl": "https://mcp.judgmentlabs.ai"
}
}
}
Windsurf will detect the OAuth server and prompt you to authorize in the browser on first use.
API Key
{
"mcpServers": {
"judgment-mcp": {
"serverUrl": "https://mcp.judgmentlabs.ai",
"headers": {
"Authorization": "Bearer <YOUR_JUDGMENT_API_KEY>"
}
}
}
}
Add the following to your ~/.codex/config.toml:
OAuth (recommended)
[mcp_servers.judgment-mcp]
url = "https://mcp.judgmentlabs.ai"
Or add an MCP server with the Codex CLI:
codex mcp add judgment-mcp \
-- npx -y mcp-remote https://mcp.judgmentlabs.ai
Codex will detect the OAuth server and prompt you to authorize in the browser on first use.
API Key
[mcp_servers.judgment-mcp]
url = "https://mcp.judgmentlabs.ai"
bearer_token_env_var = "JUDGMENT_API_KEY"
Or via the CLI:
codex mcp add judgment-mcp \
-- npx -y mcp-remote https://mcp.judgmentlabs.ai \
--header "Authorization: Bearer <YOUR_JUDGMENT_API_KEY>"
Add the Best Practices Skill
To help your AI assistant use the MCP server effectively, add the Judgment MCP best practices skill. This teaches your assistant optimal patterns like batching queries, using full-text search first, and deduplicating results.
Cursor
mkdir -p .cursor/skills/judgment-mcp
curl -fLo .cursor/skills/judgment-mcp/SKILL.md \
https://docs.judgmentlabs.ai/skills/mcp-server-best-practices.md
Claude Code
mkdir -p .claude/skills/mcp-server-best-practices
curl -fLo .claude/skills/mcp-server-best-practices/SKILL.md \
https://docs.judgmentlabs.ai/skills/mcp-server-best-practices.md
Windsurf
mkdir -p .windsurf/skills/judgment-mcp
curl -fLo .windsurf/skills/judgment-mcp/SKILL.md \
https://docs.judgmentlabs.ai/skills/mcp-server-best-practices.md
Codex
mkdir -p .codex/skills/judgment-mcp
curl -fLo .codex/skills/judgment-mcp/SKILL.md \
https://docs.judgmentlabs.ai/skills/mcp-server-best-practices.md
What You Can Do
Once connected, your AI assistant can query and manage Judgment data through natural language. Here are some examples:
- "Show me the slowest traces from the last 24 hours"
- "Find traces where users asked about billing"
- "What behaviors were detected in session X?"
- "List all automations configured for this project"
- "Show me traces with errors that cost more than $0.10"
- "Create a binary behavior that checks whether responses are on-topic"
- "Re-evaluate all traces in the project with the Relevance judge"
- "Add a 'reviewed' tag to trace XYZ"
- "Create an automation that alerts when error rate exceeds 10%"
- "Write a memory file with key facts the agent should remember across sessions"
- "List all agent memory entries for this project"
- "Run a test on the last 50 traces with my Accuracy judge"
- "Show me the test results for my latest experiment run"
- "Add these traces to the golden-path dataset"
- "Create a new dataset called 'edge-cases' and add these examples to it"
Use Production Data to Optimize Your Code
Beyond querying data, the MCP server enables a powerful workflow: use real production insights to improve your agent code. Your AI assistant can:
- Find failing patterns — search traces for errors, high latency, or unexpected behaviors, then fix the underlying code
- Analyze behavior trends — check which behaviors are firing most often and adjust your prompts or logic accordingly
- Optimize costs — identify expensive traces and refactor the agent flows that produce them
- Debug sessions — walk through an entire user session's traces to understand where things went wrong
Available Tools
The MCP server provides 72 tools organized into fourteen categories. Every tool except list_organizations takes an organization_id argument that selects the organization the call operates on.
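Your editor's MCP client builds tool calls for you, but it can help to see what the organization_id rule looks like on the wire. The sketch below assembles a standard MCP `tools/call` JSON-RPC request and enforces that rule. The helper name `tool_call_request` and the `org_123` value in the usage note are placeholders, and argument names other than organization_id are not confirmed by this page.

```python
from itertools import count
from typing import Any, Optional

_request_ids = count(1)  # simple increasing JSON-RPC request ids


def tool_call_request(name: str, organization_id: Optional[str] = None, **arguments: Any) -> dict:
    """Build the body of an MCP `tools/call` request for a Judgment tool.

    Every tool except list_organizations needs organization_id, so it is
    injected into the arguments here and its absence is treated as an error.
    """
    if name != "list_organizations":
        if not organization_id:
            raise ValueError(f"{name} requires organization_id")
        arguments = {"organization_id": organization_id, **arguments}
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
```

For example, `tool_call_request("list_projects", organization_id="org_123")` produces a request whose arguments contain only the organization ID, while `tool_call_request("list_organizations")` sends an empty arguments object.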
Organizations
| Tool | Description |
|---|---|
| list_organizations | List the organizations the authenticated user is a member of. Use the returned organization_id as input to all other MCP tools. |
Projects
| Tool | Description |
|---|---|
| list_projects | List all projects in your organization with summary stats (datasets, experiment runs, traces, behaviors). |
| create_project | Create a new project in your organization. Requires the developer role. |
| add_project_favorite | Mark a project as a favorite so it appears pinned in the UI. |
| remove_project_favorite | Remove a project from your favorites. |
Traces
| Tool | Description |
|---|---|
| search_traces | Batch search up to 10 queries per call. Filter by duration, error, span name, customer ID, session ID, tags, LLM cost, behaviors, scores, or full-text search. Sort by created_at (default), span_name, duration, or llm_cost; any sort other than created_at descending requires time_range.start_time and a time window of at most 7 days. |
| get_trace_detail | Get duration, cost, and session info for a single trace. |
| get_trace_spans | Get all spans for a trace. |
| get_trace_span | Batch get span details (including scores and annotations) for up to 20 trace/span pairs. |
| get_trace_tags | Get tags for a trace. |
| get_trace_behaviors | Get behavior results (binary/categorical scores) for a trace. |
| add_trace_tags | Attach one or more string tags to an existing trace. Tags are additive: existing tags are preserved. Requires the developer role. |
| evaluate_traces | Trigger online evaluation for specific traces (up to 100). Optionally restrict to named judges. Requires the developer role. |
| evaluate_all_traces | Trigger online evaluation for all recent traces (up to a configurable limit, default 1000). Optionally restrict to named judges. Requires the developer role. |
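The per-call limits in the table above (10 queries for search_traces, 20 trace/span pairs for get_trace_span, 100 traces for evaluate_traces) mean larger workloads have to be split into batches. The following sketch shows one way to do that, plus a pre-check for the sort restriction on search_traces. The helper names are hypothetical, and the assumption that the caller supplies an explicit end time for the window is ours; the page only states the start_time requirement and the 7-day cap.

```python
from datetime import datetime, timedelta
from typing import Iterator, List, Optional, Sequence, TypeVar

T = TypeVar("T")

# Limits quoted in the Traces table.
SEARCH_TRACES_MAX_QUERIES = 10
GET_TRACE_SPAN_MAX_PAIRS = 20
EVALUATE_TRACES_MAX_IDS = 100
MAX_SORT_WINDOW = timedelta(days=7)


def chunked(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive batches of at most `size` items, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def sort_is_allowed(sort_by: str, descending: bool,
                    start_time: Optional[datetime], end_time: datetime) -> bool:
    """Check the search_traces sort rule before sending a query.

    created_at descending is always allowed; any other sort needs a start
    time and a window of at most 7 days.
    """
    if sort_by == "created_at" and descending:
        return True
    if start_time is None:
        return False
    return end_time - start_time <= MAX_SORT_WINDOW
```

With these helpers, 25 search queries become three search_traces calls (10, 10, and 5 queries), and a sort by llm_cost over an 8-day window is rejected locally instead of by the server.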
Sessions
| Tool | Description |
|---|---|
| search_sessions | Search and filter sessions by session ID, trace count, latency, total cost, or behaviors. Supports time ranges, sorting, and pagination. |
| get_session_detail | Get session timestamps, trace count, latency, cost, and token usage. |
| get_session_trace_ids | Get all trace IDs in a session. |
| get_session_trace_behaviors | Get behaviors detected across traces in a session, grouped by behavior. |
Behaviors
| Tool | Description |
|---|---|
| list_behaviors | List all behaviors for the project with stats. |
| get_behavior_detail | Get full details for a behavior including scorer prompt, configuration, and stats. |
| create_binary_behavior | Create a binary (yes/no) behavior. The judge LLM uses your prompt to decide true/false on each qualifying span. Requires the developer role. |
| create_classifier_behavior | Create a classifier (multi-label) behavior. The judge LLM picks one of the supplied options for each qualifying span. Requires the developer role. |
| update_behavior | Update a behavior's description. Requires the developer role. |
| delete_behavior | Delete a behavior. Optionally also deletes the underlying scorer. Requires the admin role. |
Judges
| Tool | Description |
|---|---|
| list_judges | List every judge in a project, including prompt, code, and custom (uploaded) judges with their current configuration and online-evaluation settings. |
| get_judge | Get full detail for a single judge, including all versions, prompts, categories, and online-evaluation settings. |
| list_judge_models | List the models available for use as the LLM backing a judge. |
| create_judge | Create a new prompt judge in a project. Specify a name, model, prompt, and score type (binary, numeric, or categorical). Requires the developer role. |
| update_judge | Update a judge: model, prompt, description, score type, categories, score bounds, agent prompts, or version metadata. Pass target_major_version/target_minor_version to update a specific version; otherwise the latest version is updated. Requires the developer role. |
| set_judge_tag | Add or remove a version tag (e.g. prod) on a specific version of a judge. Requires the developer role. |
| delete_judges | Delete one or more judges by ID. Behaviors that reference these judges are also removed. Requires the developer role. |
| get_judge_settings | Get advanced evaluation settings for a judge. |
| update_judge_settings | Update how often and on which spans a judge runs online. Set evaluation_mode: continuous with a sampling rate for automatic evaluation, or on_demand for manual invocation. Requires the developer role. |
Prompts
| Tool | Description |
|---|---|
| list_prompts | List all prompts in a project with their version count and last updated timestamp. |
| get_prompt | Get the content and metadata for a specific prompt version. Returns the latest version by default; optionally pass commit_id or tag to fetch a specific version. |
| get_prompt_versions | List all committed versions of a prompt, ordered newest first, including tags and author info. |
| commit_prompt | Commit a new version of a prompt. Creates the prompt if it does not exist yet. Optionally apply tags (e.g. prod) to the new version. Requires the developer role. |
| tag_prompt | Add one or more tags to a specific version (commit) of a prompt. Tags like prod or staging let you pin a version for retrieval. Requires the developer role. |
| untag_prompt | Remove one or more tags from a prompt. The underlying versions are not deleted. Requires the developer role. |
Agents
| Tool | Description |
|---|---|
| create_agent | Create a custom agent config for a project. Accepts a name, optional description, optional instructions, an optional scheduled trigger (daily or weekly at a given hour/minute, with optional dayOfWeek or daysOfWeek and timezone), and an optional slack delivery config ({ enabled, channels }) that posts the final output of each scheduled run to the listed Slack channel IDs. Slack delivery requires an enabled scheduled trigger and the organization to have Slack connected. Requires the developer role. |
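The create_agent description packs several constraints into one sentence, so a small argument builder can make them easier to see. This is only a sketch: `dayOfWeek`, `timezone`, and the slack `{ enabled, channels }` shape come from the description above, while `project_id`, `trigger`, `frequency`, `hour`, `minute`, and the trigger's `enabled` flag are assumed names that may differ from the real schema.

```python
from typing import Any, Dict, List, Optional


def create_agent_arguments(
    organization_id: str,
    project_id: str,  # assumed argument name
    name: str,
    *,
    description: Optional[str] = None,
    instructions: Optional[str] = None,
    trigger: Optional[Dict[str, Any]] = None,  # assumed key and shape
    slack_channels: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Assemble create_agent arguments and enforce the Slack/trigger rule."""
    args: Dict[str, Any] = {
        "organization_id": organization_id,
        "project_id": project_id,
        "name": name,
    }
    if description is not None:
        args["description"] = description
    if instructions is not None:
        args["instructions"] = instructions
    if trigger is not None:
        args["trigger"] = trigger
    if slack_channels:
        # Slack delivery only makes sense for scheduled runs.
        if not (trigger and trigger.get("enabled")):
            raise ValueError("Slack delivery requires an enabled scheduled trigger")
        args["slack"] = {"enabled": True, "channels": list(slack_channels)}
    return args
```

A weekly digest agent might pass a trigger such as `{"enabled": True, "frequency": "weekly", "hour": 9, "minute": 0, "dayOfWeek": 1, "timezone": "UTC"}` together with a list of Slack channel IDs. The server also requires Slack to be connected for the organization, which this local check cannot verify.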
Agent Threads
| Tool | Description |
|---|---|
| list_agent_threads | List agent thread conversations for the authenticated user in a project, including title, type, message count, run status, and timestamps. Optionally filter by agent kind (global_copilot, custom_agent) and limit results (max 100). |
| get_agent_thread | Get a single agent thread conversation including its full message transcript, metadata, active run status, and timestamps. |
| set_agent_thread_project | Assign a project to an agent thread that was created without one (for example, from a Slack mention). The target project must belong to the same organization. Requires the developer role. |
| ask_judgment_agent | Ask Judgment Agent a question by starting a durable agent thread run. Creates a new thread or continues an existing one (thread_id). Defaults to the global_copilot agent; pass agent_type: custom_agent with an agent_name to target a custom agent. Returns thread_id and run_id immediately; poll with get_judgment_agent_run for the answer. |
| get_judgment_agent_run | Get the status and completed answer for a Judgment Agent run started by ask_judgment_agent. Returns the assistant answer scoped to the requested run_id once the run is completed, along with any last_error. |
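Because ask_judgment_agent returns immediately and the answer arrives later through get_judgment_agent_run, callers need a polling loop. The sketch below shows one possible shape for it. `call_tool` stands in for whatever function your MCP client exposes to invoke a tool and return its parsed result; the `question` argument and the `status` and `answer` result fields are assumed names, while thread_id, run_id, and last_error come from the table above.

```python
import time
from typing import Any, Callable, Dict

ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def ask_and_wait(call_tool: ToolCaller, organization_id: str, question: str,
                 poll_interval: float = 2.0, timeout: float = 120.0,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Start a Judgment Agent run and poll until it completes or fails."""
    started = call_tool("ask_judgment_agent", {
        "organization_id": organization_id,
        "question": question,  # assumed argument name
    })
    run_args = {
        "organization_id": organization_id,
        "thread_id": started["thread_id"],
        "run_id": started["run_id"],
    }
    deadline = time.monotonic() + timeout
    while True:
        run = call_tool("get_judgment_agent_run", run_args)
        if run.get("status") == "completed":  # assumed field and value
            return run["answer"]  # assumed field name
        if run.get("last_error"):
            raise RuntimeError(f"agent run failed: {run['last_error']}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run {run_args['run_id']} did not finish in {timeout}s")
        sleep(poll_interval)
```

The `sleep` parameter is injectable so the loop can be exercised in tests without real delays. To continue an existing conversation, the table indicates that thread_id can also be passed to ask_judgment_agent.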
Datasets
| Tool | Description |
|---|---|
| list_datasets | List all datasets in a project with entry counts and version info. Use the returned dataset_id to filter traces by dataset via search_traces. |
| get_dataset | Get dataset details including paginated examples with their trace IDs and data. Optionally specify a version number to view a historical snapshot. |
| get_dataset_versions | Get version history for a dataset including item counts per version. |
| get_dataset_item_ids | Get all example IDs in a dataset, optionally filtered by version. |
| create_dataset | Create a new dataset in a project. Returns the new dataset ID. Requires the developer role. |
| add_traces_to_dataset | Add traces to one or more datasets. Creates example entries from the traces and increments the dataset version. Requires the developer role. |
| add_examples_to_dataset | Add existing examples to one or more datasets, incrementing the dataset version. Requires the developer role. |
| delete_dataset | Delete a dataset. Fails if the dataset is still referenced by other resources. Requires the admin role. |
| bulk_delete_datasets | Delete multiple datasets at once. Fails if any dataset is still referenced by other resources. Requires the admin role. |
Tests
| Tool | Description |
|---|---|
| list_tests | List judgment test runs for a project with aggregate judge score summaries. Supports limit and offset pagination. |
| get_test | Get metadata for a single judgment test run by its experiment run ID. |
| get_test_example_items | Get the per-example table for a judgment test run, including example data and judge scores. |
| get_test_live_results | Get live per-example evaluation status and partial results for a queued judgment test run. Use this for streaming progress before the final results are written. |
| get_test_graph | Get aggregate score graph data for a judgment test run, keyed by judge. |
| run_test | Queue a judgment test run over existing dataset/example IDs. Examples are loaded from storage before evaluation. Use run_test_on_traces when starting from monitoring trace IDs. Requires the developer role. |
| run_test_on_traces | Queue a judgment test run over existing monitoring traces, scored by a fully specified ephemeral draft judge. Traces are copied to offline storage before evaluation. Returns the experiment run ID plus evaluation run mappings for live polling. Requires the developer role. |
Automations
| Tool | Description |
|---|---|
| list_automations | List all automations with their conditions, actions, and active status. |
| get_automation | Get a single automation by ID. |
| create_automation | Create an automation that watches behavior/latency/cost metrics and fires actions when conditions match. Requires the developer role. |
| update_automation | Update an existing automation. All fields other than the IDs are optional. Use active: true/false to enable or disable. Requires the developer role. |
| delete_automation | Delete an automation. Requires the admin role. |
Agent Memory
| Tool | Description |
|---|---|
| list_agent_memory_entries | List all Agent Memory entries (folders and files) for a project. |
| fetch_agent_memory_files | Fetch Agent Memory files by ID and/or path. Folders are ignored and files are returned in request order. Accepts up to 200 IDs or paths per call. |
| search_agent_memory_files | Search Agent Memory files by path and body. Returns concise file references and snippets; call fetch_agent_memory_files with the result IDs or paths for full bodies. Optional limit (max 20). |
| write_memory | Create or update an Agent Memory file by path. Use this when an agent learns durable project context that should be available in future sessions. Requires the developer role. |
Documentation
| Tool | Description |
|---|---|
| search_docs | Hybrid semantic + keyword search over Judgment documentation. Returns matching doc sections with titles, headings, paths, and content snippets. |
| read_doc_page | Read the full markdown content of a Judgment documentation page by path. |