JudgmentClient API Reference
Complete reference for the JudgmentClient Python SDK
The JudgmentClient is your primary interface for interacting with the Judgment platform. It provides methods for running evaluations, managing datasets, handling traces, and more.
Authentication
Set up your credentials using environment variables:
export JUDGMENT_API_KEY="your_api_key_here"
export JUDGMENT_ORG_ID="your_organization_id_here"
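With JUDGMENT_API_KEY and JUDGMENT_ORG_ID exported, the examples in this reference construct the client with no arguments; a minimal sketch:
from judgeval import JudgmentClient

# Credentials are picked up from JUDGMENT_API_KEY and JUDGMENT_ORG_ID.
client = JudgmentClient()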
client.run_evaluation()
Execute an evaluation of examples using one or more scorers to measure performance and quality of your AI models.
Parameters
Parameter | Type | Default | Description
---|---|---|---
examples | List[Example] | | The examples to evaluate against your AI model, e.g. [Example(...)]
scorers | List[APIJudgmentScorer] | | List of scorers to use for evaluation, e.g. [APIJudgmentScorer(...)]
model | str | gpt-4.1 | Model used as judge when using LLM as a Judge, e.g. "gpt-4o-mini"
project_name | str | default_project | Name of the project for organization, e.g. "my_qa_project"
eval_run_name | str | default_eval_run | Unique name for this evaluation run, e.g. "experiment_v1"
override | bool | False | Whether to override an existing evaluation run with the same name
append | bool | False | Whether to append to an existing evaluation run with the same name
async_execution | bool | False | Whether to execute the evaluation asynchronously
Returns
A list of ScoringResult objects.
Example Code
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import AnswerRelevancyScorer

client = JudgmentClient()

examples = [
    Example(
        input="What is the capital of France?",
        actual_output="Paris is the capital of France.",
        expected_output="Paris"
    )
]

results = client.run_evaluation(
    examples=examples,
    scorers=[AnswerRelevancyScorer(threshold=0.9)],
    project_name="geography_qa"
)
Response
[
    ScoringResult(
        success=False,
        scorers_data=[ScorerData(...)],
        name=None,
        data_object=Example(...),
        trace_id=None,
        run_duration=None,
        evaluation_cost=None
    )
]
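Each ScoringResult carries the fields shown in the sample response above, so a run can be inspected programmatically; a minimal sketch, assuming the results list from the example above:
# Iterate over the ScoringResult objects returned by run_evaluation.
for result in results:
    # Each result exposes success, scorers_data, data_object, etc. (see sample response).
    print(f"Passed: {result.success}")
    # scorers_data holds the per-scorer ScorerData details for this example.
    for scorer_data in result.scorers_data:
        print(scorer_data)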
client.run_trace_evaluation()
Execute trace-based evaluation using function calls and tracing to evaluate agent behavior and execution flows.
Parameters
Parameter | Type | Default | Description
---|---|---|---
scorers | List[APIJudgmentScorer] | | List of scorers to use for evaluation, e.g. [APIJudgmentScorer(...)]
examples | List[Example] | | Examples to run through the function (required if using function), e.g. [Example(...)]
function | Callable | | Function to execute and trace for evaluation
tracer | Union[Tracer, BaseCallbackHandler] | | The tracer object used in tracing your agent
traces | List[Trace] | | Pre-existing traces to evaluate instead of generating new ones
project_name | str | default_project | Name of the project for organization, e.g. "agent_evaluation"
eval_run_name | str | default_eval_run | Unique name for this trace evaluation run, e.g. "agent_trace_v1"
override | bool | False | Whether to override an existing evaluation run with the same name
append | bool | False | Whether to append to an existing evaluation run with the same name
Returns
A list of ScoringResult objects.
Example Code
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import ToolOrderScorer
from judgeval.tracer import Tracer

client = JudgmentClient()
tracer = Tracer()

def my_agent_function(query: str) -> str:
    """Your agent function to be traced and evaluated"""
    response = f"Processing query: {query}"
    return response

examples = [
    Example(
        input={"query": "What is the weather like?"},
        expected_output="I'll help you check the weather."
    )
]

results = client.run_trace_evaluation(
    scorers=[ToolOrderScorer()],
    examples=examples,
    function=my_agent_function,
    tracer=tracer,
    project_name="agent_evaluation"
)
Response
[
    ScoringResult(
        success=False,
        scorers_data=[ScorerData(...)],
        name=None,
        data_object=Example(...),
        trace_id=None,
        run_duration=None,
        evaluation_cost=None
    )
]
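If traces have already been collected, they can be scored directly through the traces parameter instead of re-running the agent through function and examples. A minimal sketch, assuming existing_traces is a List[Trace] you captured earlier:
from judgeval.scorers import ToolOrderScorer

# existing_traces: List[Trace] collected from earlier agent runs (assumed to exist).
results = client.run_trace_evaluation(
    scorers=[ToolOrderScorer()],
    traces=existing_traces,
    project_name="agent_evaluation",
    eval_run_name="agent_trace_replay"
)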
client.create_dataset()
Create a new evaluation dataset for storage and reuse across multiple evaluation runs.
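A typical flow is to create an empty dataset, add examples locally, and then store it with client.push_dataset() (covered next). A minimal sketch, mirroring the add_examples usage shown in the push_dataset example below:
from judgeval.data import Example

# Create an empty dataset and populate it locally before pushing it.
dataset = client.create_dataset()
dataset.add_examples([
    Example(
        input="What is machine learning?",
        actual_output="Machine learning is a subset of AI...",
        expected_output="Machine learning is a method of data analysis..."
    )
])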
client.push_dataset()
Upload an evaluation dataset to the Judgment platform for storage and reuse across multiple evaluation runs.
Parameters
Parameter | Type | Default | Description
---|---|---|---
alias | str | | Unique name for the dataset within the project, e.g. "qa_dataset_v1"
dataset | EvalDataset | | Dataset object containing examples and metadata
project_name | str | | Project name where the dataset will be stored, e.g. "question_answering"
overwrite | bool | False | Whether to overwrite existing dataset with same alias
Returns
Returns True if successful.
Example Code
from judgeval.data import Example

dataset = client.create_dataset()
dataset.add_examples([
    Example(
        input="What is machine learning?",
        actual_output="Machine learning is a subset of AI...",
        expected_output="Machine learning is a method of data analysis..."
    )
])

success = client.push_dataset(
    alias="ml_qa_dataset_v2",
    dataset=dataset,
    project_name="machine_learning_qa",
    overwrite=True
)
Response
True
client.pull_dataset()
Retrieve a saved dataset from the Judgment platform to use in evaluations or analysis.
Parameters
Parameter | Type | Default | Description
---|---|---|---
alias | str | | The alias of the dataset to retrieve, e.g. "qa_dataset_v1"
project_name | str | | Project name where the dataset is stored, e.g. "question_answering"
Returns
An EvalDataset object.
Example Code
dataset = client.pull_dataset(
    alias="qa_dataset_v1",
    project_name="question_answering"
)

print(f"Dataset has {len(dataset.examples)} examples")

# my_scorers: your list of scorer instances, e.g. [AnswerRelevancyScorer(threshold=0.9)]
results = client.run_evaluation(
    examples=dataset.examples,
    scorers=my_scorers,
    project_name="question_answering"
)
Response
EvalDataset(
    examples=[
        Example(
            input="What is the capital of France?",
            actual_output="Paris",
            expected_output="Paris"
        )
    ],
    metadata={
        "created_at": "2024-01-15T10:30:00Z",
        "examples_count": 1
    }
)
client.append_dataset()
Append examples to an existing dataset.
Parameters
Parameter | Type | Default | Description
---|---|---|---
alias | str | | Unique name for the dataset within the project, e.g. "qa_dataset_v1"
examples | List[Example] | | List of examples to append to the dataset, e.g. [Example(...)]
project_name | str | | Project name where the dataset is stored, e.g. "question_answering"
Returns
Returns True if successful.
Example Code
from judgeval.data import Example

# Pull the existing dataset (optional: confirms it exists before appending).
dataset = client.pull_dataset(
    alias="qa_dataset_v1",
    project_name="question_answering"
)

examples = [
    Example(
        input="What is the capital of France?",
        actual_output="Paris",
        expected_output="Paris"
    )
]

success = client.append_dataset(
    alias="qa_dataset_v1",
    examples=examples,
    project_name="question_answering"
)
Response
True
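To confirm the append, the dataset can be pulled again and its example count inspected. A minimal sketch, assuming the append above succeeded:
# Pull the dataset again and check how many examples it now holds.
updated = client.pull_dataset(
    alias="qa_dataset_v1",
    project_name="question_answering"
)
print(f"Dataset now has {len(updated.examples)} examples")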
client.assert_test()
Runs evaluations as unit tests, raising an exception if the score falls below the defined threshold.
Parameters
Parameter | Type | Default | Description
---|---|---|---
examples | List[Example] | | The examples to evaluate against your AI model, e.g. [Example(...)]
scorers | List[APIJudgmentScorer] | | List of scorers to use for evaluation, e.g. [APIJudgmentScorer(...)]
model | str | gpt-4.1 | Model used as judge when using LLM as a Judge, e.g. "gpt-4o-mini"
project_name | str | default_project | Name of the project for organization, e.g. "my_qa_project"
eval_run_name | str | default_eval_run | Unique name for this evaluation run, e.g. "experiment_v1"
override | bool | False | Whether to override an existing evaluation run with the same name
append | bool | False | Whether to append to an existing evaluation run with the same name
async_execution | bool | False | Whether to execute the evaluation asynchronously
Example Code
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer
client = JudgmentClient()
example = Example(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    retrieval_context=["All customers are eligible for a 44 day full refund at no extra cost."],
)

scorer = FaithfulnessScorer(threshold=0.5)

client.assert_test(
    examples=[example],
    scorers=[scorer],
)
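Because assert_test raises an exception when a score falls below its threshold, it can be used directly inside a test runner. A minimal sketch of a pytest-style test, where pytest and the test function name are assumptions rather than part of the SDK:
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

client = JudgmentClient()

def test_refund_policy_is_faithful():
    # Hypothetical pytest test: the retrieval context here matches the answer,
    # so the faithfulness score is expected to clear the 0.5 threshold.
    example = Example(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra cost.",
        retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
    )
    # assert_test raises on a failing score, which pytest reports as a test failure.
    client.assert_test(
        examples=[example],
        scorers=[FaithfulnessScorer(threshold=0.5)],
    )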
client.assert_trace_test()
Runs trace-based evaluations as unit tests, raising an exception if the score falls below the defined threshold.
Parameters
Parameter | Type | Default | Description
---|---|---|---
scorers | List[APIJudgmentScorer] | | List of scorers to use for evaluation, e.g. [APIJudgmentScorer(...)]
examples | List[Example] | | Examples to run through the function (required if using function), e.g. [Example(...)]
function | Callable | | Function to execute and trace for evaluation
tracer | Union[Tracer, BaseCallbackHandler] | | The tracer object used in tracing your agent
traces | List[Trace] | | Pre-existing traces to evaluate instead of generating new ones
project_name | str | default_project | Name of the project for organization, e.g. "agent_evaluation"
eval_run_name | str | default_eval_run | Unique name for this trace evaluation run, e.g. "agent_trace_v1"
override | bool | False | Whether to override an existing evaluation run with the same name
append | bool | False | Whether to append to an existing evaluation run with the same name
Example Code
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import ToolOrderScorer
from judgeval.tracer import Tracer

client = JudgmentClient()
tracer = Tracer()

def my_agent_function(query: str) -> str:
    """Your agent function to be traced and evaluated"""
    response = f"Processing query: {query}"
    return response

examples = [
    Example(
        input={"query": "What is the weather like?"},
        expected_output="I'll help you check the weather."
    )
]

results = client.assert_trace_test(
    scorers=[ToolOrderScorer()],
    examples=examples,
    function=my_agent_function,
    tracer=tracer,
    project_name="agent_evaluation"
)
Error Handling
The JudgmentClient raises specific exceptions for different error conditions:
Exception | Description
---|---
JudgmentAPIError | API request failures or server errors
ValueError | Invalid parameters or configuration
FileNotFoundError | Missing test files or datasets
from judgeval.common.exceptions import JudgmentAPIError

try:
    results = client.run_evaluation(examples=examples, scorers=scorers)
except JudgmentAPIError as e:
    print(f"API Error: {e}")
except ValueError as e:
    print(f"Invalid parameters: {e}")
except FileNotFoundError as e:
    print(f"Missing file or dataset: {e}")