JudgmentClient API Reference
Complete reference for the JudgmentClient Python SDK
The JudgmentClient is your primary interface for interacting with the Judgment platform. It provides methods for running evaluations, managing datasets, handling traces, and more.
Authentication
Set up your credentials using environment variables:
export JUDGMENT_API_KEY="your_api_key_here"
export JUDGMENT_ORG_ID="your_organization_id_here"
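With JUDGMENT_API_KEY and JUDGMENT_ORG_ID exported, the examples in this reference construct the client with no arguments; a minimal sketch:
from judgeval import JudgmentClient

# Credentials are picked up from JUDGMENT_API_KEY and JUDGMENT_ORG_ID.
client = JudgmentClient()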
client.run_evaluation()
Execute an evaluation of examples using one or more scorers to measure performance and quality of your AI models.
Parameters
Parameter | Type | Default | Description
---|---|---|---
examples | List[Example] | | The examples to evaluate against your AI model, e.g. [Example(...)]
scorers | List[APIJudgmentScorer] | | List of scorers to use for evaluation, e.g. [APIJudgmentScorer(...)]
model | str | gpt-4.1 | Model used as judge when using LLM as a Judge, e.g. "gpt-4o-mini"
project_name | str | default_project | Name of the project for organization, e.g. "my_qa_project"
eval_run_name | str | default_eval_run | Unique name for this evaluation run, e.g. "experiment_v1"
override | bool | False | Whether to override an existing evaluation run with the same name
append | bool | False | Whether to append to an existing evaluation run with the same name
async_execution | bool | False | Whether to execute the evaluation asynchronously
Returns
A list of ScoringResult objects.
Example Code
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import AnswerRelevancyScorer

client = JudgmentClient()

examples = [
    Example(
        input="What is the capital of France?",
        actual_output="Paris is the capital of France.",
        expected_output="Paris"
    )
]

results = client.run_evaluation(
    examples=examples,
    scorers=[AnswerRelevancyScorer(threshold=0.9)],
    project_name="geography_qa"
)
Response
[
    ScoringResult(
        success=False,
        scorers_data=[ScorerData(...)],
        name=None,
        data_object=Example(...),
        trace_id=None,
        run_duration=None,
        evaluation_cost=None
    )
]
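Each ScoringResult carries the fields shown in the sample response above, so a run can be inspected programmatically; a minimal sketch, assuming the results list from the example above:
# Iterate over the ScoringResult objects returned by run_evaluation.
for result in results:
    # Each result exposes success, scorers_data, data_object, etc. (see sample response).
    print(f"Passed: {result.success}")
    # scorers_data holds the per-scorer ScorerData details for this example.
    for scorer_data in result.scorers_data:
        print(scorer_data)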
client.run_trace_evaluation()
Execute trace-based evaluation using function calls and tracing to evaluate agent behavior and execution flows.
Parameters
Parameter | Type | Default | Description
---|---|---|---
scorers | List[APIJudgmentScorer] | | List of scorers to use for evaluation, e.g. [APIJudgmentScorer(...)]
examples | List[Example] | | Examples to run through the function (required if using function), e.g. [Example(...)]
function | Callable | | Function to execute and trace for evaluation
tracer | Union[Tracer, BaseCallbackHandler] | | The tracer object used in tracing your agent
traces | List[Trace] | | Pre-existing traces to evaluate instead of generating new ones
project_name | str | default_project | Name of the project for organization, e.g. "agent_evaluation"
eval_run_name | str | default_eval_run | Unique name for this trace evaluation run, e.g. "agent_trace_v1"
override | bool | False | Whether to override an existing evaluation run with the same name
append | bool | False | Whether to append to an existing evaluation run with the same name
Returns
A list of ScoringResult objects.
Example Code
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import ToolOrderScorer
from judgeval.tracer import Tracer

client = JudgmentClient()
tracer = Tracer()

def my_agent_function(query: str) -> str:
    """Your agent function to be traced and evaluated"""
    response = f"Processing query: {query}"
    return response

examples = [
    Example(
        input={"query": "What is the weather like?"},
        expected_output="I'll help you check the weather."
    )
]

results = client.run_trace_evaluation(
    scorers=[ToolOrderScorer()],
    examples=examples,
    function=my_agent_function,
    tracer=tracer,
    project_name="agent_evaluation"
)
Response
[
    ScoringResult(
        success=False,
        scorers_data=[ScorerData(...)],
        name=None,
        data_object=Example(...),
        trace_id=None,
        run_duration=None,
        evaluation_cost=None
    )
]
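If traces have already been collected, they can be scored directly through the traces parameter instead of re-running the agent through function and examples. A minimal sketch, assuming existing_traces is a List[Trace] you captured earlier:
from judgeval.scorers import ToolOrderScorer

# existing_traces: List[Trace] collected from earlier agent runs (assumed to exist).
results = client.run_trace_evaluation(
    scorers=[ToolOrderScorer()],
    traces=existing_traces,
    project_name="agent_evaluation",
    eval_run_name="agent_trace_replay"
)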
client.create_dataset()
Create a new evaluation dataset for storage and reuse across multiple evaluation runs.
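A typical flow is to create an empty dataset, add examples locally, and then store it with client.push_dataset() (covered next). A minimal sketch, mirroring the add_examples usage shown in the push_dataset example below:
from judgeval.data import Example

# Create an empty dataset and populate it locally before pushing it.
dataset = client.create_dataset()
dataset.add_examples([
    Example(
        input="What is machine learning?",
        actual_output="Machine learning is a subset of AI...",
        expected_output="Machine learning is a method of data analysis..."
    )
])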
client.push_dataset()
Upload an evaluation dataset to the Judgment platform for storage and reuse across multiple evaluation runs.
Parameters
Parameter | Type | Default | Description
---|---|---|---
alias | str | | Unique name for the dataset within the project, e.g. "qa_dataset_v1"
dataset | EvalDataset | | Dataset object containing examples and metadata
project_name | str | | Project name where the dataset will be stored, e.g. "question_answering"
overwrite | bool | False | Whether to overwrite existing dataset with same alias
Returns
Returns True if successful.
Example Code
from judgeval.data import Example

dataset = client.create_dataset()
dataset.add_examples([
    Example(
        input="What is machine learning?",
        actual_output="Machine learning is a subset of AI...",
        expected_output="Machine learning is a method of data analysis..."
    )
])

success = client.push_dataset(
    alias="ml_qa_dataset_v2",
    dataset=dataset,
    project_name="machine_learning_qa",
    overwrite=True
)
Response
True
client.pull_dataset()
Retrieve a saved dataset from the Judgment platform to use in evaluations or analysis.
Parameters
Parameter | Type | Default | Description
---|---|---|---
alias | str | | The alias of the dataset to retrieve, e.g. "qa_dataset_v1"
project_name | str | | Project name where the dataset is stored, e.g. "question_answering"
Returns
An EvalDataset object.
Example Code
dataset = client.pull_dataset(
    alias="qa_dataset_v1",
    project_name="question_answering"
)

print(f"Dataset has {len(dataset.examples)} examples")

# my_scorers: your list of scorer instances, e.g. [AnswerRelevancyScorer(threshold=0.9)]
results = client.run_evaluation(
    examples=dataset.examples,
    scorers=my_scorers,
    project_name="question_answering"
)
Response
EvalDataset(
    examples=[
        Example(
            input="What is the capital of France?",
            actual_output="Paris",
            expected_output="Paris"
        )
    ],
    metadata={
        "created_at": "2024-01-15T10:30:00Z",
        "examples_count": 1
    }
)
client.append_dataset()
Append examples to an existing dataset.
Parameters
Parameter | Type | Default | Description
---|---|---|---
alias | str | | Unique name for the dataset within the project, e.g. "qa_dataset_v1"
examples | List[Example] | | List of examples to append to the dataset, e.g. [Example(...)]
project_name | str | | Project name where the dataset is stored, e.g. "question_answering"
Returns
Returns True if successful.
Example Code
from judgeval.data import Example

# Pull the existing dataset (optional: confirms it exists before appending).
dataset = client.pull_dataset(
    alias="qa_dataset_v1",
    project_name="question_answering"
)

examples = [
    Example(
        input="What is the capital of France?",
        actual_output="Paris",
        expected_output="Paris"
    )
]

success = client.append_dataset(
    alias="qa_dataset_v1",
    examples=examples,
    project_name="question_answering"
)
Response
True
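To confirm the append, the dataset can be pulled again and its example count inspected. A minimal sketch, assuming the append above succeeded:
# Pull the dataset again and check how many examples it now holds.
updated = client.pull_dataset(
    alias="qa_dataset_v1",
    project_name="question_answering"
)
print(f"Dataset now has {len(updated.examples)} examples")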
client.assert_test()
Runs evaluations as unit tests, raising an exception if the score falls below the defined threshold.
Parameters
Parameter | Type | Default | Description
---|---|---|---
examples | List[Example] | | The examples to evaluate against your AI model, e.g. [Example(...)]
scorers | List[APIJudgmentScorer] | | List of scorers to use for evaluation, e.g. [APIJudgmentScorer(...)]
model | str | gpt-4.1 | Model used as judge when using LLM as a Judge, e.g. "gpt-4o-mini"
project_name | str | default_project | Name of the project for organization, e.g. "my_qa_project"
eval_run_name | str | default_eval_run | Unique name for this evaluation run, e.g. "experiment_v1"
override | bool | False | Whether to override an existing evaluation run with the same name
append | bool | False | Whether to append to an existing evaluation run with the same name
async_execution | bool | False | Whether to execute the evaluation asynchronously
Example Code
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer
client = JudgmentClient()
example = Example(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    retrieval_context=["All customers are eligible for a 44 day full refund at no extra cost."],
)

scorer = FaithfulnessScorer(threshold=0.5)

client.assert_test(
    examples=[example],
    scorers=[scorer],
)
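Because assert_test raises an exception when a score falls below its threshold, it can be used directly inside a test runner. A minimal sketch of a pytest-style test, where pytest and the test function name are assumptions rather than part of the SDK:
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

client = JudgmentClient()

def test_refund_policy_is_faithful():
    # Hypothetical pytest test: the retrieval context here matches the answer,
    # so the faithfulness score is expected to clear the 0.5 threshold.
    example = Example(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra cost.",
        retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
    )
    # assert_test raises on a failing score, which pytest reports as a test failure.
    client.assert_test(
        examples=[example],
        scorers=[FaithfulnessScorer(threshold=0.5)],
    )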
client.assert_trace_test()
Runs trace-based evaluations as unit tests, raising an exception if the score falls below the defined threshold.
Parameters
Parameter | Type | Default | Description
---|---|---|---
scorers | List[APIJudgmentScorer] | | List of scorers to use for evaluation, e.g. [APIJudgmentScorer(...)]
examples | List[Example] | | Examples to run through the function (required if using function), e.g. [Example(...)]
function | Callable | | Function to execute and trace for evaluation
tracer | Union[Tracer, BaseCallbackHandler] | | The tracer object used in tracing your agent
traces | List[Trace] | | Pre-existing traces to evaluate instead of generating new ones
project_name | str | default_project | Name of the project for organization, e.g. "agent_evaluation"
eval_run_name | str | default_eval_run | Unique name for this trace evaluation run, e.g. "agent_trace_v1"
override | bool | False | Whether to override an existing evaluation run with the same name
append | bool | False | Whether to append to an existing evaluation run with the same name
Example Code
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import ToolOrderScorer
from judgeval.tracer import Tracer

client = JudgmentClient()
tracer = Tracer()

def my_agent_function(query: str) -> str:
    """Your agent function to be traced and evaluated"""
    response = f"Processing query: {query}"
    return response

examples = [
    Example(
        input={"query": "What is the weather like?"},
        expected_output="I'll help you check the weather."
    )
]

results = client.assert_trace_test(
    scorers=[ToolOrderScorer()],
    examples=examples,
    function=my_agent_function,
    tracer=tracer,
    project_name="agent_evaluation"
)
Error Handling
The JudgmentClient raises specific exceptions for different error conditions:
Exception | Description
---|---
JudgmentAPIError | API request failures or server errors
ValueError | Invalid parameters or configuration
FileNotFoundError | Missing test files or datasets
from judgeval.common.exceptions import JudgmentAPIError

try:
    results = client.run_evaluation(examples=examples, scorers=scorers)
except JudgmentAPIError as e:
    print(f"API Error: {e}")
except ValueError as e:
    print(f"Invalid parameters: {e}")
except FileNotFoundError as e:
    print(f"Missing file or dataset: {e}")