JudgmentClient
Run evaluations with the JudgmentClient class to test for regressions and run A/B tests on your agents.
The JudgmentClient is your primary interface for interacting with the Judgment platform. It provides methods for running evaluations, managing datasets, handling traces, and more.
from judgeval import JudgmentClient
import os
from dotenv import load_dotenv
load_dotenv() # Load environment variables from .env file
# Automatically uses JUDGMENT_API_KEY and JUDGMENT_ORG_ID from environment
client = JudgmentClient()
# Or manually pass in API key and Organization ID
client = JudgmentClient(
api_key=os.getenv('JUDGMENT_API_KEY'),
organization_id=os.getenv("JUDGMENT_ORG_ID")
)
Authentication
Set up your credentials using environment variables:
export JUDGMENT_API_KEY="your_key_here"
export JUDGMENT_ORG_ID="your_org_id_here"
# Or add to your .env file
JUDGMENT_API_KEY="your_key_here"
JUDGMENT_ORG_ID="your_org_id_here"
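To fail fast on missing credentials, you can check the variables yourself before constructing the client. This is a minimal sketch using only the standard library; the client reads the same variables automatically.
import os
# Purely illustrative: verify credentials are present before creating the client
missing = [var for var in ("JUDGMENT_API_KEY", "JUDGMENT_ORG_ID") if not os.getenv(var)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")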
JudgmentClient()
Initialize a JudgmentClient object.
JudgmentClient(
api_key: str,
organization_id: str
)
Parameters
api_key
Your Judgment API key. Recommended: set it with the JUDGMENT_API_KEY environment variable.
organization_id
Your organization ID. Recommended: set it with the JUDGMENT_ORG_ID environment variable.
run_evaluation()
Execute an evaluation of examples using one or more scorers to measure the performance and quality of your agents.
client.run_evaluation(
examples: List[Example],
scorers: List[ExampleScorer],
model: str = "gpt-5",
project_name: str = "default_project",
eval_run_name: str = "default_eval_run",
assert_test: bool = False,
)
Parameters
examples
List of Example objects (or any class inheriting from Example) containing the inputs, outputs, and metadata to evaluate your agents against.
scorers
List of scorers to use for evaluation, such as PromptScorer, CustomScorer, or any custom-defined ExampleScorer.
model
Model used as the judge when using LLM-as-a-judge scorers.
project_name
Name of the project this evaluation run is logged under.
eval_run_name
Name used to identify this evaluation run.
assert_test
Runs the evaluation as a unit test, raising an exception if any score falls below the scorer's threshold.
Returns
A list of ScoringResult objects. See Return Types for detailed structure.
Example
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer
client = JudgmentClient()
class CustomerRequest(Example):
request: str
response: str
class ResolutionScorer(ExampleScorer):
name: str = "Resolution Scorer"
async def a_score_example(self, example: CustomerRequest):
# Replace this logic with your own scoring logic
if "package" in example.response:
self.reason = "The response contains the word 'package'"
return 1
else:
self.reason = "The response does not contain the word 'package'"
return 0
example = CustomerRequest(request="Where is my package?", response="Your package will arrive tomorrow at 10:00 AM.")
res = client.run_evaluation(
examples=[example],
scorers=[ResolutionScorer()],
project_name="default_project",
)
# Example with a failing test using assert_test=True
# This will raise an error because the response does not contain the word "package"
try:
example = CustomerRequest(request="Where is my package?", response="Empty response.")
client.run_evaluation(
examples=[example],
scorers=[ResolutionScorer()],
project_name="default_project",
assert_test=True, # This will raise an error if any test fails
)
except Exception as e:
print(f"Test assertion failed: {e}")Return Types
Return Types
ScoringResult
The ScoringResult object contains the evaluation output of one or more scorers applied to a single example.
success
Whether all scorers applied to this example succeeded.
data_object
The original example object that was evaluated.
scorers_data
The list of ScorerData results, one per scorer applied to the example.
trace_id
Associated trace ID for trace-based evaluations.
ScorerData
Each ScorerData object within scorers_data contains the results from an individual scorer:
# Example of accessing ScoringResult data
results = client.run_evaluation(examples, scorers)
for result in results:
print(f"Overall success: {result.success}")
print(f"Example input: {result.data_object.input}")
for scorer_data in result.scorers_data:
print(f"Scorer '{scorer_data.name}': {scorer_data.score} (threshold: {scorer_data.threshold})")
if scorer_data.reason:
print(f"Reason: {scorer_data.reason}")Error Handling
Error Handling
The JudgmentClient raises specific exceptions for different error conditions:
JudgmentAPIError
Raised when API requests fail or server errors occur.
ValueError
Raised when invalid parameters or configuration are provided.
FileNotFoundError
Raised when test files or datasets are missing.
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer
from judgeval.exceptions import JudgmentAPIError
client = JudgmentClient()
class CustomerRequest(Example):
request: str
response: str
example = CustomerRequest(request="Where is my package?", response="Your package will arrive tomorrow at 10:00 AM.")
class ResolutionScorer(ExampleScorer):
name: str = "Resolution Scorer"
async def a_score_example(self, example: CustomerRequest):
# Replace this logic with your own scoring logic
if "package" in example.response:
self.reason = "The response contains the word 'package'"
return 1
else:
self.reason = "The response does not contain the word 'package'"
return 0
try:
res = client.run_evaluation(
examples=[example],
scorers=[ResolutionScorer()],
project_name="default_project",
)
except JudgmentAPIError as e:
print(f"API Error: {e}")
except ValueError as e:
print(f"Invalid parameters: {e}")
except FileNotFoundError as e:
print(f"File not found: {e}")