Judgeval Python SDK

JudgmentClient

Run evaluations with the JudgmentClient class to test for regressions and run A/B tests on your agents.

The JudgmentClient is your primary interface for interacting with the Judgment platform. It provides methods for running evaluations, managing datasets, handling traces, and more.

from judgeval import JudgmentClient
import os
from dotenv import load_dotenv

load_dotenv()  # Load environment variables from .env file

# Automatically uses JUDGMENT_API_KEY and JUDGMENT_ORG_ID from environment
client = JudgmentClient()

# Or manually pass in API key and Organization ID
client = JudgmentClient(
    api_key=os.getenv("JUDGMENT_API_KEY"),
    organization_id=os.getenv("JUDGMENT_ORG_ID")
)

Authentication

Set up your credentials using environment variables:

export JUDGMENT_API_KEY="your_key_here"
export JUDGMENT_ORG_ID="your_org_id_here"

Or add them to your .env file:

JUDGMENT_API_KEY="your_key_here"
JUDGMENT_ORG_ID="your_org_id_here"

JudgmentClient()

Initialize a JudgmentClient object.

JudgmentClient(
  api_key: str,
  organization_id: str
)

Parameters

api_key: str (required)

Your Judgment API key.

Recommended: set with the JUDGMENT_API_KEY environment variable.

organization_id: str (required)

Your organization ID.

Recommended: set with the JUDGMENT_ORG_ID environment variable.
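
If you prefer to fail fast when credentials are missing, a small guard before constructing the client can help. A minimal sketch using only the standard library (nothing Judgeval-specific is assumed):

import os

# Fail fast with a clear message if either credential is missing.
missing = [v for v in ("JUDGMENT_API_KEY", "JUDGMENT_ORG_ID") if not os.getenv(v)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")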


run_evaluation()

Execute an evaluation of examples using one or more scorers to measure the performance and quality of your agents.

client.run_evaluation(
    examples: List[Example],
    scorers: List[ExampleScorer],
    model: str = "gpt-5",
    project_name: str = "default_project",
    eval_run_name: str = "default_eval_run",
    assert_test: bool = False,
)

Parameters

examples: List[Example] (required)

List of Example objects (or any class inheriting from Example) containing the inputs, outputs, and metadata to evaluate.

scorers: List[ExampleScorer] (required)

List of scorers to use for evaluation, such as PromptScorer, CustomScorer, or any custom-defined ExampleScorer.

model: str

Model used as the judge for LLM-as-a-judge scorers.

Default: "gpt-5"

project_name: str

Name of the project, used to organize evaluation runs.

Default: "default_project"

eval_run_name: str

Name for the evaluation run.

Default: "default_eval_run"

assert_test: bool

Runs the evaluation as a unit test, raising an exception if any score falls below its scorer's threshold (see the pytest sketch below).

Default: False
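
Because a failed assertion surfaces as an ordinary exception, assert_test pairs naturally with a test runner such as pytest. A sketch (the file name and test function are illustrative; client, example, and ResolutionScorer are defined as in the Example section below):

test_resolution.py
def test_resolution_quality():
    # assert_test=True raises if any scorer falls below its threshold,
    # which pytest reports as a test failure.
    client.run_evaluation(
        examples=[example],
        scorers=[ResolutionScorer()],
        assert_test=True,
    )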

Returns

A list of ScoringResult objects. See Return Types for detailed structure.

Example

resolution.py
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer

client = JudgmentClient()

class CustomerRequest(Example):
    request: str
    response: str

class ResolutionScorer(ExampleScorer):
    name: str = "Resolution Scorer"

    async def a_score_example(self, example: CustomerRequest):
        # Replace this logic with your own scoring logic
        if "package" in example.response:
            self.reason = "The response contains the word 'package'"
            return 1
        else:
            self.reason = "The response does not contain the word 'package'"
            return 0

example = CustomerRequest(request="Where is my package?", response="Your package will arrive tomorrow at 10:00 AM.")

res = client.run_evaluation(
    examples=[example],
    scorers=[ResolutionScorer()],
    project_name="default_project",
)

# Example with a failing test using assert_test=True
# This will raise an error because the response does not contain the word "package"
try:
    example = CustomerRequest(request="Where is my package?", response="Empty response.")
    client.run_evaluation(
        examples=[example],
        scorers=[ResolutionScorer()],
        project_name="default_project",
        assert_test=True,  # This will raise an error if any test fails
    )
except Exception as e:
    print(f"Test assertion failed: {e}")

Return Types

ScoringResult

The ScoringResult object contains the evaluation output of one or more scorers applied to a single example.

success: bool

Whether all scorers applied to this example succeeded.

scorers_data: List[ScorerData]

Individual scorer results and metadata.

data_object: Example

The original example object that was evaluated.

run_duration: Optional[float]

Time taken to complete the evaluation.

trace_id: Optional[str]

Associated trace ID for trace-based evaluations.

evaluation_cost: Optional[float]

Cost of the evaluation in USD.
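
These fields make it straightforward to aggregate statistics across a run. A minimal sketch, assuming results is the list returned by run_evaluation:

# Aggregate statistics across a run (`results` comes from
# client.run_evaluation as in the examples below).
total_cost = sum(r.evaluation_cost or 0.0 for r in results)
pass_rate = sum(1 for r in results if r.success) / len(results)
print(f"Pass rate: {pass_rate:.0%}, total cost: ${total_cost:.4f}")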

ScorerData

Each ScorerData object within scorers_data contains the results from an individual scorer:

name: str

Name of the scorer.

threshold: float

Threshold used for the pass/fail determination.

success: bool

Whether this scorer passed its threshold.

score: Optional[float]

Numerical score from the scorer.

reason: Optional[str]

Explanation for the score/decision.

evaluation_model: Optional[Union[List[str], str]]

Model(s) used for evaluation.

error: Optional[str]

Error message if scoring failed.

accessing_results.py
# Example of accessing ScoringResult data; `examples` and `scorers`
# are defined as in the example above.
results = client.run_evaluation(examples=examples, scorers=scorers)

for result in results:
    print(f"Overall success: {result.success}")
    print(f"Example input: {result.data_object.request}")

    for scorer_data in result.scorers_data:
        print(f"Scorer '{scorer_data.name}': {scorer_data.score} (threshold: {scorer_data.threshold})")
        if scorer_data.reason:
            print(f"Reason: {scorer_data.reason}")

Error Handling

The JudgmentClient raises specific exceptions for different error conditions:

JudgmentAPIError: Exception

Raised when API requests fail or server errors occur.

ValueError: Exception

Raised when invalid parameters or configuration are provided.

FileNotFoundError: Exception

Raised when test files or datasets are missing.

error_handling.py
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer
from judgeval.exceptions import JudgmentAPIError

client = JudgmentClient()

class CustomerRequest(Example):
    request: str
    response: str

example = CustomerRequest(request="Where is my package?", response="Your package will arrive tomorrow at 10:00 AM.")

class ResolutionScorer(ExampleScorer):
    name: str = "Resolution Scorer"

    async def a_score_example(self, example: CustomerRequest):
        # Replace this logic with your own scoring logic
        if "package" in example.response:
            self.reason = "The response contains the word 'package'"
            return 1
        else:
            self.reason = "The response does not contain the word 'package'"
            return 0

try:
    res = client.run_evaluation(
        examples=[example],
        scorers=[ResolutionScorer()],
        project_name="default_project",
    )
except JudgmentAPIError as e:
    print(f"API Error: {e}")
except ValueError as e:
    print(f"Invalid parameters: {e}")
except FileNotFoundError as e:
    print(f"File not found: {e}")