Judgeval Python SDK

JudgmentClient

Run evaluations with the JudgmentClient class to test for regressions and run A/B tests on your agents.

The JudgmentClient is your primary interface for interacting with the Judgment platform. It provides methods for running evaluations, managing datasets, handling traces, and more.

from judgeval import JudgmentClient
import os
from dotenv import load_dotenv

load_dotenv()  # Load environment variables from .env file

# Automatically uses JUDGMENT_API_KEY and JUDGMENT_ORG_ID from environment
client = JudgmentClient()

# Or manually pass in API key and Organization ID
client = JudgmentClient(
    api_key=os.getenv("JUDGMENT_API_KEY"),
    organization_id=os.getenv("JUDGMENT_ORG_ID")
)

Authentication

Set up your credentials using environment variables:

export JUDGMENT_API_KEY="your_key_here"
export JUDGMENT_ORG_ID="your_org_id_here"

Or add them to your .env file:

JUDGMENT_API_KEY="your_key_here"
JUDGMENT_ORG_ID="your_org_id_here"

JudgmentClient()

Initialize a JudgmentClient object.

JudgmentClient(
  api_key: str,
  organization_id: str
)

Parameters

api_key: str (required)

Your Judgment API key.

Recommended: set with the JUDGMENT_API_KEY environment variable.

organization_id: str (required)

Your organization ID.

Recommended: set with the JUDGMENT_ORG_ID environment variable.
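
If you prefer to fail fast when credentials are missing, a small guard before constructing the client can help. A minimal sketch using only the standard library (nothing Judgeval-specific is assumed):

import os

# Fail fast with a clear message if either credential is missing.
missing = [v for v in ("JUDGMENT_API_KEY", "JUDGMENT_ORG_ID") if not os.getenv(v)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")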


run_evaluation()

Execute an evaluation of examples using one or more scorers to measure the performance and quality of your agents.

client.run_evaluation(
    examples: List[Example],
    scorers: List[ExampleScorer],
    model: str = "gpt-5",
    project_name: str = "default_project",
    eval_run_name: str = "default_eval_run",
    assert_test: bool = False,
)

Parameters

examples: List[Example] (required)

List of Example objects (or any class inheriting from Example) containing the inputs, outputs, and metadata to evaluate.

scorers: List[ExampleScorer] (required)

List of scorers to use for evaluation, such as PromptScorer, CustomScorer, or any custom-defined ExampleScorer.

model: str

Model used as the judge for LLM-as-a-judge scorers.

Default: "gpt-5"

project_name: str

Name of the project, used to organize evaluation runs.

Default: "default_project"

eval_run_name: str

Name for the evaluation run.

Default: "default_eval_run"

assert_test: bool

Runs the evaluation as a unit test, raising an exception if any score falls below its scorer's threshold (see the pytest sketch below).

Default: False
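
Because a failed assertion surfaces as an ordinary exception, assert_test pairs naturally with a test runner such as pytest. A sketch (the file name and test function are illustrative; client, example, and ResolutionScorer are defined as in the Example section below):

test_resolution.py
def test_resolution_quality():
    # assert_test=True raises if any scorer falls below its threshold,
    # which pytest reports as a test failure.
    client.run_evaluation(
        examples=[example],
        scorers=[ResolutionScorer()],
        assert_test=True,
    )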

Returns

A list of ScoringResult objects. See Return Types for detailed structure.

Example

resolution.py
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer

client = JudgmentClient()

class CustomerRequest(Example):
    request: str
    response: str

class ResolutionScorer(ExampleScorer):
    name: str = "Resolution Scorer"

    async def a_score_example(self, example: CustomerRequest):
        # Replace this logic with your own scoring logic
        if "package" in example.response:
            self.reason = "The response contains the word 'package'"
            return 1
        else:
            self.reason = "The response does not contain the word 'package'"
            return 0

example = CustomerRequest(request="Where is my package?", response="Your package will arrive tomorrow at 10:00 AM.")

res = client.run_evaluation(
    examples=[example],
    scorers=[ResolutionScorer()],
    project_name="default_project",
)

# Example with a failing test using assert_test=True
# This will raise an error because the response does not contain the word "package"
try:
    example = CustomerRequest(request="Where is my package?", response="Empty response.")
    client.run_evaluation(
        examples=[example],
        scorers=[ResolutionScorer()],
        project_name="default_project",
        assert_test=True,  # This will raise an error if any test fails
    )
except Exception as e:
    print(f"Test assertion failed: {e}")

Return Types

ScoringResult

The ScoringResult object contains the evaluation output of one or more scorers applied to a single example.

success: bool

Whether all scorers applied to this example succeeded.

scorers_data: List[ScorerData]

Individual scorer results and metadata.

data_object: Example

The original example object that was evaluated.

run_duration: Optional[float]

Time taken to complete the evaluation.

trace_id: Optional[str]

Associated trace ID for trace-based evaluations.

evaluation_cost: Optional[float]

Cost of the evaluation in USD.
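
These fields make it straightforward to aggregate statistics across a run. A minimal sketch, assuming results is the list returned by run_evaluation:

# Aggregate statistics across a run (`results` comes from
# client.run_evaluation as in the examples below).
total_cost = sum(r.evaluation_cost or 0.0 for r in results)
pass_rate = sum(1 for r in results if r.success) / len(results)
print(f"Pass rate: {pass_rate:.0%}, total cost: ${total_cost:.4f}")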

ScorerData

Each ScorerData object within scorers_data contains the results from an individual scorer:

name: str

Name of the scorer.

threshold: float

Threshold used for the pass/fail determination.

success: bool

Whether this scorer passed its threshold.

score: Optional[float]

Numerical score from the scorer.

reason: Optional[str]

Explanation for the score/decision.

evaluation_model: Optional[Union[List[str], str]]

Model(s) used for evaluation.

error: Optional[str]

Error message if scoring failed.

accessing_results.py
# Example of accessing ScoringResult data; `examples` and `scorers`
# are defined as in the example above.
results = client.run_evaluation(examples=examples, scorers=scorers)

for result in results:
    print(f"Overall success: {result.success}")
    print(f"Example input: {result.data_object.request}")

    for scorer_data in result.scorers_data:
        print(f"Scorer '{scorer_data.name}': {scorer_data.score} (threshold: {scorer_data.threshold})")
        if scorer_data.reason:
            print(f"Reason: {scorer_data.reason}")

Error Handling

The JudgmentClient raises specific exceptions for different error conditions:

JudgmentAPIError: Exception

Raised when API requests fail or server errors occur.

ValueError: Exception

Raised when invalid parameters or configuration are provided.

FileNotFoundError: Exception

Raised when test files or datasets are missing.

error_handling.py
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer
from judgeval.exceptions import JudgmentAPIError

client = JudgmentClient()

class CustomerRequest(Example):
    request: str
    response: str

example = CustomerRequest(request="Where is my package?", response="Your package will arrive tomorrow at 10:00 AM.")

class ResolutionScorer(ExampleScorer):
    name: str = "Resolution Scorer"

    async def a_score_example(self, example: CustomerRequest):
        # Replace this logic with your own scoring logic
        if "package" in example.response:
            self.reason = "The response contains the word 'package'"
            return 1
        else:
            self.reason = "The response does not contain the word 'package'"
            return 0

try:
    res = client.run_evaluation(
        examples=[example],
        scorers=[ResolutionScorer()],
        project_name="default_project",
    )
except JudgmentAPIError as e:
    print(f"API Error: {e}")
except ValueError as e:
    print(f"Invalid parameters: {e}")
except FileNotFoundError as e:
    print(f"File not found: {e}")