Faithfulness

The Faithfulness scorer is a default LLM judge scorer that measures how factually aligned the actual_output is with the retrieval_context.

This scorer is useful for determining the degree to which your agent's responses contain hallucinations.

Required Fields

To run the Faithfulness scorer, you must include the following fields in your Example:

input
actual_output
retrieval_context

Scorer Breakdown

Faithfulness scores are calculated by first extracting all statements in actual_output and then classifying which ones are contradicted by the retrieval_context. A statement is considered faithful if it does not contradict any information in retrieval_context, so a statement that the context neither supports nor contradicts still counts as faithful.

The score is calculated as:

$$
\text{Faithfulness} = \frac{\text{Number of Faithful Statements}}{\text{Total Number of Statements}}
$$
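To make the ratio concrete, here is a minimal sketch of the arithmetic only. It does not call judgeval or an LLM: the `faithfulness_score` helper and the hard-coded verdicts are hypothetical, standing in for the judge's per-statement classification. Treating a score at or above the threshold as a pass, and raising an error when there are no statements, are assumptions of this sketch rather than documented behavior.

```python
from typing import Sequence


def faithfulness_score(verdicts: Sequence[bool]) -> float:
    """Return the fraction of statements that are faithful.

    verdicts[i] is True when statement i is NOT contradicted by the
    retrieval context (hypothetical stand-in for the judge's output).
    """
    if not verdicts:
        # Assumption of this sketch: an empty statement list is an error.
        raise ValueError("at least one statement is required")
    return sum(verdicts) / len(verdicts)


# Hypothetical verdicts for four extracted statements: one is contradicted.
verdicts = [True, True, False, True]
score = faithfulness_score(verdicts)

# Assumption: an example passes when its score meets the threshold.
threshold = 0.8
passed = score >= threshold

print(score, passed)
```

With three faithful statements out of four, the score is 0.75, which falls below a threshold of 0.8 under the pass rule assumed here.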

Sample Implementation

faithfulness.py

from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

client = JudgmentClient()

# Build one example containing the three required fields.
example = Example(
    input="What's your return policy for a pair of socks?",
    actual_output="We offer a 30-day return policy for all items, including socks!",
    retrieval_context=["Return policy, all items: 30-day limit for full refund, no questions asked."]
)

# Score faithfulness against a threshold of 0.8.
scorer = FaithfulnessScorer(threshold=0.8)

# Run the evaluation with gpt-4.1 as the model.
results = client.run_evaluation(
    examples=[example],
    scorers=[scorer],
    model="gpt-4.1",
)
print(results)

The Faithfulness scorer uses an LLM judge, so you'll receive a reason for the score in the reason field of the results. This allows you to double-check the accuracy of the evaluation and understand how the score was calculated.
