Faithfulness
The Faithfulness scorer is a default LLM judge scorer that measures how factually aligned the actual_output is with the retrieval_context.
Required Fields
To run the Faithfulness scorer, you must include the following fields in your Example:
input
actual_output
retrieval_context
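As a rough illustration, the required fields can be checked before an evaluation is submitted. The sketch below works on a plain dictionary; `REQUIRED_FIELDS` and `missing_fields` are hypothetical names introduced here and are not part of judgeval.

```python
# Hypothetical pre-flight check; not part of the judgeval API.
REQUIRED_FIELDS = ("input", "actual_output", "retrieval_context")


def missing_fields(example_fields):
    """Return the required field names that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not example_fields.get(name)]


candidate = {
    "input": "What's your return policy for a pair of socks?",
    "actual_output": "We offer a 30-day return policy for all items, including socks!",
}
print(missing_fields(candidate))
```

Here the candidate lacks retrieval_context, so it would need that field filled in before the Faithfulness scorer could run on it.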
Scorer Breakdown
Faithfulness scores are calculated by first extracting all statements (claims) in actual_output and then classifying which ones are contradicted by the retrieval_context.
A claim is considered faithful if it does not contradict any information in retrieval_context. A claim that the retrieval_context neither supports nor contradicts therefore still counts as faithful.
The score is calculated as:

faithfulness score = (number of faithful claims) / (total number of claims)
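To make the breakdown concrete, here is a minimal sketch of the final scoring step. It assumes the score is the fraction of extracted claims that the judge did not mark as contradicted, and it takes the judge's verdicts as given rather than calling an LLM. `Verdict` and `faithfulness_score` are hypothetical names for illustration, not judgeval internals.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """Hypothetical record: one extracted claim plus the judge's ruling."""
    claim: str
    contradicted: bool


def faithfulness_score(verdicts):
    """Fraction of claims not contradicted by the retrieval context (assumed formula)."""
    if not verdicts:
        # Assumption: with no claims, nothing can contradict the context.
        return 1.0
    faithful = sum(1 for v in verdicts if not v.contradicted)
    return faithful / len(verdicts)


verdicts = [
    Verdict("We offer a 30-day return policy.", contradicted=False),
    Verdict("The policy covers socks.", contradicted=False),
    Verdict("Returns are accepted for up to 90 days.", contradicted=True),
]
score = faithfulness_score(verdicts)
print(f"{score:.2f}")
```

With two of three claims uncontradicted, this sketch yields a score of about 0.67, which would fall below a threshold of 0.8.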
Sample Implementation
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

client = JudgmentClient()

# The example carries all three required fields.
example = Example(
    input="What's your return policy for a pair of socks?",
    actual_output="We offer a 30-day return policy for all items, including socks!",
    retrieval_context=["Return policy, all items: 30-day limit for full refund, no questions asked."]
)

# threshold is the minimum faithfulness score for the example to pass.
scorer = FaithfulnessScorer(threshold=0.8)

# model selects the LLM used as the judge.
results = client.run_evaluation(
    examples=[example],
    scorers=[scorer],
    model="gpt-4.1",
)
print(results)