# Contextual Relevancy
The contextual relevancy scorer is a default LLM judge scorer that measures how relevant the contexts in `retrieval_context` are to an `input`.
In practice, this scorer helps determine whether your RAG pipeline's retriever effectively retrieves relevant contexts for a query.
## Required Fields
To run the contextual relevancy scorer, you must include the following fields in your `Example`:
- `input`
- `actual_output`
- `retrieval_context`
## Scorer Breakdown
`ContextualRelevancy` scores are calculated by first extracting all statements in `retrieval_context` and then classifying which ones are relevant to the `input`. The score is then calculated as:

$$\text{Contextual Relevancy} = \frac{\text{Number of Relevant Statements}}{\text{Total Number of Statements}}$$
Our contextual relevancy scorer is based on Stanford NLP's ARES paper (Saad-Falcon et al., 2024).
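For intuition, here is a minimal sketch of that calculation. It is illustrative only: the actual scorer uses an LLM judge to extract and classify statements, whereas the helper below is hypothetical and simply takes pre-counted inputs to demonstrate the ratio.

```python
# Illustrative sketch of the contextual relevancy formula above.
# The real scorer derives these counts with an LLM judge; this
# hypothetical helper only demonstrates the ratio.
def contextual_relevancy_score(num_relevant: int, num_total: int) -> float:
    if num_total == 0:
        return 0.0
    return num_relevant / num_total

# e.g., 3 of 4 extracted statements judged relevant to the input
print(contextual_relevancy_score(3, 4))  # 0.75
```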
## Sample Implementation
```python
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import ContextualRelevancyScorer

client = JudgmentClient()

example = Example(
    input="What's your return policy for a pair of socks?",
    # Replace this with your LLM system's output
    actual_output="We offer a 30-day return policy for all items, including socks!",
    # Replace this with the contexts retrieved by your RAG retriever
    retrieval_context=["Return policy, all items: 30-day limit for full refund, no questions asked."]
)

# supply your own threshold
scorer = ContextualRelevancyScorer(threshold=0.8)

results = client.run_evaluation(
    examples=[example],
    scorers=[scorer],
    model="gpt-4o",
)
print(results)
```
The `ContextualRelevancy` scorer uses an LLM judge, so you'll receive a reason for the score in the `reason` field of the results. This allows you to double-check the accuracy of the evaluation and understand how the score was calculated.
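For instance, you might inspect the judge's reasoning like this (a minimal sketch; the exact result attributes such as `scorers_data`, `score`, and `reason` are assumptions here and may vary by judgeval version):

```python
# Hedged sketch of reading the judge's reasoning from the results.
# Attribute names below are assumptions about the result objects.
for result in results:
    for scorer_data in result.scorers_data:
        print(f"Score: {scorer_data.score}")
        print(f"Reason: {scorer_data.reason}")
```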