Answer Relevancy
The answer relevancy scorer is a default LLM judge scorer that measures how relevant the LLM system's actual_output is to the input.
In practice, this scorer helps determine whether your RAG pipeline's generator produces relevant answers to the user's query.
There are many factors to consider when evaluating the quality of your RAG pipeline. judgeval offers a suite of default scorers to construct a comprehensive evaluation of each RAG component. Check out our guide on RAG system evaluation for a deeper dive! TODO add link here.
Required Fields
To run the answer relevancy scorer, you must include the following fields in your Example:
input
actual_output
Scorer Breakdown
AnswerRelevancy scores are calculated by extracting the statements made in the actual_output and classifying how many of them are relevant to the input.
The score is calculated as:
$$\text{Answer Relevancy} = \frac{\text{Number of Relevant Statements}}{\text{Total Number of Statements}}$$
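For intuition, here is a worked example of the arithmetic. The statements and their relevance labels below are hypothetical; in practice the extraction and classification are performed by the LLM judge, not by your own code.

# Hypothetical statements extracted from an actual_output answering a return-policy question.
extracted_statements = [
    "We offer a 30-day return policy.",               # relevant to the question
    "The policy covers all items, including socks.",  # relevant to the question
    "Sign up for our newsletter for 10% off!",        # not relevant to the question
]
num_relevant = 2
score = num_relevant / len(extracted_statements)
print(round(score, 2))  # 0.67, which would fall short of a 0.8 threshold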
Sample Implementation
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import AnswerRelevancyScorer

client = JudgmentClient()

example = Example(
    input="What's your return policy for a pair of socks?",
    # Replace this with your LLM system's output
    actual_output="We offer a 30-day return policy for all items, including socks!",
)

# supply your own threshold
scorer = AnswerRelevancyScorer(threshold=0.8)

results = client.run_evaluation(
    examples=[example],
    scorers=[scorer],
    model="gpt-4o",
)
print(results)
The AnswerRelevancy scorer uses an LLM judge, so you'll receive a reason for the score in the reason field of the results. This allows you to double-check the accuracy of the evaluation and understand how the score was calculated.
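As a quick sketch of how you might inspect that reasoning, continuing from the sample implementation above: the attribute names scorers_data, score, and reason are assumptions about the result objects, so check the result structure in your judgeval version.

# Assumed fields: each result is presumed to expose per-scorer data with a score and a reason.
for result in results:
    for scorer_data in result.scorers_data:
        print(scorer_data.score, scorer_data.reason)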