Derailment
The derailment scorer is a default LLM judge scorer that measures whether steps within your LLM system deviate from the initial conversation. Derailment is a common issue in agentic systems, where the LLM may gradually stray from the initial topic.
Scorer Breakdown
Derailment scores are calculated by determining the context from the first step in the Sequence and then evaluating every step in the Sequence to see whether it deviates from that generated context.
Derailment only considers the first step in the Sequence as the context.

Scorer Implementation
```python
from judgeval import JudgmentClient
from judgeval.data import Example, Sequence
from judgeval.scorers import DerailmentScorer

client = JudgmentClient()

airlines_example = Example(
    input="Which airlines fly to Paris?",
    actual_output="Air France, Delta, and American Airlines offer direct flights."
)
airline_followup = Example(
    input="Which airline is the best for a family of 4?",
    actual_output="Delta is the best airline for a family of 4."
)
weather_example = Example(
    input="What is the weather like in Texas?",
    actual_output="It's sunny with a high of 75°F in Texas."
)

airline_sequence = Sequence(
    name="Flight Details",
    items=[airlines_example, airline_followup, weather_example]
)

results = client.run_sequence_evaluation(
    sequences=[airline_sequence],
    scorers=[DerailmentScorer(threshold=0.5)],
    model="gpt-4o",
    log_results=True,
    override=True,
)
```
You would expect a derailment score of 0.66 for this airline_sequence, because only the last of the three steps deviates from the initial context.