Examples
Overview
An Example is a basic unit of data in judgeval that allows you to run evaluation scorers on your LLM system.
An Example can be composed of a mixture of the following fields:
- input [Optional]
- actual_output [Optional]
- expected_output [Optional]
- retrieval_context [Optional] (Python)
- context [Optional]
Here's a sample of creating an Example:
from judgeval.data import Example

example = Example(
    input="Who founded Microsoft?",
    actual_output="Bill Gates and Paul Allen.",
    expected_output="Bill Gates and Paul Allen founded Microsoft in New Mexico in 1975.",
    retrieval_context=["Bill Gates co-founded Microsoft with Paul Allen in 1975."],
    context=["Bill Gates and Paul Allen are the founders of Microsoft."],
)
The input and actual_output fields are required for all examples. However, you don't need to use them in your evaluations. For example, if you're evaluating whether a chatbot's response is friendly, you don't need to use input. Other fields are optional and depend on the type of evaluation. If you want to detect hallucinations in a RAG system, you'd use the retrieval_context field (Python) or retrievalContext (TypeScript) for the Faithfulness scorer.
Example Fields
Here, we cover the possible fields that make up an Example.
Input
The input field represents a sample interaction between a user and your LLM system. The input should represent the direct input to your prompt template(s), and SHOULD NOT CONTAIN your prompt template itself.
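To make the distinction concrete, here is a minimal sketch in plain Python. The template text, the PROMPT_TEMPLATE name, and the build_prompt helper are assumptions made up for illustration; they are not part of judgeval.

```python
# Hypothetical prompt template; the name and wording are assumptions.
PROMPT_TEMPLATE = (
    "You are a helpful assistant. Answer concisely.\n\n"
    "Question: {question}"
)


def build_prompt(question: str) -> str:
    """Fill the template with the raw user input before calling the LLM."""
    return PROMPT_TEMPLATE.format(question=question)


question = "Who founded Microsoft?"

# What the LLM receives: the template plus the user input.
prompt = build_prompt(question)

# What belongs in the Example's input field: the raw user input only.
example_fields = {"input": question}

print(prompt)
print(example_fields["input"])
```

Keeping the template out of input means the same Example can likely be reused when the prompt template changes.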
Actual Output
The actual_output field represents what your LLM system outputs based on the input.
This is the actual output of your LLM system, either generated at evaluation time or loaded from saved answers.
# Sample app implementation
# medical_chatbot stands in for your own application module
import medical_chatbot
from judgeval.data import Example

question = "Is sparkling water healthy?"
example = Example(
    input=question,
    actual_output=medical_chatbot.chat(question)
)
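The sample above generates actual_output at evaluation time. For the saved-answers case, a rough sketch might look like the following. The JSON layout (the "question" and "answer" keys), the second record, and the to_example_fields helper are assumptions for illustration, not a judgeval format.

```python
import json

# Hypothetical saved answers, e.g. exported from application logs.
# The "question"/"answer" layout is an assumption for this sketch.
saved_answers_json = """
[
  {"question": "Is sparkling water healthy?",
   "answer": "Sparkling water is neither healthy nor unhealthy."},
  {"question": "Placeholder question?",
   "answer": "Placeholder answer."}
]
"""


def to_example_fields(records):
    """Map saved question/answer records onto Example keyword arguments."""
    return [
        {"input": record["question"], "actual_output": record["answer"]}
        for record in records
    ]


records = json.loads(saved_answers_json)
example_fields = to_example_fields(records)

# Each dict could then be passed along as Example(**fields).
for fields in example_fields:
    print(fields)
```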
Expected Output
The expected_output field is Optional[str] and represents the ideal output of your LLM system.
One great part of judgeval's scorers is that they use LLMs, which have flexible evaluation criteria. You don't need to worry about your expected_output perfectly matching your actual_output!
To learn more about how judgeval's scorers work, please see the scorer docs.
# Sample app implementation
import medical_chatbot
from judgeval.data import Example

question = "Is sparkling water healthy?"
example = Example(
    input=question,
    actual_output=medical_chatbot.chat(question),
    expected_output="Sparkling water is neither healthy nor unhealthy."
)
Context
The context field is Optional[List[str]] and represents information that is supplied to the LLM system as ground truth.
For instance, context could be a list of facts that the LLM system is aware of. However, context should not be confused with retrieval_context.
In judgeval, context retrieved from a vector database is represented by retrieval_context, not context. If you're building a RAG system, you'll want to use retrieval_context.
# Sample app implementation
import medical_chatbot
from judgeval.data import Example

question = "Is sparkling water healthy?"
example = Example(
    input=question,
    actual_output=medical_chatbot.chat(question),
    expected_output="Sparkling water is neither healthy nor unhealthy.",
    context=["Sparkling water is a type of water that is carbonated."]
)
Retrieval Context
The retrieval_context field is Optional[List[str]] and represents the context that is actually retrieved from a vector database.
This is often the context that is used to generate the actual_output in a RAG system.
Some common cases for using retrieval_context are:
- Checking for hallucinations in a RAG system
- Evaluating the quality of a retriever model by comparing retrieval_context to context
# Sample app implementation
import medical_chatbot
from judgeval.data import Example

question = "Is sparkling water healthy?"
example = Example(
    input=question,
    actual_output=medical_chatbot.chat(question),
    expected_output="Sparkling water is neither healthy nor unhealthy.",
    context=["Sparkling water is a type of water that is carbonated."],
    retrieval_context=["Sparkling water is carbonated and has no calories."]
)
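As a rough illustration of comparing retrieval_context to context, the sketch below measures how many ground-truth context entries are covered by at least one retrieved passage, using plain token overlap. This is a simplification and not how judgeval's scorers work; the function names and the 0.5 threshold are assumptions chosen for the example.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercase a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def context_coverage(
    context: List[str],
    retrieval_context: List[str],
    threshold: float = 0.5,
) -> float:
    """Fraction of context entries whose tokens are at least `threshold`
    covered by a single retrieved passage."""
    if not context:
        return 0.0
    retrieved = [tokens(passage) for passage in retrieval_context]
    covered = 0
    for fact in context:
        fact_tokens = tokens(fact)
        if not fact_tokens:
            continue
        # Best overlap ratio between this fact and any retrieved passage.
        best = max(
            (len(fact_tokens & r) / len(fact_tokens) for r in retrieved),
            default=0.0,
        )
        if best >= threshold:
            covered += 1
    return covered / len(context)


score = context_coverage(
    context=["Sparkling water is a type of water that is carbonated."],
    retrieval_context=["Sparkling water is carbonated and has no calories."],
)
print(score)
```

Token overlap ignores paraphrase and word order, so an LLM-based scorer is probably the better choice for real evaluations; the sketch only shows which two fields are being compared.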
Custom Examples
Custom Examples are a way to evaluate your LLM system with more complex data structures. A Custom Example is composed of the same fields as an Example, but the fields are dictionaries instead of strings.
from judgeval.data import CustomExample

example = CustomExample(
    input={"question": "Is sparkling water healthy?"},
    actual_output={"answer": "Sparkling water is neither healthy nor unhealthy."},
)
Custom Examples are designed to be used with our Custom Scorers. See the Custom Scorers documentation for more information.
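To give a feel for why dictionary fields are useful, here is a small sketch of a check that reads a nested key. It uses plain dicts that mirror the CustomExample fields above and a hypothetical answer_is_present function; it does not use or represent the Custom Scorers API.

```python
# Plain dicts mirroring the CustomExample fields shown above.
custom_fields = {
    "input": {"question": "Is sparkling water healthy?"},
    "actual_output": {
        "answer": "Sparkling water is neither healthy nor unhealthy."
    },
}


def answer_is_present(fields: dict) -> bool:
    """Hypothetical check: the output dict holds a non-empty string
    under the "answer" key."""
    answer = fields.get("actual_output", {}).get("answer")
    return isinstance(answer, str) and bool(answer.strip())


print(answer_is_present(custom_fields))
```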
Conclusion
Congratulations! 🎉
You've learned how to create an Example and can begin using Examples to execute evaluations or create datasets.