Prompt Scorers
A `PromptScorer` is a powerful tool for evaluating your LLM system using natural language criteria. Prompt scorers are great for quick prototyping: you can easily set up new evaluation criteria, test them on a few examples, and then benchmark all of your agents' test cases from production.
judgeval SDK
Creating a Prompt Scorer
You can create a `PromptScorer` by providing a natural language description of your evaluation task/criteria and a set of choices that an LLM judge can choose from when evaluating an example.
Specifically, you need to provide a `prompt` that describes the task/criteria. You can also use custom fields in your prompt with the mustache `{{variable_name}}` syntax (more details in the section below).
Here's an example of creating a `PromptScorer` that determines if a response is relevant to a request:
```python
from judgeval.scorers import PromptScorer

relevance_scorer = PromptScorer.create(
    name="Relevance Scorer",
    prompt="Is the request relevant to the response? The request is {{request}} and the response is {{response}}."
)
```
Options
You can also provide an `options` dictionary that maps each choice the scorer can select to a numeric score. Here's the relevance scorer from above, created with an options dictionary where a "Yes" judgment scores 1 and a "No" scores 0:
```python
from judgeval.scorers import PromptScorer

relevance_scorer = PromptScorer.create(
    name="Relevance Scorer",
    prompt="Is the request relevant to the response? The request is {{request}} and the response is {{response}}.",
    options={"Yes": 1, "No": 0}
)
```
Retrieving a Prompt Scorer
Once a prompt scorer has been created, you can retrieve it by name using the `get` class method. For example, if you had already created the Relevance Scorer from above, you can fetch it with the code below:
```python
from judgeval.scorers import PromptScorer

relevance_scorer = PromptScorer.get(
    name="Relevance Scorer",
)
```
Editing a Prompt Scorer
You can also edit a prompt scorer that you have already created. Use the `get_name`, `get_prompt`, and `get_options` methods to read the scorer's current fields, and the `set_prompt`, `set_options`, and `set_threshold` methods to update them. In addition, you can extend the existing prompt with the `append_to_prompt` method.
```python
from judgeval.scorers import PromptScorer

relevance_scorer = PromptScorer.get(
    name="Relevance Scorer",
)

# Add another sentence to the relevance scorer's prompt
relevance_scorer.append_to_prompt("Consider whether the response directly addresses the main topic, intent, or question presented in the request.")

# Update the options by reading them, modifying them, and setting them back
options = relevance_scorer.get_options()
options["Maybe"] = 0.5
relevance_scorer.set_options(options)

# Set the success threshold for the scorer
relevance_scorer.set_threshold(0.7)
```
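Continuing from the snippet above, you can read the updated fields back with the getter methods mentioned earlier; the values in the comments assume the scorer was created with the Yes/No options from the previous section:

```python
# Inspect the updated scorer with the getters described above
print(relevance_scorer.get_name())     # "Relevance Scorer"
print(relevance_scorer.get_prompt())   # the prompt, including the appended sentence
print(relevance_scorer.get_options())  # {"Yes": 1, "No": 0, "Maybe": 0.5}
```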
Defining Custom Fields
You can create your own custom fields by defining an example class that inherits from the base `Example` object. This lets you configure any fields you want to score. The field names must match the mustache variables in your prompt: to use the relevance scorer from above, you would define a custom `Example` subclass with `request` and `response` fields.
```python
from judgeval.data import Example

class CustomerRequest(Example):
    request: str
    response: str

example = CustomerRequest(
    request="Where is my package?",
    response="Your package will arrive tomorrow at 10:00 AM.",
)
```
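The same pattern works for any criteria. As a sketch, here's a hypothetical tone scorer (the `Tone Scorer` name, the `{{message}}` variable, and the `SupportMessage` class are all illustrative) showing that the subclass field names must mirror the mustache variables in the prompt:

```python
from judgeval.data import Example
from judgeval.scorers import PromptScorer

# Hypothetical scorer for illustration: {{message}} in the prompt
# must correspond to a `message` field on the Example subclass.
tone_scorer = PromptScorer.create(
    name="Tone Scorer",
    prompt="Is the tone of this support message polite and professional? The message is {{message}}.",
    options={"Yes": 1, "No": 0},
)

class SupportMessage(Example):
    message: str

example = SupportMessage(message="Your package will arrive tomorrow at 10:00 AM.")
```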
Using a Prompt Scorer
Prompt scorers can be used in the same way as any other scorer in `judgeval`. They can also be run alongside other scorers in a single evaluation run!
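For example, here's a minimal sketch of a batched evaluation that pairs the prompt scorer with a built-in scorer. It assumes the `JudgmentClient.run_evaluation` entry point and the `AnswerRelevancyScorer` built-in; check the SDK reference for the exact names and signatures in your version:

```python
from judgeval import JudgmentClient
from judgeval.scorers import AnswerRelevancyScorer, PromptScorer

# Assumed entry point for batched, client-side evaluation runs
client = JudgmentClient()

relevance_scorer = PromptScorer.get(name="Relevance Scorer")

# `example` is the CustomerRequest instance defined in the previous section
results = client.run_evaluation(
    examples=[example],
    scorers=[relevance_scorer, AnswerRelevancyScorer(threshold=0.5)],
    model="gpt-4.1",
)
```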
Putting it all together, you can create the prompt scorer, define your custom fields, and run the prompt scorer within your agentic system as shown below:
```python
from judgeval.tracer import Tracer
from judgeval.data import Example
from judgeval.scorers import PromptScorer

judgment = Tracer(project_name="prompt_scorer_test_project")

# Define the scorer
relevance_scorer = PromptScorer.create(
    name="Relevance Scorer",
    prompt="Is the request relevant to the response? The request is {{request}} and the response is {{response}}.",
    options={"Yes": 1, "No": 0}
)

# Define the custom Example class
class CustomerRequest(Example):
    request: str
    response: str

@judgment.observe(span_type="tool")
def llm_call(request: str):
    # Call your LLM here to get a response
    response = "Your package will arrive tomorrow at 10:00 AM."

    # Build the example from the request/response pair
    example = CustomerRequest(
        request=request,
        response=response,
    )

    # Then run your prompt scorer to evaluate the response
    judgment.async_evaluate(
        scorer=relevance_scorer,
        example=example,
        model="gpt-4.1",
    )
    return response

@judgment.observe(span_type="function")
def main():
    request = "Where is my package?"
    response = llm_call(request)
    print(response)

if __name__ == "__main__":
    main()
```
Judgment Platform
Navigate to the PromptScorer tab in the Judgment platform; you'll find it via the sidebar on the left. Here you can manage the scorers you have created, as well as create new ones.
Creating a Scorer
- Click the Create Scorer button in the top right corner. Enter a name and hit the Next button to go to the next page.
- On this page, you can create a prompt scorer by writing your criteria in natural language and supplying the custom fields from your custom `Example` class. You can then optionally supply a set of choices the scorer can select from when evaluating an example. Once you have provided these fields, hit the Create Scorer button to finish creating your scorer!
You can now use the scorer in your evaluation runs just like any other scorer in `judgeval`.
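A scorer created in the platform can be fetched in code exactly like one created via the SDK, using the `PromptScorer.get` call shown earlier (the scorer name below is a placeholder for whatever name you entered):

```python
from judgeval.scorers import PromptScorer

# Placeholder name: use the name you entered on the Create Scorer page
my_scorer = PromptScorer.get(name="My Platform Scorer")
```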
Scorer Playground
While creating a new scorer or editing an existing one, it can be helpful to get a general sense of how your scorer behaves. The scorer playground helps you test your `PromptScorer` with custom inputs.
On the page for the scorer you would like to test, select a model from the dropdown, enter custom inputs for the fields, and click the Run Scorer button. The LLM judge will then run an evaluation; once the results are ready, you will see the score, reason, and choice given by the judge.