Prompt Scorers
A PromptScorer is a powerful tool for evaluating your LLM system using use-case-specific, natural language rubrics.
PromptScorers make it easy to prototype your evaluation rubrics: you can quickly set up new criteria and test them on a few examples in the scorer playground, then evaluate your agents' behavior in production with real customer usage.
You can create a PromptScorer on the SDK or the Judgment Platform.
judgeval SDK
Quickstart
Prompt scorers can be used in the same way as any other scorer in judgeval. They can also be run in conjunction with other scorers in a single evaluation run!
You can create the prompt scorer, define your custom fields, and run the prompt scorer within your agentic system as shown below:
from judgeval.tracer import Tracer
from judgeval.data import Example
from judgeval.scorers import PromptScorer

judgment = Tracer(project_name="prompt_scorer_test_project")

# Define the scorer
relevance_scorer = PromptScorer.create(
    name="Relevance Scorer",
    prompt="Is the request relevant to the response? The request is {{request}} and the response is {{response}}.",
    options={"Yes": 1, "No": 0}
)

# Define the custom Example class
class CustomerRequest(Example):
    request: str
    response: str

@judgment.observe(span_type="tool")
def llm_call(request: str):
    # Call LLM here to get a response
    response = "Your package will arrive tomorrow at 10:00 AM."

    # Define the example
    example = CustomerRequest(
        request=request,
        response=response,
    )

    # Then run your prompt scorer to evaluate the response
    judgment.async_evaluate(
        scorer=relevance_scorer,
        example=example,
        model="gpt-5"
    )
    return response

@judgment.observe(span_type="function")
def main():
    request = "Where is my package?"
    response = llm_call(request)
    print(response)

if __name__ == "__main__":
    main()
Creating a Prompt Scorer
You can create a PromptScorer by providing a natural language description of your evaluation task/criteria and a set of choices that an LLM judge can choose from when evaluating an example.
Specifically, you need to provide a prompt that describes the task/criteria. You can also use custom fields in your prompt by using the mustache {{variable_name}} syntax (more details about how to do this in the section below).
Here's an example of creating a PromptScorer that determines if a response is relevant to a request:
from judgeval.scorers import PromptScorer

relevance_scorer = PromptScorer.create(
    name="Relevance Scorer",
    prompt="Is the request relevant to the response? The request is {{request}} and the response is {{response}}."
)
Options
You can also provide an options dictionary where you can specify possible choices for the scorer and assign scores to these choices.
Here's an example of creating a PromptScorer that determines if a response is relevant to a request, with the options dictionary:
from judgeval.scorers import PromptScorer

relevance_scorer = PromptScorer.create(
    name="Relevance Scorer",
    prompt="Is the request relevant to the response? The request is {{request}} and the response is {{response}}.",
    options={"Yes": 1, "No": 0}
)
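Options are not limited to a binary yes/no. The sketch below shows a three-way rubric with partial credit; the scorer name, prompt, and score values are illustrative choices, not anything required by the SDK:

from judgeval.scorers import PromptScorer

# Illustrative three-way rubric that gives partial credit
tone_scorer = PromptScorer.create(
    name="Tone Scorer",
    prompt="How polite is the response to the customer? The request is {{request}} and the response is {{response}}.",
    options={"Polite": 1, "Neutral": 0.5, "Rude": 0}
)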
Retrieving a Prompt Scorer
Once a prompt scorer has been created, you can retrieve it by name using the get class method on PromptScorer. For example, if you have already created the Relevance Scorer from above, you can fetch it with the code below:
from judgeval.scorers import PromptScorer

relevance_scorer = PromptScorer.get(
    name="Relevance Scorer",
)
Edit Prompt Scorer
You can also edit a prompt scorer that you have already created. Use the get_name, get_prompt, and get_options methods to read the fields of the scorer you created, and the set_prompt, set_options, and set_threshold methods to update them.
In addition, you can add to the prompt using the append_to_prompt method.
from judgeval.scorers import PromptScorer

relevancy_scorer = PromptScorer.get(
    name="Relevance Scorer",
)

# Add another sentence to the relevancy scorer prompt
relevancy_scorer.append_to_prompt("Consider whether the response directly addresses the main topic, intent, or question presented in the request.")

# Add a new option by reading the current options and setting the updated dictionary
options = relevancy_scorer.get_options()
options["Maybe"] = 0.5
relevancy_scorer.set_options(options)

# Set the success threshold for the scorer
relevancy_scorer.set_threshold(0.7)
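After editing, you can read the updated configuration back with the getter methods mentioned above. A quick sanity check might look like the snippet below; the commented values are what you would expect after the edits in this section, but the actual output depends on your scorer's current state:

from judgeval.scorers import PromptScorer

relevancy_scorer = PromptScorer.get(name="Relevance Scorer")

# Read back the scorer's current configuration
print(relevancy_scorer.get_name())     # e.g. "Relevance Scorer"
print(relevancy_scorer.get_prompt())   # prompt text, including the appended sentence
print(relevancy_scorer.get_options())  # e.g. {"Yes": 1, "No": 0, "Maybe": 0.5}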
Define Custom Fields
You can create your own custom fields by creating a custom Example class which inherits from the base Example object. This allows you to configure any fields you want to score.
For example, to use the relevance scorer from above, you would define a custom Example class with request and response fields.
from judgeval.data import Example

class CustomerRequest(Example):
    request: str
    response: str

example = CustomerRequest(
    request="Where is my package?",
    response="Your package will arrive tomorrow at 10:00 AM.",
)
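Custom Example classes are not limited to two fields. As a sketch, the class below adds a retrieved_context field (a hypothetical name chosen for illustration) and a prompt that references it with the same mustache syntax; any field referenced in the prompt should exist on the Example class:

from judgeval.data import Example
from judgeval.scorers import PromptScorer

# Hypothetical Example class with an extra field for retrieved context
class SupportTicket(Example):
    request: str
    response: str
    retrieved_context: str

grounding_scorer = PromptScorer.create(
    name="Grounding Scorer",
    prompt=(
        "Is the response grounded in the retrieved context? "
        "The request is {{request}}, the response is {{response}}, "
        "and the context is {{retrieved_context}}."
    ),
    options={"Yes": 1, "No": 0}
)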
Using a Prompt Scorer
Prompt scorers can be used in the same way as any other scorer in judgeval. They can also be run in conjunction with other scorers in a single evaluation run!
Putting it all together, you can retrieve a prompt scorer, define your custom fields, and run the prompt scorer within your agentic system as shown below:
from judgeval.tracer import Tracer
from judgeval.data import Example
from judgeval.scorers import PromptScorer

judgment = Tracer(project_name="prompt_scorer_test_project")

# Retrieve the scorer
relevance_scorer = PromptScorer.get(
    name="Relevance Scorer"
)

# Define the custom Example class
class CustomerRequest(Example):
    request: str
    response: str

@judgment.observe(span_type="tool")
def llm_call(request: str):
    # Call LLM here to get a response
    response = "Your package will arrive tomorrow at 10:00 AM."

    # Define the example
    example = CustomerRequest(
        request=request,
        response=response,
    )

    # Then run your prompt scorer to evaluate the response
    judgment.async_evaluate(
        scorer=relevance_scorer,
        example=example,
        model="gpt-4.1"
    )
    return response

@judgment.observe(span_type="function")
def main():
    request = "Where is my package?"
    response = llm_call(request)
    print(response)

if __name__ == "__main__":
    main()
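The example above traces a single prompt scorer, but as noted, prompt scorers can also run alongside other scorers in one evaluation run. The sketch below assumes your judgeval version exposes JudgmentClient.run_evaluation with examples, scorers, and model parameters (check the SDK reference for your release), and that a second prompt scorer such as the Tone Scorer sketched in the Options section already exists:

from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import PromptScorer

class CustomerRequest(Example):
    request: str
    response: str

client = JudgmentClient()

# Two prompt scorers that share the same {{request}}/{{response}} fields
relevance_scorer = PromptScorer.get(name="Relevance Scorer")
tone_scorer = PromptScorer.get(name="Tone Scorer")

example = CustomerRequest(
    request="Where is my package?",
    response="Your package will arrive tomorrow at 10:00 AM.",
)

# Both scorers evaluate the same example in a single evaluation run
client.run_evaluation(
    examples=[example],
    scorers=[relevance_scorer, tone_scorer],
    model="gpt-4.1",
)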
Trace Prompt Scorers
A TracePromptScorer is a special type of prompt scorer which runs on a full trace or subtree of a trace rather than on an Example or custom Example. You can use a TracePromptScorer if you want your scorer to have multiple trace spans as context for the LLM judge.
Creating a Trace Prompt Scorer
Creating a Trace Prompt Scorer is very similar to defining a Prompt Scorer. Since it is not evaluated over an Example object, there is no need for the mustache-syntax placeholders required by a regular PromptScorer.
The syntax for creating, retrieving, and editing the scorer is otherwise identical to the PromptScorer.
from judgeval.scorers import TracePromptScorer

trace_scorer = TracePromptScorer.create(
    name="Trace Scorer",
    prompt="Is the structure of this trace coherent?"
)
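Since retrieval and editing work the same way as for a regular PromptScorer, updating the trace scorer might look like the sketch below; the appended sentence, options, and threshold are illustrative values:

from judgeval.scorers import TracePromptScorer

trace_scorer = TracePromptScorer.get(name="Trace Scorer")

# Refine the criteria and add graded options, just as with a regular PromptScorer
trace_scorer.append_to_prompt("Consider whether the tool calls appear in a sensible order.")
trace_scorer.set_options({"Coherent": 1, "Partially coherent": 0.5, "Incoherent": 0})
trace_scorer.set_threshold(0.7)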
Running a Trace Prompt Scorer
You can run a trace prompt scorer through the observe decorator. Create a TraceScorerConfig object and pass the TracePromptScorer into it. The span that is observed and all of its child spans will be given to the LLM judge.
Putting it all together, you can run your trace prompt scorer within your agentic system as shown below:
from judgeval.tracer import Tracer, TraceScorerConfig
from judgeval.scorers import TracePromptScorer

judgment = Tracer(project_name="prompt_scorer_test_project")

# Retrieve the scorer
trace_scorer = TracePromptScorer.get(
    name="Trace Scorer"
)

@judgment.observe(span_type="function")
def sample_trace_span(sample_arg):
    print(f"This is a sample trace span with sample arg {sample_arg}")

@judgment.observe(span_type="function", scorer_config=TraceScorerConfig(scorer=trace_scorer, model="gpt-5"))
def main():
    sample_trace_span("test")

if __name__ == "__main__":
    main()
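Because the judge receives the observed span and all of its children, you can also attach the TraceScorerConfig to an inner function to score just that subtree instead of the whole trace. The sketch below uses hypothetical plan_step and lookup_step functions purely for illustration:

from judgeval.tracer import Tracer, TraceScorerConfig
from judgeval.scorers import TracePromptScorer

judgment = Tracer(project_name="prompt_scorer_test_project")
trace_scorer = TracePromptScorer.get(name="Trace Scorer")

@judgment.observe(span_type="tool")
def lookup_step(task: str):
    print(f"Looking up: {task}")

# Only this span and its children (lookup_step) are sent to the LLM judge
@judgment.observe(span_type="function", scorer_config=TraceScorerConfig(scorer=trace_scorer, model="gpt-5"))
def plan_step(task: str):
    lookup_step(task)

@judgment.observe(span_type="function")
def main():
    # The outer span is traced but not scored; only plan_step's subtree is evaluated
    plan_step("test")

if __name__ == "__main__":
    main()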
Judgment Platform
Navigate to the Scorers tab in the Judgment platform, which you'll find in the sidebar on the left. Ensure you are on the PromptScorer section.
Here, you can manage the prompt scorers that you have created and create new ones.
Creating a Scorer
- Click the New PromptScorer button in the top right corner. Enter a name, select the type of scorer, and hit the Next button to go to the next page.
- On this page, create your prompt scorer by writing your criteria in natural language and supplying the custom fields from your custom Example class. In addition, set the threshold the score returned by the LLM judge must meet to be considered a success. Then, you can optionally supply a set of choices the scorer can select from when evaluating an example. Once you have provided these fields, hit the Create Scorer button to finish creating your scorer!
You can now use the scorer in your evaluation runs just like any other scorer in judgeval.
Scorer Playground
While creating a new scorer or editing an existing one, it can be helpful to get a general sense of how your scorer behaves. The scorer playground helps you test your PromptScorer with custom inputs.
On the page for the scorer you would like to test, select a model from the dropdown, enter custom inputs for the fields, and click the Run Scorer button.
The LLM judge will then run an evaluation. Once the results are ready, you will see the score, reason, and choice given by the judge.
Next Steps
Ready to use your custom scorers in production? Learn how to monitor agent behavior with online evaluations.
Monitor Agent Behavior in Production
Use Custom Scorers to continuously evaluate your agents in real-time production environments.