LLM Judges

Use natural language rubrics to evaluate agent behavior in your LLM system.

In the SDK, LLM Judges are exposed as PromptScorer.

You can combine LLM Judges with Code Judges (Custom Scorers) just like any other judge in judgeval. See the SDK Reference for implementation details.
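The section doesn't show SDK code for combining judges, but the idea can be sketched in plain Python. Everything below is illustrative, not the judgeval API: the function names, score shapes, and the 0.7 weight are assumptions.

```python
# Illustrative sketch: blending an LLM-judge score with a deterministic code judge.
# These names and shapes are hypothetical, NOT the judgeval SDK API.

def code_judge_contains_citation(output: str) -> float:
    """Deterministic code judge: 1.0 if the answer cites a source, else 0.0."""
    return 1.0 if "[source]" in output else 0.0

def combine_scores(llm_score: float, code_score: float, weight: float = 0.7) -> float:
    """Weighted blend of an LLM rubric score and a code-judge score."""
    return weight * llm_score + (1 - weight) * code_score

answer = "Paris is the capital of France. [source]"
llm_score = 0.9  # in practice, this would come from the LLM judge's rubric evaluation
final = combine_scores(llm_score, code_judge_contains_citation(answer))
print(round(final, 2))  # 0.7*0.9 + 0.3*1.0 = 0.93
```

A weighted blend is only one way to combine them; a code judge can also act as a hard gate (score forced to 0 when the deterministic check fails).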

Create an LLM Judge

Within your project, go to the Judges tab in the sidebar.

Navigate to Judges section

Create your judge

Click New Judge. Choose the type of judge (binary, classification, or score).

Judge Type Selection

Then, click Next to configure the judge.
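As a rough mental model of the three judge types, each one returns the same reasoning alongside a differently shaped value. The dataclass and field names below are illustrative assumptions, not the product's schema.

```python
# Illustrative model of the three judge types; names are assumptions.
from dataclasses import dataclass
from typing import Union

@dataclass
class BinaryVerdict:
    passed: bool      # binary judge: pass/fail
    reasoning: str

@dataclass
class ClassificationVerdict:
    label: str        # classification judge: one label from the configured choices
    reasoning: str

@dataclass
class ScoreVerdict:
    score: float      # score judge: a numeric value, e.g. 0.0-1.0
    reasoning: str

Verdict = Union[BinaryVerdict, ClassificationVerdict, ScoreVerdict]

v: Verdict = ScoreVerdict(score=0.8, reasoning="Mostly follows the rubric.")
print(type(v).__name__)  # ScoreVerdict
```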

Configure the judge

Name the judge and feel free to add a description.

Judges are versioned; each version has the following configuration options:

Model: Choose your judge model

Prompt: Define your evaluation criteria (supports {{ mustache template }} variables)

Choices (for categorical judges): The options the judge can pick from

Notes: Optional notes about the specific version
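Before the prompt is sent to the judge model, its {{ mustache }} variables are filled in with values from your test inputs. A minimal sketch of that substitution (not the platform's actual renderer):

```python
import re

def render_prompt(template: str, variables: dict) -> str:
    """Replace {{ name }} placeholders with values; unknown names are left intact."""
    def sub(match: re.Match) -> str:
        name = match.group(1)
        return str(variables.get(name, match.group(0)))
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", sub, template)

template = "Rate the helpfulness of this answer to '{{ question }}':\n{{ answer }}"
print(render_prompt(template, {"question": "What is 2+2?", "answer": "4"}))
```

Leaving unknown placeholders intact (rather than raising) makes it easy to spot a variable name that never got a value during testing.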

LLM Judge Configuration

Test the judge

Test your LLM Judge with custom inputs. You can enter these inputs manually or add them from a dataset. You can also add other judges or different versions of the same judge.

Then click Run Evaluation.

You'll see each judge's score and reasoning in the output.
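Downstream, each judge's output can be treated as a (score, reasoning) pair per judge. A hedged sketch of aggregating results from a test run; the dict shape and judge names here are assumptions, not the platform's export format:

```python
# Hypothetical result shape; the actual output format is defined by the platform.
results = [
    {"judge": "helpfulness-v2", "score": 0.8, "reasoning": "Answers the question directly."},
    {"judge": "tone-binary-v1", "score": 1.0, "reasoning": "Polite and professional."},
]

for r in results:
    print(f"{r['judge']}: {r['score']} - {r['reasoning']}")

avg = sum(r["score"] for r in results) / len(results)
print(round(avg, 2))  # (0.8 + 1.0) / 2 = 0.9
```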

LLM Judge Playground

Next Steps