LLM Judges
LLM Judges (PromptScorer in the SDK) use natural language rubrics to evaluate agent behavior in your LLM system.
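To make the idea concrete, here is a minimal sketch of how a rubric-based binary judge works. This is an illustration of the concept, not the SDK's actual PromptScorer API, and the model call is stubbed out.

```python
# A minimal sketch of a rubric-based binary judge -- illustrative only,
# not the SDK's PromptScorer API. The model call is a stand-in.

RUBRIC = (
    "You are evaluating an agent's reply.\n"
    "Criteria: the reply must directly answer the user's question.\n"
    "Answer with exactly 'pass' or 'fail', then a short reason."
)

def call_model(prompt: str) -> str:
    # Stand-in for a real LLM call; a deployed judge would send `prompt`
    # to the configured judge model and return its response.
    return "pass: the reply addresses the question directly."

def judge(user_question: str, agent_reply: str) -> tuple[bool, str]:
    # Fill the rubric with the inputs, ask the model, parse the verdict.
    prompt = f"{RUBRIC}\n\nQuestion: {user_question}\nReply: {agent_reply}"
    raw = call_model(prompt)
    verdict, _, reason = raw.partition(":")
    return verdict.strip().lower() == "pass", reason.strip()

ok, reason = judge("What is 2+2?", "2+2 equals 4.")
```

The judge types below (binary, classification, score) differ only in what the parsed verdict looks like: a pass/fail, a category label, or a number.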
Create an LLM Judge
Create your judge
Click New Judge. Choose the type of judge (binary, classification, or score).


Then, click Next to configure the judge.
Configure the judge
Name the judge and optionally add a description.
Judges are versioned; each version has the following configuration options:
Model: Choose your judge model
Prompt: Define your evaluation criteria (supports {{ mustache template }} variables)
Choices (for categorical judges): The options the judge can pick from
Notes: Notes about the specific version
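The configuration above can be pictured as a small record per version, with the prompt's mustache variables filled in at evaluation time. The field and variable names below are illustrative assumptions, not a guaranteed SDK schema, but the `{{ variable }}` substitution shown is how mustache-style templates behave.

```python
import re

# Illustrative judge-version record; field names mirror the options above,
# not a guaranteed SDK schema.
judge_version = {
    "model": "gpt-4o",  # assumed model name, for illustration
    "prompt": "Rate whether {{ actual_output }} answers {{ input }}.",
    "choices": {"yes": 1.0, "no": 0.0},  # categorical judges only
    "notes": "v2: tightened the criteria wording",
}

def render(template: str, variables: dict) -> str:
    # Minimal mustache-style substitution for {{ variable }} placeholders;
    # unknown variables are left untouched.
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: str(variables.get(m.group(1), m.group(0))),
        template,
    )

prompt = render(
    judge_version["prompt"],
    {"input": "What is 2+2?", "actual_output": "4"},
)
```

At evaluation time the platform renders the prompt like this for each input before sending it to the judge model.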


Test the judge
Test your LLM Judge with custom inputs. You can enter inputs manually or add them from a dataset. You can also add other judges, or different versions of the same judge, to compare them side by side.
Then click Run Evaluation.
The output shows each judge's score and reasoning for every input.
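The shape of that output can be sketched as one (score, reasoning) row per judge per input. The judge function below is a stub standing in for a configured judge, not an SDK call; the row structure is the point.

```python
# Sketch of what running an evaluation produces: one row with a score and
# reasoning per judge per input. The judge here is a stand-in, not an SDK call.

def helpfulness_v1(example: dict) -> dict:
    # Stubbed judge: a real judge would render its prompt and call the
    # configured model, as in the earlier sketches.
    answered = example["actual_output"] != ""
    return {
        "score": 1.0 if answered else 0.0,
        "reasoning": "reply is non-empty" if answered else "empty reply",
    }

judges = {"helpfulness/v1": helpfulness_v1}
inputs = [
    {"input": "What is 2+2?", "actual_output": "4"},
    {"input": "Capital of France?", "actual_output": ""},
]

results = [
    {"input": ex["input"], "judge": name, **fn(ex)}
    for ex in inputs
    for name, fn in judges.items()
]
```

Adding a second judge, or a second version of the same judge, just adds more rows per input, which is what makes side-by-side comparison possible.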


Next Steps
- Monitor Agent Behavior in Production - Deploy your judges for real-time agent evaluation.

