Introduction to Judgeval
Judgeval is an evaluation library for AI agents, developed and maintained by Judgment Labs. Designed for AI teams, it makes it easy to benchmark, test, and improve agents, covering critical aspects of their behavior such as multi-step reasoning and tool-calling accuracy.
Getting Started
Ready to dive in? Follow our Getting Started guide to set up Judgeval and run your first evaluations.
Judgeval integrates natively with the Judgment Labs Platform, allowing you to evaluate, unit test, and monitor LLM applications in the cloud.
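In practice, connecting to the platform amounts to exporting your credentials and constructing a client. The sketch below is illustrative only: the `JUDGMENT_API_KEY` / `JUDGMENT_ORG_ID` variable names and the `JudgmentClient` entry point are assumptions based on the Getting Started guide, which has the authoritative setup steps.

```python
# Minimal platform-connection sketch (assumed setup; see the Getting
# Started guide for the exact steps and variable names).
import os

from judgeval import JudgmentClient

# Credentials issued by the Judgment Labs Platform; prefer setting these
# in your shell or a .env file rather than hard-coding them.
os.environ.setdefault("JUDGMENT_API_KEY", "your-api-key")
os.environ.setdefault("JUDGMENT_ORG_ID", "your-org-id")

# Client used to run evaluations and log results to the cloud platform.
client = JudgmentClient()
```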
Judgeval was built by a passionate team of LLM researchers from Stanford, Datadog, and Together AI 💜.
Quickstarts
Running Judgeval for the first time? Check out the quickstarts or cookbooks below.

Evaluation
Experiment and optimize using Judgeval's evaluations.
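For a flavor of what an evaluation looks like, here is a minimal sketch that scores a single example for faithfulness to its retrieval context. The `Example`, `FaithfulnessScorer`, and `run_evaluation` names follow the quickstart; treat the exact parameters as illustrative and consult the Evaluation docs for current signatures.

```python
# Minimal evaluation sketch (names assumed from the quickstart).
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

client = JudgmentClient()

# One test case: the agent's answer plus the context it was grounded in.
example = Example(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    retrieval_context=[
        "All customers are eligible for a 30-day full refund at no extra cost."
    ],
)

# Score the example and log the results to the platform.
results = client.run_evaluation(
    examples=[example],
    scorers=[FaithfulnessScorer(threshold=0.5)],
    model="gpt-4o",
)
print(results)
```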
Monitoring
Trace, iterate, and compare using Judgeval's flexible monitoring system.
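Monitoring typically means wrapping your agent's functions with a tracer so each call is recorded as a span. The sketch below assumes the `Tracer` and `observe` names from the quickstart; the Monitoring docs describe the authoritative API.

```python
# Minimal tracing sketch (Tracer / observe assumed from the quickstart).
from judgeval.tracer import Tracer

judgment = Tracer(project_name="my_agent")  # traces are grouped by project


@judgment.observe(span_type="tool")
def search_web(query: str) -> str:
    # A stand-in tool call; each invocation is captured as a span.
    return f"results for {query}"


@judgment.observe(span_type="function")
def run_agent(question: str) -> str:
    # Nested calls show up as child spans in the trace.
    return search_web(question)


run_agent("What is Judgeval?")
```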
Insights
Gain insight from your agent's traces and evaluation data to drive improvements.