Evaluation
Foundations of Agent Development
AI engineers can make countless tweaks to agent design, but how do they know whether those changes actually improve agent performance? Every prompt change, new tool, retrieval modification, and model selection can significantly impact agent quality, for better or worse.
Evals help AI engineers assess how modifications to agent design affect quality, enabling them to make data-driven decisions.
Decide what to measure
In most cases, the best evaluation targets are the pain points that appear most frequently—or most severely—in your agent's behavior. These often fall into one of three categories:
- Correctness: Is the agent producing factually accurate or logically sound responses?
- Goal completion: Is the agent successfully completing the task it was designed to handle?
- Task alignment: Is the agent following instructions, using tools appropriately, or responding in a way that's helpful and contextually aware?
To identify where to focus your evals:
- Trace your failures: Use logs or user feedback to identify common failure modes (e.g. "agent forgets prior instructions" or "agent ignores context information"). On large collections of traces, clustering can surface these patterns (see the sketch after this list).
- Map your issues: Group issues by quality vector (e.g. hallucination, tool misuse, low recall) to narrow down which metrics will be most informative.
- Decide eval granularity: Do you need to measure performance at the step level (e.g. instruction following in a single LLM response) or across the entire task?
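For example, a quick way to surface recurring failure modes is to cluster free-text failure notes pulled from your logs. The sketch below is a minimal illustration using scikit-learn's TfidfVectorizer and KMeans; the notes and the cluster count are placeholders you would replace with your own data, and embedding-based clustering will usually work better than TF-IDF on real traces.

```python
# Minimal sketch: cluster free-text failure notes to surface common failure modes.
# The notes below are illustrative placeholders; pull real ones from your logs.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

failure_notes = [
    "agent forgets prior instructions after the first tool call",
    "agent ignores context passed in the system prompt",
    "tool called with wrong argument types",
    "agent forgets earlier user constraints",
    "retrieval returns documents unrelated to the query",
]

# Represent each note with TF-IDF features and group them into a few clusters.
vectors = TfidfVectorizer(stop_words="english").fit_transform(failure_notes)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)

for cluster_id in sorted(set(labels)):
    print(f"Cluster {cluster_id}:")
    for note, label in zip(failure_notes, labels):
        if label == cluster_id:
            print(f"  - {note}")
```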
Select your eval metrics
Once you've identified the behaviors that matter, you can choose or design evals that surface meaningful signals on those behaviors.
We've constructed several out-of-the-box evals for common evaluation cases such as instruction adherence, hallucination, and tool selection/parameter accuracy.
Our out-of-the-box evals are organized by the agent module they're most relevant to:
- Planning
- Tool Calling
- General agent abilities
- Memory
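As a rough illustration of how one of these out-of-the-box evals is wired up, the hedged sketch below scores a single example with a faithfulness-style (hallucination) scorer. The names JudgmentClient, Example, FaithfulnessScorer, and run_evaluation follow judgeval's documented usage pattern, but treat them as assumptions and confirm the exact imports and signatures against the judgeval reference docs for your version.

```python
# Hedged sketch: running a built-in faithfulness (hallucination) eval on one example.
# Class and method names are assumptions; verify them against the judgeval docs.
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

client = JudgmentClient()  # assumes your Judgment API key is set in the environment

example = Example(
    input="What is our refund window?",
    actual_output="Refunds are accepted within 30 days of purchase.",
    retrieval_context=["Our policy allows refunds within 30 days of purchase."],
)

results = client.run_evaluation(
    examples=[example],
    scorers=[FaithfulnessScorer(threshold=0.9)],
    model="gpt-4o",  # judge model; swap in whichever model you use
)
print(results)
```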
Eval variants
There are three types of evaluators you can select or build—LLM judge, code, and annotations. Each has its strengths and weaknesses, depending on what you're attempting to measure.
Eval type | How it works | Ideal use cases |
---|---|---|
LLM-as-judge | Uses an LLM (or a system of agents) to evaluate and score outputs against defined criteria. | Great for subjective quality and well-defined objective assessments (tone, instruction adherence, hallucination). Poor for vague preferences or tasks that require deep subject-matter expertise. |
Code | Algorithms compute scores based on rules and patterns. | Great for reducing cost and latency, and for environment-based measurements such as tool execution and exceptions. Poor for qualitative measurements such as summarization quality. |
Annotations | Humans apply custom labels to agent traces. | Great for subject-matter expertise, direct application feedback, and "feels right" assessments. Poor for large-scale, cost-sensitive, or time-sensitive evaluation. |
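To make the code-based row concrete, here is a minimal, self-contained sketch of a rule-based scorer that checks whether a tool call's arguments parse as JSON and include every required parameter. It is plain Python rather than a judgeval API, and the tool name, required parameters, and call shape are assumptions for the example.

```python
# Minimal sketch of a code-based evaluator: score a tool call by whether its
# arguments parse as JSON and contain all required parameters.
import json

REQUIRED_PARAMS = {
    "search_flights": {"origin", "destination", "date"},  # hypothetical tool
}

def score_tool_call(tool_name: str, raw_arguments: str) -> float:
    """Return 1.0 if the call is well-formed and complete, else 0.0."""
    try:
        arguments = json.loads(raw_arguments)
    except json.JSONDecodeError:
        return 0.0  # malformed arguments
    if not isinstance(arguments, dict):
        return 0.0  # arguments must be a JSON object
    missing = REQUIRED_PARAMS.get(tool_name, set()) - arguments.keys()
    return 0.0 if missing else 1.0

# One complete call and one that is missing required parameters.
print(score_tool_call("search_flights", '{"origin": "SFO", "destination": "JFK", "date": "2025-06-01"}'))  # 1.0
print(score_tool_call("search_flights", '{"origin": "SFO"}'))  # 0.0
```

Because this scorer runs without any model calls, it is cheap enough to apply to every trace.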
Building your own evals
While we recommend starting with our built-in, plug-and-play evals, there will be times when your use case requires something custom. Perhaps you're working in a novel domain, have unique task definitions, or need to evaluate agent behavior against proprietary rules. In these cases, building your own evals is the best way to ensure you're measuring what matters.
Judgment's custom evals module allows you to define:
- What counts as a success or failure, using your own criteria.
- What data to evaluate—a specific step, entire agent trajectory, or even external influences.
- How to score results, whether via heuristics, LLM-as-a-judge, or human annotation.
In the judgeval library, you can build custom evals via:
- Custom scorers: Powerful and flexible; define your own scoring logic in code, with LLMs, or a combination of both.
- Classifier scorers: Lightweight LLM-as-judge scorers that classify outputs according to natural-language criteria.
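As a shape-only illustration of a custom scorer, the sketch below scores a response against a word budget and returns a score, a pass/fail flag, and a reason. The class and method names here (ConcisenessScorer, score_example, ScoreResult) are hypothetical stand-ins rather than judgeval's actual scorer interface; see the judgeval custom scorer docs for the real base class to subclass. A classifier scorer follows the same pattern but delegates the judgment to an LLM prompt built from your natural-language criteria.

```python
# Hedged sketch of a custom scorer's general shape. The class and method names
# are hypothetical stand-ins, not judgeval's actual scorer interface.
from dataclasses import dataclass

@dataclass
class ScoreResult:
    score: float
    passed: bool
    reason: str

class ConcisenessScorer:
    """Scores an agent response by how well it stays within a word budget."""

    def __init__(self, max_words: int = 150, threshold: float = 0.7):
        self.max_words = max_words
        self.threshold = threshold

    def score_example(self, actual_output: str) -> ScoreResult:
        word_count = len(actual_output.split())
        overshoot = max(0, word_count - self.max_words)
        score = max(0.0, 1.0 - overshoot / self.max_words)
        return ScoreResult(
            score=score,
            passed=score >= self.threshold,
            reason=f"{word_count} words against a budget of {self.max_words}",
        )

# Usage: plug the scorer into your own eval loop or wrap it for your framework.
result = ConcisenessScorer().score_example(
    "The flight departs SFO at 9:05 AM and lands at JFK at 5:40 PM."
)
print(result)
```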
What should I use evals for?
Once you've selected or built your evals, you can use them to accomplish many different goals.
Use case | Why use evals this way? |
---|---|
Unit testing | Catch regressions early in development by testing specific agent behaviors against predefined tasks. Ensures code changes (e.g. prompt, tool, or model updates) don't break existing functionality. |
Online evals | Continuously track agent performance in real time to alert on quality degradation, unusual patterns, or system failures, and trigger automated actions. |
A/B testing | Compare different agent versions or configurations to make data-driven decisions about which approach performs better on your key metrics, and see how your agent is improving (or regressing) over time. |
Optimization datasets | Create high-quality post-training data by using evals to filter and score agent outputs for fine-tuning or reinforcement learning. For instance, you can separate successful and failed agent traces into datasets for supervised and reinforcement learning. |
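As an illustration of the unit-testing row, the minimal sketch below gates a test suite on an eval score so a prompt or model change that drops quality below a threshold fails in CI. The run_agent function, the toy keyword-coverage eval, the prompt, and the 0.8 threshold are all placeholder assumptions; in practice you would call your agent and one of the scorers described above.

```python
# Hedged sketch: gating CI on an eval score with pytest.
# run_agent, the keyword eval, and the 0.8 threshold are placeholders.
import pytest

def run_agent(prompt: str) -> str:
    # Placeholder: call your agent here.
    return "Refunds are accepted within 30 days of purchase."

def keyword_coverage(output: str, required_keywords: list[str]) -> float:
    """Toy eval: fraction of required keywords present in the output."""
    hits = sum(1 for kw in required_keywords if kw.lower() in output.lower())
    return hits / len(required_keywords)

@pytest.mark.parametrize(
    "prompt,keywords",
    [("What is the refund window?", ["refund", "30 days"])],
)
def test_refund_answer_quality(prompt, keywords):
    output = run_agent(prompt)
    assert keyword_coverage(output, keywords) >= 0.8
```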
Learn more
To learn more about implementing evals in judgeval, check out our other docs. For a deep dive, see our evaluation feature section.