Evaluation
Foundations of Agent Development
AI engineers can make countless tweaks to agent design, but how do they know which changes actually improve agent performance? Every prompt change, tool addition, and model selection can significantly impact agent quality, for better or worse. Evals answer this question by giving you systematic, repeatable measurements of agent behavior, so you can quantify the impact of each change.
Decide what to measure
In most cases, the best evaluation targets are the pain points that appear most frequently—or most severely—in your agent's behavior. These often fall into one of three categories, each sketched as a crude scoring function after the list:
- Correctness: Is the agent producing factually accurate or logically sound responses?
- Goal completion: Is the agent successfully completing the task it was designed to handle?
- Task alignment: Is the agent following instructions, using tools appropriately, or responding in a way that's helpful and contextually aware?
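To make these categories concrete, here is a minimal, illustrative sketch in plain Python (not judgeval's API). The `AgentRun` container and all three checks are hypothetical stand-ins for whatever signals your agent actually exposes.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRun:
    """Hypothetical container for one agent interaction."""
    user_request: str
    final_answer: str
    tools_called: list[str] = field(default_factory=list)


def correctness(run: AgentRun, reference_answer: str) -> float:
    """Crude correctness check: exact match against a known-good reference answer."""
    return 1.0 if run.final_answer.strip() == reference_answer.strip() else 0.0


def goal_completion(run: AgentRun, required_phrase: str) -> float:
    """Crude goal-completion check: does the answer contain the expected outcome?"""
    return 1.0 if required_phrase.lower() in run.final_answer.lower() else 0.0


def task_alignment(run: AgentRun, allowed_tools: set[str]) -> float:
    """Fraction of tool calls that stayed within the tools the task allows."""
    if not run.tools_called:
        return 1.0
    return sum(t in allowed_tools for t in run.tools_called) / len(run.tools_called)


run = AgentRun(
    user_request="Cancel my order #123",
    final_answer="Your order #123 has been cancelled and refunded.",
    tools_called=["lookup_order", "cancel_order"],
)
print(goal_completion(run, "cancelled"))                      # 1.0
print(task_alignment(run, {"lookup_order", "cancel_order"}))  # 1.0
```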
Select your eval metrics
Once you've identified the behaviors that matter, you can choose a pre-built scorer or design custom evals that surface meaningful signals on those behaviors.
Try built-in evals
We've built several plug-and-play evals for common cases such as instruction adherence, hallucination, and tool selection/parameter accuracy.
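For example, a hallucination-style check can be run with a built-in scorer. The snippet below follows the pattern from judgeval's public README (JudgmentClient, Example, FaithfulnessScorer); treat the exact class names, parameters, and model string as assumptions and check the current docs before copying.

```python
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

client = JudgmentClient()  # reads your Judgment API key from the environment

# One evaluation example: the agent's answer plus the context it was grounded in.
example = Example(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    retrieval_context=["All customers get a full refund within 30 days at no extra cost."],
)

# FaithfulnessScorer flags claims in the output that are not supported by the context.
results = client.run_evaluation(
    examples=[example],
    scorers=[FaithfulnessScorer(threshold=0.5)],
    model="gpt-4o",  # judge model; swap in whichever model you use
)
print(results)
```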
Eval Variants
Generally, there are three types of evaluation mechanisms: LLM-as-judge, code, and annotations.
| Eval Type | How it works | Use cases |
|---|---|---|
| LLM-as-judge | Uses an LLM or system of agents to evaluate and score outputs against defined criteria. | Great for subjective quality or well-defined objective assessments (tone, instruction adherence, hallucination). Poor for vague preferences or tasks requiring deep subject-matter expertise. |
| Code | Algorithms compute scores based on rules and patterns (see the example after this table). | Great for reducing cost and latency, and for environment-based measurements like tool execution, exceptions, etc. Poor for qualitative measurements such as summarization quality. |
| Annotations | Humans provide custom labels on agent traces. | Great for subject-matter expertise, direct application feedback, and "feels right" assessments. Poor for large-scale, cost-effective, or time-sensitive evaluations. |
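As a concrete illustration of the Code row, here is a small rule-based scorer written in plain Python (not a judgeval built-in); the trace format is hypothetical, but the idea carries over to any structured agent trace.

```python
def score_tool_execution(trace: list[dict]) -> float:
    """Return the fraction of tool calls in a trace that completed without an exception."""
    tool_calls = [step for step in trace if step.get("type") == "tool_call"]
    if not tool_calls:
        return 0.0  # no tool usage at all counts as a failure for this metric
    ok = sum(1 for call in tool_calls if call.get("error") is None)
    return ok / len(tool_calls)


trace = [
    {"type": "llm_call", "prompt": "Plan the steps"},
    {"type": "tool_call", "name": "web_search", "error": None},
    {"type": "tool_call", "name": "calculator", "error": "ZeroDivisionError"},
]
print(score_tool_execution(trace))  # 0.5: one of the two tool calls raised
```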
Building your own evals
Perhaps you're working in a novel domain, have unique task definitions, or need to evaluate agent behavior against proprietary rules. In these cases, building your own evals is the best way to ensure you're measuring what matters.
Judgment's custom evals module allows you to define:
- What counts as a success or failure, using your own criteria.
- What data to evaluate—a specific step or an entire agent trajectory.
- Whether to score results via heuristics, LLM-as-a-judge, or human annotation.
In judgeval, you can build custom evals via:
- Custom scorers: powerful and flexible; define your own scoring logic in code, with LLMs, or a combination of both (see the sketch after this list).
- Classifier scorers: lightweight, simple LLM-as-judge scorers that classify outputs according to natural language criteria.
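The sketch below shows the kind of logic a custom scorer typically encapsulates: a deterministic heuristic combined with an LLM judge, compared against a pass/fail threshold. It is plain Python rather than judgeval's scorer interface, and the `llm_judge` helper is a hypothetical stand-in for a call to your model provider.

```python
import re


def llm_judge(prompt: str) -> float:
    """Hypothetical helper: ask an LLM for a 0-1 quality score."""
    raise NotImplementedError("wire this up to your model provider")


def score_support_reply(user_msg: str, agent_reply: str, threshold: float = 0.7) -> bool:
    # Heuristic check: every reply must reference a ticket ID like "SUP-1234".
    has_ticket = bool(re.search(r"\b[A-Z]{2,5}-\d{3,6}\b", agent_reply))

    # LLM-judge check: subjective helpfulness and tone.
    quality = llm_judge(
        "Rate from 0 to 1 how helpfully and politely this reply answers the question.\n"
        f"Question: {user_msg}\nReply: {agent_reply}\nRespond with only the number."
    )

    # Combine both signals and convert to a pass/fail decision.
    score = 0.4 * has_ticket + 0.6 * quality
    return score >= threshold
```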
What should I use evals for?
Once you've selected or built your evals, you can use them to accomplish many different goals.
| Use Case | Why Use Evals This Way? |
|---|---|
| Unit Testing | Catch regressions early in development by testing specific agent behaviors against predefined tasks (see the example after this table). Ensures code changes (e.g. prompt, tool, model updates) don't break existing functionality. |
| Online Evals | Continuously track agent performance in real time to alert on quality degradation, unusual patterns, or system failures, and take automated actions. |
| A/B Testing | Compare different agent versions or configurations to make data-driven decisions about which approach performs better on your key metrics. See how your agent is improving (or regressing) over time. |
| Optimization Datasets | Create high-quality post-training data by using evals to filter and score agent outputs, which can then be used for fine-tuning or reinforcement learning. For instance, you can separate successful and failed agent traces to create datasets for supervised and reinforcement learning. |
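For the unit-testing row, a regression suite can be as simple as a pytest file that runs the agent on fixed tasks and asserts on an eval score. Here, `run_agent` is a hypothetical entry point for your own agent, and the keyword check is a stand-in for whichever scorer you actually use.

```python
import pytest

from my_agent import run_agent  # hypothetical import: your agent's entry point

# Fixed tasks with the behavior we expect; extend this list as new regressions are found.
FIXED_TASKS = [
    ("Refund the duplicate charge on order #1001", "refund"),
    ("Summarize the attached meeting notes in three bullet points", "summary"),
]


@pytest.mark.parametrize("task,expected_keyword", FIXED_TASKS)
def test_agent_goal_completion(task, expected_keyword):
    answer = run_agent(task)
    assert expected_keyword in answer.lower(), (
        f"Agent no longer addresses the goal for task: {task!r}"
    )
```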
Learn more
To learn more about implementing evals in judgeval, check out our other docs. For a deep dive into evals, see the evaluation feature section.