Building Useful Evaluations for AI Agents
How to build effective evaluations for AI agents to measure behavior and improve their performance
AI engineers can make countless tweaks to agent design, but how do they know which changes actually improve agent performance? Every prompt change, tool addition, and model selection can significantly impact agent quality, for better or worse. Well-designed evaluations (evals) give you a systematic way to measure that impact instead of guessing.
Decide what to measure
In most cases, the best evaluation targets are the pain points that appear most frequently—or most severely—in your agent's behavior. These often fall into one of three categories:
Correctness: Is the agent producing factually accurate or logically sound responses?
Goal completion: Is the agent successfully completing the task it was designed to handle?
Task alignment: Is the agent following instructions, using tools appropriately, or responding in a way that's helpful and contextually aware?
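Each of these categories maps onto concrete eval cases. As a minimal sketch (the dataclass and field names here are illustrative, not part of any particular framework), the targets you want to measure can be written down as scenarios plus success criteria:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    """One scenario the agent should handle, plus what 'good' looks like."""
    prompt: str                           # input given to the agent
    category: str                         # "correctness", "goal_completion", or "task_alignment"
    expected_behavior: str                # plain-language success criterion for a judge or annotator
    reference_answer: str | None = None   # optional ground truth, when one exists

cases = [
    EvalCase(
        prompt="What year was the transistor invented?",
        category="correctness",
        expected_behavior="States 1947 without inventing alternative dates.",
        reference_answer="1947",
    ),
    EvalCase(
        prompt="Book me a table for two tomorrow at 7pm.",
        category="goal_completion",
        expected_behavior="Calls the reservation tool and confirms date, time, and party size.",
    ),
    EvalCase(
        prompt="Summarize this contract in plain English, no legal jargon.",
        category="task_alignment",
        expected_behavior="Follows the 'no jargon' instruction and stays grounded in the contract text.",
    ),
]
```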
Select your eval metrics
Once you've identified the behaviors that matter, you can design custom evals that surface meaningful signals on those behaviors.
Eval Variants
Generally, there are two types of evaluation mechanisms: LLM-as-judge and annotations.
| Eval Type | How it works | Use cases | 
|---|---|---|
| LLM-as-judge | Uses an LLM, or a system of agents orchestrated in code, to evaluate and score outputs against defined criteria. | Great for subjective quality or well-defined objective assessments (tone, instruction adherence, hallucination). Poor for vague preferences or judgments that require deep subject-matter expertise. | 
| Annotations | Humans provide custom labels on agent traces. | Great for subject-matter expertise, direct application feedback, and "feels right" assessments. Poor fit when evaluation needs to be large-scale, low-cost, or fast to turn around. | 
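To make the LLM-as-judge row concrete, here is a minimal sketch of a judge built directly on the OpenAI Python client. The criteria string, the model name, and the bare 0-to-1 score format are assumptions for illustration; a production judge would typically use a structured output schema and a more detailed rubric.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

JUDGE_INSTRUCTIONS = """You are grading an AI agent's response.
Criterion: the response must follow the user's instructions and must not
introduce facts unsupported by the provided context.
Reply with a single number between 0 and 1, where 1 means fully satisfied."""

def llm_judge_score(user_input: str, agent_output: str, context: str = "") -> float:
    """Ask a judge model to score one agent output against the criterion above."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; swap in whatever you use
        messages=[
            {"role": "system", "content": JUDGE_INSTRUCTIONS},
            {"role": "user", "content": (
                f"User input:\n{user_input}\n\n"
                f"Context:\n{context}\n\n"
                f"Agent output:\n{agent_output}"
            )},
        ],
        temperature=0,
    )
    return float(response.choices[0].message.content.strip())
```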
Building your own evals
Perhaps you're working in a novel domain, have unique task definitions, or need to evaluate agent behavior against proprietary rules. In these cases, building your own evals is the best way to ensure you're measuring what matters.
Judgment's custom evals module allows you to define:
- What counts as a success or failure, using your own criteria.
- What data to evaluate—a specific step or an entire agent trajectory.
- Whether to score results via heuristics, LLM-as-a-judge, or human annotation.
In judgeval, you can build custom evals via:
- Custom Scorers: powerful and flexible; define your own scoring logic in code, with LLMs, or a combination of both.
- Prompt Scorers: lightweight, simple LLM-as-judge scorers that classify outputs according to natural-language criteria.
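As a rough illustration of the difference (the names below are hypothetical, not judgeval's actual API; see the judgeval docs for the real interfaces): a custom scorer is arbitrary code that returns a score over a step or trajectory, while a prompt scorer is just a natural-language rubric handed to a judge model.

```python
# Hypothetical shapes, for illustration only -- not judgeval's real classes.

def tool_order_scorer(trace: list[dict]) -> float:
    """Custom scorer: a pure-code heuristic over an agent trajectory.
    Passes (1.0) only if the agent searched the knowledge base before answering."""
    tool_calls = [step["tool"] for step in trace if step.get("type") == "tool_call"]
    return 1.0 if tool_calls and tool_calls[0] == "search_knowledge_base" else 0.0

PROMPT_SCORER_CRITERIA = """Return 'pass' if the agent's final answer is polite,
answers the user's question directly, and cites at least one retrieved document.
Otherwise return 'fail'."""
# A prompt scorer wraps criteria like this in an LLM-as-judge call; you write the
# rubric in natural language rather than hand-coding the scoring logic.
```

In practice, custom scorers earn their keep when the rule is objective and cheap to compute, while prompt scorers are faster to stand up for fuzzier criteria.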
What should I use evals for?
Once you've selected or built your evals, you can use them to accomplish many different goals.
| Use Case | Why Use Evals This Way? | 
|---|---|
| Online Evals | Continuously track agent performance in real-time to alert on quality degradation, unusual patterns, or system failures and take automated actions. | 
| A/B Testing | Compare different agent versions or configurations to make data-driven decisions about which approach performs better on your key metrics. See how your agent is improving (or regressing) over time. | 
| Unit Testing | Catch regressions early in development by testing specific agent behaviors against predefined tasks. Ensures code changes (e.g. prompt, tool, model updates) don't break existing functionality. | 
| Optimization Datasets | Create high-quality post-training data by using evals to filter and score agent outputs, which can then be used for fine-tuning or reinforcement learning. For instance, you can separate successful and failed agent traces to create datasets for supervised and reinforcement learning. | 
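For example, the unit-testing row can be wired straight into CI. The sketch below assumes a `run_agent` entry point and reuses the `llm_judge_score` helper from earlier; both are placeholders standing in for your own agent and scorer.

```python
import pytest

from my_agent import run_agent        # placeholder: your agent's entry point
from my_evals import llm_judge_score  # placeholder: any scorer returning a 0-1 score

REGRESSION_CASES = [
    ("Cancel my subscription and confirm by email.", 0.8),
    ("What's the refund policy for annual plans?", 0.8),
]

@pytest.mark.parametrize("prompt,threshold", REGRESSION_CASES)
def test_agent_meets_quality_bar(prompt: str, threshold: float):
    """Fail the build if a prompt, tool, or model change drops quality below the bar."""
    output = run_agent(prompt)
    score = llm_judge_score(user_input=prompt, agent_output=output)
    assert score >= threshold, f"Score {score:.2f} below {threshold} for: {prompt}"
```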
Learn more
To learn more about implementing evals in judgeval, check out the rest of our docs. For a deep dive into evals, see our evaluation feature section.