Evaluation
Foundations of Agent Development
AI engineers can make countless tweaks to agent design, but how do they know whether those changes actually improve agent performance? Every prompt change, new tool, retrieval modification, and model selection can significantly impact agent quality, for better or worse.
Evals help AI engineers assess how modifications to agent design affect quality, enabling them to make data-driven decisions.
Decide what to measure
In most cases, the best evaluation targets are the pain points that appear most frequently—or most severely—in your agent's behavior. These often fall into one of three categories:
- Correctness: Is the agent producing factually accurate or logically sound responses?
- Goal completion: Is the agent successfully completing the task it was designed to handle?
- Task alignment: Is the agent following instructions, using tools appropriately, or responding in a way that's helpful and contextually aware?
To identify where to focus your evals:
- Trace your failures: Use logs or user feedback to identify common failure modes (e.g. "agent forgets prior instructions" or "agent ignores context information"). On large collections of traces, clustering can surface these patterns (see the sketch after this list).
- Map your issues: Group issues by quality vector (e.g. hallucination, tool misuse, low recall) to narrow down which metrics will be most informative.
- Decide eval granularity: Do you need to measure performance at the step level (e.g. instruction following in a single LLM response) or across the entire task?
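For example, a quick way to surface recurring failure modes is to cluster free-text failure notes pulled from your logs. The sketch below is a minimal illustration using scikit-learn's TfidfVectorizer and KMeans; the notes and the cluster count are placeholders you would replace with your own data, and embedding-based clustering will usually work better than TF-IDF on real traces.

```python
# Minimal sketch: cluster free-text failure notes to surface common failure modes.
# The notes below are illustrative placeholders; pull real ones from your logs.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

failure_notes = [
    "agent forgets prior instructions after the first tool call",
    "agent ignores context passed in the system prompt",
    "tool called with wrong argument types",
    "agent forgets earlier user constraints",
    "retrieval returns documents unrelated to the query",
]

# Represent each note with TF-IDF features and group them into a few clusters.
vectors = TfidfVectorizer(stop_words="english").fit_transform(failure_notes)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)

for cluster_id in sorted(set(labels)):
    print(f"Cluster {cluster_id}:")
    for note, label in zip(failure_notes, labels):
        if label == cluster_id:
            print(f"  - {note}")
```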
Select your eval metrics
Once you've identified the behaviors that matter, you can choose or design evals that surface meaningful signals on those behaviors.
We've constructed several out-of-the-box evals for common evaluation cases such as instruction adherence, hallucination, and tool selection/parameter accuracy.
Our out-of-the-box evals are organized by the agent module they're most relevant to:
- Planning
- Tool Calling
- General agent abilities
- Memory
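As a rough illustration of how one of these out-of-the-box evals is wired up, the hedged sketch below scores a single example with a faithfulness-style (hallucination) scorer. The names JudgmentClient, Example, FaithfulnessScorer, and run_evaluation follow judgeval's documented usage pattern, but treat them as assumptions and confirm the exact imports and signatures against the judgeval reference docs for your version.

```python
# Hedged sketch: running a built-in faithfulness (hallucination) eval on one example.
# Class and method names are assumptions; verify them against the judgeval docs.
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

client = JudgmentClient()  # assumes your Judgment API key is set in the environment

example = Example(
    input="What is our refund window?",
    actual_output="Refunds are accepted within 30 days of purchase.",
    retrieval_context=["Our policy allows refunds within 30 days of purchase."],
)

results = client.run_evaluation(
    examples=[example],
    scorers=[FaithfulnessScorer(threshold=0.9)],
    model="gpt-4o",  # judge model; swap in whichever model you use
)
print(results)
```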
Eval variants
There are three types of evaluators you can select or build—LLM judge, code, and annotations. Each has its strengths and weaknesses, depending on what you're attempting to measure.
Eval type | How it works | Ideal use cases |
---|---|---|
LLM-as-judge | Uses an LLM (or a system of agents) to evaluate and score outputs against defined criteria. | Great for subjective quality and well-defined objective assessments (tone, instruction adherence, hallucination). Poor for vague preferences or tasks that require deep subject-matter expertise. |
Code | Algorithms compute scores based on rules and patterns. | Great for reducing cost and latency, and for environment-based measurements such as tool execution and exceptions. Poor for qualitative measurements such as summarization quality. |
Annotations | Humans apply custom labels to agent traces. | Great for subject-matter expertise, direct application feedback, and "feels right" assessments. Poor for large-scale, cost-sensitive, or time-sensitive evaluation. |
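To make the code-based row concrete, here is a minimal, self-contained sketch of a rule-based scorer that checks whether a tool call's arguments parse as JSON and include every required parameter. It is plain Python rather than a judgeval API, and the tool name, required parameters, and call shape are assumptions for the example.

```python
# Minimal sketch of a code-based evaluator: score a tool call by whether its
# arguments parse as JSON and contain all required parameters.
import json

REQUIRED_PARAMS = {
    "search_flights": {"origin", "destination", "date"},  # hypothetical tool
}

def score_tool_call(tool_name: str, raw_arguments: str) -> float:
    """Return 1.0 if the call is well-formed and complete, else 0.0."""
    try:
        arguments = json.loads(raw_arguments)
    except json.JSONDecodeError:
        return 0.0  # malformed arguments
    if not isinstance(arguments, dict):
        return 0.0  # arguments must be a JSON object
    missing = REQUIRED_PARAMS.get(tool_name, set()) - arguments.keys()
    return 0.0 if missing else 1.0

# One complete call and one that is missing required parameters.
print(score_tool_call("search_flights", '{"origin": "SFO", "destination": "JFK", "date": "2025-06-01"}'))  # 1.0
print(score_tool_call("search_flights", '{"origin": "SFO"}'))  # 0.0
```

Because this scorer runs without any model calls, it is cheap enough to apply to every trace.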
Building your own evals
While we recommend starting with our built-in, plug-and-play evals, there will be times when your use case requires something custom. Perhaps you're working in a novel domain, have unique task definitions, or need to evaluate agent behavior against proprietary rules. In these cases, building your own evals is the best way to ensure you're measuring what matters.
Judgment's custom evals module allows you to define:
- What counts as a success or failure, using your own criteria.
- What data to evaluate—a specific step, entire agent trajectory, or even external influences.
- How to score results, whether via heuristics, LLM-as-a-judge, or human annotation.
In the judgeval library, you can build custom evals via:
- Custom scorers: Powerful and flexible; define your own scoring logic in code, with LLMs, or a combination of both.
- Classifier scorers: Lightweight LLM-as-judge scorers that classify outputs according to natural-language criteria.
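As a shape-only illustration of a custom scorer, the sketch below scores a response against a word budget and returns a score, a pass/fail flag, and a reason. The class and method names here (ConcisenessScorer, score_example, ScoreResult) are hypothetical stand-ins rather than judgeval's actual scorer interface; see the judgeval custom scorer docs for the real base class to subclass. A classifier scorer follows the same pattern but delegates the judgment to an LLM prompt built from your natural-language criteria.

```python
# Hedged sketch of a custom scorer's general shape. The class and method names
# are hypothetical stand-ins, not judgeval's actual scorer interface.
from dataclasses import dataclass

@dataclass
class ScoreResult:
    score: float
    passed: bool
    reason: str

class ConcisenessScorer:
    """Scores an agent response by how well it stays within a word budget."""

    def __init__(self, max_words: int = 150, threshold: float = 0.7):
        self.max_words = max_words
        self.threshold = threshold

    def score_example(self, actual_output: str) -> ScoreResult:
        word_count = len(actual_output.split())
        overshoot = max(0, word_count - self.max_words)
        score = max(0.0, 1.0 - overshoot / self.max_words)
        return ScoreResult(
            score=score,
            passed=score >= self.threshold,
            reason=f"{word_count} words against a budget of {self.max_words}",
        )

# Usage: plug the scorer into your own eval loop or wrap it for your framework.
result = ConcisenessScorer().score_example(
    "The flight departs SFO at 9:05 AM and lands at JFK at 5:40 PM."
)
print(result)
```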
What should I use evals for?
Once you've selected or built your evals, you can use them to accomplish many different goals.
Use case | Why use evals this way? |
---|---|
Unit testing | Catch regressions early in development by testing specific agent behaviors against predefined tasks. Ensures code changes (e.g. prompt, tool, or model updates) don't break existing functionality. |
Online evals | Continuously track agent performance in real time to alert on quality degradation, unusual patterns, or system failures, and trigger automated actions. |
A/B testing | Compare different agent versions or configurations to make data-driven decisions about which approach performs better on your key metrics, and see how your agent is improving (or regressing) over time. |
Optimization datasets | Create high-quality post-training data by using evals to filter and score agent outputs for fine-tuning or reinforcement learning. For instance, you can separate successful and failed agent traces into datasets for supervised and reinforcement learning. |
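As an illustration of the unit-testing row, the minimal sketch below gates a test suite on an eval score so a prompt or model change that drops quality below a threshold fails in CI. The run_agent function, the toy keyword-coverage eval, the prompt, and the 0.8 threshold are all placeholder assumptions; in practice you would call your agent and one of the scorers described above.

```python
# Hedged sketch: gating CI on an eval score with pytest.
# run_agent, the keyword eval, and the 0.8 threshold are placeholders.
import pytest

def run_agent(prompt: str) -> str:
    # Placeholder: call your agent here.
    return "Refunds are accepted within 30 days of purchase."

def keyword_coverage(output: str, required_keywords: list[str]) -> float:
    """Toy eval: fraction of required keywords present in the output."""
    hits = sum(1 for kw in required_keywords if kw.lower() in output.lower())
    return hits / len(required_keywords)

@pytest.mark.parametrize(
    "prompt,keywords",
    [("What is the refund window?", ["refund", "30 days"])],
)
def test_refund_answer_quality(prompt, keywords):
    output = run_agent(prompt)
    assert keyword_coverage(output, keywords) >= 0.8
```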
Learn more
To learn more about implementing evals in judgeval, check out our other docs. For a deep dive, see our evaluation feature section.