Evaluation
Foundations of Agent Development
AI engineers can make countless tweaks to agent design, but how do they know which changes actually improve agent performance? Every prompt change, tool addition, and model selection can significantly impact agent quality, for better or worse. Evals answer this question by giving you systematic, repeatable measurements of agent behavior, so you can quantify the impact of each change.
Decide what to measure
In most cases, the best evaluation targets are the pain points that appear most frequently—or most severely—in your agent's behavior. These often fall into one of three categories, each sketched as a crude scoring function after the list:
- Correctness: Is the agent producing factually accurate or logically sound responses?
- Goal completion: Is the agent successfully completing the task it was designed to handle?
- Task alignment: Is the agent following instructions, using tools appropriately, or responding in a way that's helpful and contextually aware?
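To make these categories concrete, here is a minimal, illustrative sketch in plain Python (not judgeval's API). The `AgentRun` container and all three checks are hypothetical stand-ins for whatever signals your agent actually exposes.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRun:
    """Hypothetical container for one agent interaction."""
    user_request: str
    final_answer: str
    tools_called: list[str] = field(default_factory=list)


def correctness(run: AgentRun, reference_answer: str) -> float:
    """Crude correctness check: exact match against a known-good reference answer."""
    return 1.0 if run.final_answer.strip() == reference_answer.strip() else 0.0


def goal_completion(run: AgentRun, required_phrase: str) -> float:
    """Crude goal-completion check: does the answer contain the expected outcome?"""
    return 1.0 if required_phrase.lower() in run.final_answer.lower() else 0.0


def task_alignment(run: AgentRun, allowed_tools: set[str]) -> float:
    """Fraction of tool calls that stayed within the tools the task allows."""
    if not run.tools_called:
        return 1.0
    return sum(t in allowed_tools for t in run.tools_called) / len(run.tools_called)


run = AgentRun(
    user_request="Cancel my order #123",
    final_answer="Your order #123 has been cancelled and refunded.",
    tools_called=["lookup_order", "cancel_order"],
)
print(goal_completion(run, "cancelled"))                      # 1.0
print(task_alignment(run, {"lookup_order", "cancel_order"}))  # 1.0
```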
Select your eval metrics
Once you've identified the behaviors that matter, you can choose a pre-built scorer or design custom evals that surface meaningful signals on those behaviors.
Try built-in evals
We've built several plug-and-play evals for common cases such as instruction adherence, hallucination, and tool selection/parameter accuracy.
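For example, a hallucination-style check can be run with a built-in scorer. The snippet below follows the pattern from judgeval's public README (JudgmentClient, Example, FaithfulnessScorer); treat the exact class names, parameters, and model string as assumptions and check the current docs before copying.

```python
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

client = JudgmentClient()  # reads your Judgment API key from the environment

# One evaluation example: the agent's answer plus the context it was grounded in.
example = Example(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    retrieval_context=["All customers get a full refund within 30 days at no extra cost."],
)

# FaithfulnessScorer flags claims in the output that are not supported by the context.
results = client.run_evaluation(
    examples=[example],
    scorers=[FaithfulnessScorer(threshold=0.5)],
    model="gpt-4o",  # judge model; swap in whichever model you use
)
print(results)
```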
Eval Variants
Generally, there are three types of evaluation mechanisms: LLM-as-judge, code, and annotations.
| Eval Type | How it works | Use cases |
|---|---|---|
| LLM-as-judge | Uses an LLM or system of agents to evaluate and score outputs against defined criteria. | Great for subjective quality or well-defined objective assessments (tone, instruction adherence, hallucination). Poor for vague preferences or tasks requiring deep subject-matter expertise. |
| Code | Algorithms compute scores based on rules and patterns (see the example after this table). | Great for reducing cost and latency, and for environment-based measurements like tool execution, exceptions, etc. Poor for qualitative measurements such as summarization quality. |
| Annotations | Humans provide custom labels on agent traces. | Great for subject-matter expertise, direct application feedback, and "feels right" assessments. Poor for large-scale, cost-effective, or time-sensitive evaluations. |
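As a concrete illustration of the Code row, here is a small rule-based scorer written in plain Python (not a judgeval built-in); the trace format is hypothetical, but the idea carries over to any structured agent trace.

```python
def score_tool_execution(trace: list[dict]) -> float:
    """Return the fraction of tool calls in a trace that completed without an exception."""
    tool_calls = [step for step in trace if step.get("type") == "tool_call"]
    if not tool_calls:
        return 0.0  # no tool usage at all counts as a failure for this metric
    ok = sum(1 for call in tool_calls if call.get("error") is None)
    return ok / len(tool_calls)


trace = [
    {"type": "llm_call", "prompt": "Plan the steps"},
    {"type": "tool_call", "name": "web_search", "error": None},
    {"type": "tool_call", "name": "calculator", "error": "ZeroDivisionError"},
]
print(score_tool_execution(trace))  # 0.5: one of the two tool calls raised
```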
Building your own evals
Perhaps you're working in a novel domain, have unique task definitions, or need to evaluate agent behavior against proprietary rules. In these cases, building your own evals is the best way to ensure you're measuring what matters.
Judgment's custom evals module allows you to define:
- What counts as a success or failure, using your own criteria.
- What data to evaluate—a specific step or an entire agent trajectory.
- Whether to score results via heuristics, LLM-as-a-judge, or human annotation.
In judgeval, you can build custom evals via:
- Custom scorers: powerful and flexible; define your own scoring logic in code, with LLMs, or a combination of both (see the sketch after this list).
- Classifier scorers: lightweight, simple LLM-as-judge scorers that classify outputs according to natural language criteria.
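The sketch below shows the kind of logic a custom scorer typically encapsulates: a deterministic heuristic combined with an LLM judge, compared against a pass/fail threshold. It is plain Python rather than judgeval's scorer interface, and the `llm_judge` helper is a hypothetical stand-in for a call to your model provider.

```python
import re


def llm_judge(prompt: str) -> float:
    """Hypothetical helper: ask an LLM for a 0-1 quality score."""
    raise NotImplementedError("wire this up to your model provider")


def score_support_reply(user_msg: str, agent_reply: str, threshold: float = 0.7) -> bool:
    # Heuristic check: every reply must reference a ticket ID like "SUP-1234".
    has_ticket = bool(re.search(r"\b[A-Z]{2,5}-\d{3,6}\b", agent_reply))

    # LLM-judge check: subjective helpfulness and tone.
    quality = llm_judge(
        "Rate from 0 to 1 how helpfully and politely this reply answers the question.\n"
        f"Question: {user_msg}\nReply: {agent_reply}\nRespond with only the number."
    )

    # Combine both signals and convert to a pass/fail decision.
    score = 0.4 * has_ticket + 0.6 * quality
    return score >= threshold
```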
What should I use evals for?
Once you've selected or built your evals, you can use them to accomplish many different goals.
| Use Case | Why Use Evals This Way? |
|---|---|
| Unit Testing | Catch regressions early in development by testing specific agent behaviors against predefined tasks (see the example after this table). Ensures code changes (e.g. prompt, tool, model updates) don't break existing functionality. |
| Online Evals | Continuously track agent performance in real time to alert on quality degradation, unusual patterns, or system failures, and take automated actions. |
| A/B Testing | Compare different agent versions or configurations to make data-driven decisions about which approach performs better on your key metrics. See how your agent is improving (or regressing) over time. |
| Optimization Datasets | Create high-quality post-training data by using evals to filter and score agent outputs, which can then be used for fine-tuning or reinforcement learning. For instance, you can separate successful and failed agent traces to create datasets for supervised and reinforcement learning. |
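For the unit-testing row, a regression suite can be as simple as a pytest file that runs the agent on fixed tasks and asserts on an eval score. Here, `run_agent` is a hypothetical entry point for your own agent, and the keyword check is a stand-in for whichever scorer you actually use.

```python
import pytest

from my_agent import run_agent  # hypothetical import: your agent's entry point

# Fixed tasks with the behavior we expect; extend this list as new regressions are found.
FIXED_TASKS = [
    ("Refund the duplicate charge on order #1001", "refund"),
    ("Summarize the attached meeting notes in three bullet points", "summary"),
]


@pytest.mark.parametrize("task,expected_keyword", FIXED_TASKS)
def test_agent_goal_completion(task, expected_keyword):
    answer = run_agent(task)
    assert expected_keyword in answer.lower(), (
        f"Agent no longer addresses the goal for task: {task!r}"
    )
```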
Learn more
To learn more about implementing evals in judgeval, check out our other docs. For a deep dive into evals, see the evaluation feature section.