Evaluating AI Agents
Understanding AI Agents, and Why Evaluation Is Essential
AI Agent Evaluation
Building reliable AI agents is hard because their executions are brittle, non-deterministic, and multi-step. Poor upstream decisions lead to downstream failures, and a change to any single component can have a cascading effect on the agent's behavior.

Agents have increased complexity because they must plan their execution, select the proper tools, and execute them in an order that is both efficient and effective. They must also reason over their state using memory and retrieval to make meaningful decisions on their execution path based on new information.
To evaluate agents, we must collect data from each component of the system and test each one individually, in addition to determining whether the agent achieved its end goal.
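One way to make that data collectable is to log each step of a run as a structured record. The sketch below is a minimal, hypothetical trace schema in Python; the `StepRecord` and `Trajectory` names and fields are illustrative, not any particular framework's API:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class StepRecord:
    """One step in an agent trajectory: what was decided, with what inputs, and what came back."""
    component: str              # e.g. "planner", "tool_call", "memory_write", "reflection"
    inputs: dict[str, Any]      # arguments or prompt context the component received
    output: Any                 # what the component produced
    error: str | None = None    # populated when the step failed

@dataclass
class Trajectory:
    """A full agent run, evaluated both step by step and against the end goal."""
    task: str
    steps: list[StepRecord] = field(default_factory=list)
    final_response: str | None = None

    def log(self, component: str, inputs: dict[str, Any], output: Any, error: str | None = None) -> None:
        self.steps.append(StepRecord(component, inputs, output, error))
```

With a trajectory like this in hand, each of the component-level checks below can be run over the relevant `StepRecord` entries, and the final response can be scored against the original task.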
Planning
When an agent receives a query, it must first determine what to do. One common approach is to ask the LLM at every step: given its inputs and memory, the model decides which action to take next.

To evaluate a planner, we need to check whether it is selecting the correct next nodes. Agents can call tools, invoke other agents, or respond directly, so different branching paths should be accounted for.
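A simple offline check is to replay queries whose correct next step is known and measure how often the planner picks it. Below is a minimal sketch; `plan_next_step` is a hypothetical stand-in for however your agent chooses its next node, and the case data is invented for illustration:

```python
# Hypothetical test cases: each query has a known-correct next node
# ("tool:<name>", "agent:<name>", or "respond").
planner_cases = [
    {"query": "What's the weather in Paris tomorrow?", "expected": "tool:weather_api"},
    {"query": "Summarize our conversation so far.", "expected": "respond"},
    {"query": "Book the cheapest flight to Tokyo.", "expected": "agent:booking_agent"},
]

def evaluate_planner(plan_next_step, cases) -> float:
    """Fraction of cases where the planner picks the expected next node."""
    correct = 0
    for case in cases:
        choice = plan_next_step(case["query"])  # the planner's decision for this query
        if choice == case["expected"]:
            correct += 1
    return correct / len(cases)
```

Exact-match scoring works when the branching options are a small, fixed set; for free-form plans you would need a fuzzier comparison, such as an LLM judge.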
Tool Calling
Tool calling forms the core of agentic behavior, enabling LLMs to interact with the world through external APIs and processes, call self-written functions, and invoke other agents. However, the flexibility of tool calling introduces failure points in tool selection, parameter choice, and tool execution itself.

To evaluate tool calling, we need to check whether the agent is selecting the correct tools and parameters, as well as whether the tool is executed successfully.
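Assuming your agent logs each tool call as a record with its name, arguments, and any error, one approach is to score the three failure points separately. The record layout and field names below are assumptions for illustration:

```python
def score_tool_call(call: dict, expected: dict) -> dict:
    """Score one logged tool call on selection, parameters, and execution.

    `call` is assumed to look like {"name": ..., "arguments": {...}, "error": None or str};
    adapt the keys to however your agent logs tool use.
    """
    correct_tool = call["name"] == expected["name"]
    # Parameter matching here checks only the expected fields; relax or tighten as needed.
    correct_params = correct_tool and all(
        call["arguments"].get(k) == v for k, v in expected["arguments"].items()
    )
    executed_ok = call.get("error") is None
    return {
        "tool_selection": correct_tool,
        "parameter_match": correct_params,
        "execution_success": executed_ok,
    }

# Example:
# score_tool_call(
#     {"name": "weather_api", "arguments": {"city": "Paris", "days": 1}, "error": None},
#     {"name": "weather_api", "arguments": {"city": "Paris"}},
# )
# -> {"tool_selection": True, "parameter_match": True, "execution_success": True}
```

Keeping the three scores separate matters: an agent that picks the right tool but passes malformed parameters needs a different fix than one that never reaches for the tool at all.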
Abilities
Abilities are specialized capabilities or modules that extend the agent's base functionality. They can be implemented as internal functions, scripts, or even as wrappers around tool-calls, but are often more tightly integrated into the agent's architecture.

Flowchart of abilities for a travel agent's itinerary generation trajectory.
You can evaluate abilities by checking the quality of the output, assuming they were invoked correctly (i.e. selected properly and called with the right input). Some examples of metrics you can use are:
- Hallucination detection for agent responses
- Instruction following for agent responses
- Comparison with ground truth responses for any task (a minimal sketch follows this list)
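For the ground-truth comparison in particular, a very rough sketch is below; the string-similarity check is only a stand-in, and in practice you would likely swap it for an LLM judge or embedding similarity on free-form outputs:

```python
from difflib import SequenceMatcher

def score_against_ground_truth(output: str, reference: str, threshold: float = 0.8) -> dict:
    """Crude ground-truth comparison for an ability's output.

    SequenceMatcher gives a character-level similarity in [0, 1]; the threshold
    is an arbitrary example value, not a recommendation.
    """
    similarity = SequenceMatcher(None, output.strip().lower(), reference.strip().lower()).ratio()
    return {"similarity": similarity, "passed": similarity >= threshold}
```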
Memory
Agent memory enables agents to retain and recall information during an interaction or across multiple trajectories. This can include user preferences, task-specific tips, or past successful runs that can help performance.
Agents can perform CRUD operations on memory during the course of an interaction. Each of these operations influences the agent's behavior and should be monitored and evaluated independently.

Tracking memory read/write operations can help you understand how your agent uses memory in response to edge cases and familiar tasks.
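One lightweight way to get that visibility is to wrap your memory store so every operation is logged alongside the rest of the trajectory. The class below is an illustrative sketch around a plain dictionary, not a specific memory backend:

```python
class InstrumentedMemory:
    """Key-value memory store that logs every CRUD operation for later evaluation."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.operations: list[dict] = []   # inspect this next to the agent's trajectory

    def _log(self, op: str, key: str, value: str | None = None) -> None:
        self.operations.append({"op": op, "key": key, "value": value})

    def create(self, key: str, value: str) -> None:
        self._store[key] = value
        self._log("create", key, value)

    def read(self, key: str) -> str | None:
        self._log("read", key)
        return self._store.get(key)

    def update(self, key: str, value: str) -> None:
        self._store[key] = value
        self._log("update", key, value)

    def delete(self, key: str) -> None:
        self._store.pop(key, None)
        self._log("delete", key)
```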
Reflection
After a subtask is complete or a response is generated, it can be helpful to query the agent to reflect on the output and whether it accomplished its goal. If it failed, the agent can re-attempt the task using new context informed by its original mistakes.
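A reflection loop of that kind can be sketched as follows; `attempt_task` and `critique` are hypothetical hooks into your agent, with `critique` assumed to return a pass/fail verdict plus notes:

```python
def run_with_reflection(attempt_task, critique, task: str, max_retries: int = 2) -> str:
    """Generate a response, ask the agent to critique it, and retry with that
    feedback if the critique says the goal was not met."""
    feedback = None
    response = attempt_task(task, feedback)
    for _ in range(max_retries):
        passed, notes = critique(task, response)
        if passed:
            break
        feedback = notes  # feed the agent's own critique into the next attempt
        response = attempt_task(task, feedback)
    return response
```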

For example, you can use our faithfulness scorer to check if the agent's response is factually aligned with the retrieval context/memory. Another use case is leveraging the instruction adherence scorer to check if the agent's output aligns with the original task rules/instructions.
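As a rough illustration of what a faithfulness check does conceptually (this is not the scorer's actual API), you can frame it as asking a judge model whether every claim in the response is supported by the context the agent had; `judge_llm` below is a hypothetical callable that takes a prompt string and returns the judge model's text output:

```python
FAITHFULNESS_PROMPT = """You are grading an agent's response for faithfulness.
Context the agent had access to:
{context}

Agent response:
{response}

Answer "yes" if every factual claim in the response is supported by the context,
otherwise answer "no" and list the unsupported claims."""

def check_faithfulness(judge_llm, context: str, response: str) -> bool:
    """Return True if the judge model says the response is grounded in the context."""
    verdict = judge_llm(FAITHFULNESS_PROMPT.format(context=context, response=response))
    return verdict.strip().lower().startswith("yes")
```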