Why Evaluate AI Agents?
Understanding Why Evaluation Is Essential for Non-Deterministic and Stateful AI Agents
AI Agent Evaluation
AI agents are non-deterministic and stateful systems that present unique evaluation challenges:
Non-deterministic behavior means agents make dynamic decisions at each step:
- Which tools to call from available options
- When to retrieve or update memory
- How to route between different execution paths
Stateful behavior means agents maintain and evolve context over time:
- Short-term memory: Conversation history and task context within a session
- Long-term memory: User preferences, past interactions, and learned patterns across sessions
Building reliable AI agents is hard because of the brittle, non-deterministic, multi-step nature of their execution. Poor upstream decisions can lead to downstream failures, so a change to any agent component can have a cascading effect on the agent's behavior.

Agents have increased complexity because they must plan their execution, select the proper tools, and execute them in an order that is both efficient and effective. They must also reason over their state using memory and retrieval to make meaningful decisions on their execution path based on new information.
Agent Planning
When an agent receives a query, it must first determine what to do. One approach is to ask the LLM at every step: given the current inputs and memory, decide the next action.

To evaluate a planner, we need to check whether it is selecting the correct next nodes. Agents can call tools, invoke other agents, or respond directly, so different branching paths should be accounted for.
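One way to evaluate a planner is to score its next-node choices against a set of labeled examples. A minimal sketch follows; the `planner` callable is a hypothetical stand-in for your agent's routing step (in a real system it would call an LLM), and the node names and eval cases are illustrative.

```python
# Minimal sketch: score a planner's next-node choices against labeled examples.
from typing import Callable

# Each case pairs an input query with the node a human labeled as correct.
EVAL_CASES = [
    {"query": "What's the weather in Paris?", "expected_node": "call_tool:weather"},
    {"query": "Thanks, that's all!", "expected_node": "respond"},
    {"query": "Book me a flight and a hotel.", "expected_node": "invoke_agent:travel"},
]

def evaluate_planner(planner: Callable[[str], str]) -> float:
    """Return the fraction of cases where the planner picks the labeled node."""
    correct = sum(planner(c["query"]) == c["expected_node"] for c in EVAL_CASES)
    return correct / len(EVAL_CASES)

# A toy rule-based planner standing in for the LLM call.
def toy_planner(query: str) -> str:
    if "weather" in query.lower():
        return "call_tool:weather"
    if "flight" in query.lower() or "hotel" in query.lower():
        return "invoke_agent:travel"
    return "respond"

print(f"planner accuracy: {evaluate_planner(toy_planner):.2f}")  # → 1.00
```

The same harness extends to multi-step trajectories by labeling the expected node at each step rather than only the first.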
Tool Calling
Tool calling forms the core of agentic behavior, enabling LLMs to interact with the world by calling external APIs and processes, running self-written functions, and invoking other agents. However, this flexibility introduces failure points in tool selection, parameter choice, and the tool execution itself.

To evaluate tool calling, we need to check whether the agent is selecting the correct tools and parameters, as well as whether the tool is executed successfully.
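Those three checks can be run per call against a reference example. The sketch below assumes a simple JSON tool-call format and a hypothetical `get_weather` tool; it is not tied to any specific framework's API.

```python
# Minimal sketch: check a single tool call for (1) correct tool selection,
# (2) correct parameters, and (3) successful execution.
import json

def evaluate_tool_call(actual_call: dict, expected_call: dict, tool_registry: dict) -> dict:
    """Compare an agent's tool call to a reference call, then try executing it."""
    result = {
        "correct_tool": actual_call["name"] == expected_call["name"],
        "correct_params": actual_call["arguments"] == expected_call["arguments"],
        "executed_ok": False,
    }
    tool = tool_registry.get(actual_call["name"])
    if tool is not None:
        try:
            tool(**actual_call["arguments"])
            result["executed_ok"] = True
        except Exception:
            pass  # record the execution failure instead of raising
    return result

# Hypothetical tool and a reference (expected) call for one eval case.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

registry = {"get_weather": get_weather}
expected = {"name": "get_weather", "arguments": {"city": "Paris"}}

# Suppose the agent emitted this call (parsed from the LLM's JSON output):
agent_call = json.loads('{"name": "get_weather", "arguments": {"city": "Paris"}}')
print(evaluate_tool_call(agent_call, expected, registry))
# → {'correct_tool': True, 'correct_params': True, 'executed_ok': True}
```

Separating the three checks matters: an agent can pick the right tool with wrong parameters, or produce a perfectly formed call that fails at execution time, and each failure mode needs a different fix.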
Agent Abilities
Abilities are specialized capabilities or modules that extend the agent's base functionality. They can be implemented as internal functions, scripts, or even as wrappers around tool-calls, but are often more tightly integrated into the agent's architecture.

Flowchart of abilities for a travel agent's itinerary generation trajectory.
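To make the distinction concrete, here is a minimal sketch of an ability as a stateful module the agent owns, rather than a stateless external tool call. The `ItineraryAbility` name and its methods are illustrative, loosely following the travel-agent example above.

```python
# Minimal sketch: an "ability" as a tightly integrated, stateful module,
# in contrast to a stateless external tool call. Names are illustrative.
from dataclasses import dataclass, field

@dataclass
class ItineraryAbility:
    """Holds its own state and exposes internal logic the agent calls directly."""
    stops: list = field(default_factory=list)

    def add_stop(self, city: str, nights: int) -> None:
        self.stops.append({"city": city, "nights": nights})

    def render(self) -> str:
        # Pure internal computation: no external API or LLM call needed.
        return " -> ".join(f"{s['city']} ({s['nights']}n)" for s in self.stops)

ability = ItineraryAbility()
ability.add_stop("Paris", 3)
ability.add_stop("Rome", 2)
print(ability.render())  # → Paris (3n) -> Rome (2n)
```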
Agent Memory
Agent memory enables agents to retain and recall information during an interaction or across multiple trajectories. This can include user preferences, task-specific tips, or past successful runs that can help performance.
Agents can perform CRUD (create, read, update, delete) operations on memory during the course of an interaction. Each of these operations influences the agent's behavior and should be monitored and evaluated independently.

Tracking memory read/write operations can help you understand how your agent uses memory in response to edge cases and familiar tasks.
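A simple way to get that visibility is to wrap the memory store so every CRUD operation is logged for later analysis. The sketch below uses an in-memory dict; the class and key names are illustrative.

```python
# Minimal sketch: a memory store that logs every CRUD operation so reads and
# writes can be monitored and evaluated independently.
class LoggedMemory:
    def __init__(self):
        self._store = {}
        self.log = []  # (operation, key) tuples for later analysis

    def create(self, key, value):
        self.log.append(("create", key))
        self._store[key] = value

    def read(self, key):
        self.log.append(("read", key))
        return self._store.get(key)

    def update(self, key, value):
        self.log.append(("update", key))
        self._store[key] = value

    def delete(self, key):
        self.log.append(("delete", key))
        self._store.pop(key, None)

memory = LoggedMemory()
memory.create("user_pref:seat", "window")
memory.read("user_pref:seat")
memory.update("user_pref:seat", "aisle")
print(memory.log)
# → [('create', 'user_pref:seat'), ('read', 'user_pref:seat'), ('update', 'user_pref:seat')]
```

With the log in hand, you can ask evaluation questions directly: did the agent read a relevant memory before acting, and did it write back anything it should have retained?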
Agentic Reflection
After a subtask is complete or a response is generated, it can be helpful to query the agent to reflect on the output and whether it accomplished its goal. If it failed, the agent can re-attempt the task using new context informed by its original mistakes.
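That reflect-and-retry loop can be sketched as follows. Here `attempt_task` and `check_goal` are hypothetical stand-ins for LLM calls: one produces an output given accumulated reflections, the other judges the output and returns a critique on failure.

```python
# Minimal sketch of a reflect-and-retry loop: after each attempt, a checker
# judges the output; on failure, the critique is fed back as new context.
def reflect_and_retry(task: str, max_attempts: int = 3) -> str:
    context = []  # accumulated reflections from failed attempts
    for _ in range(max_attempts):
        output = attempt_task(task, context)
        ok, critique = check_goal(task, output)
        if ok:
            return output
        context.append(critique)  # reflect: carry the mistake into the next try
    return output  # best effort after exhausting retries

# Toy stand-ins: the first attempt omits a required element; the critique fixes it.
def attempt_task(task, context):
    return "summary + sources" if context else "summary"

def check_goal(task, output):
    if "sources" not in output:
        return False, "Response must cite sources."
    return True, ""

print(reflect_and_retry("Summarize the report"))  # → "summary + sources"
```

Capping `max_attempts` matters in practice: reflection trades extra latency and token cost for reliability, so an unbounded loop can stall on tasks the agent cannot actually complete.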
