Evaluating AI Agents
Understanding AI Agents, and Why Evaluation Is Essential
AI Agent Evaluation
Building reliable AI agents is hard because their executions are brittle, non-deterministic, and multi-step. Poor upstream decisions lead to downstream failures, and a change to any single component can have a cascading effect on the agent's behavior.

Agents have increased complexity because they must plan their execution, select the proper tools, and execute them in an order that is both efficient and effective. They must also reason over their state using memory and retrieval to make meaningful decisions on their execution path based on new information.
To evaluate agents, we must collect data from each component of the system and test each one individually, in addition to determining whether the agent achieved its end goal.
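One way to make that data collectable is to log each step of a run as a structured record. The sketch below is a minimal, hypothetical trace schema in Python; the `StepRecord` and `Trajectory` names and fields are illustrative, not any particular framework's API:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class StepRecord:
    """One step in an agent trajectory: what was decided, with what inputs, and what came back."""
    component: str              # e.g. "planner", "tool_call", "memory_write", "reflection"
    inputs: dict[str, Any]      # arguments or prompt context the component received
    output: Any                 # what the component produced
    error: str | None = None    # populated when the step failed

@dataclass
class Trajectory:
    """A full agent run, evaluated both step by step and against the end goal."""
    task: str
    steps: list[StepRecord] = field(default_factory=list)
    final_response: str | None = None

    def log(self, component: str, inputs: dict[str, Any], output: Any, error: str | None = None) -> None:
        self.steps.append(StepRecord(component, inputs, output, error))
```

With a trajectory like this in hand, each of the component-level checks below can be run over the relevant `StepRecord` entries, and the final response can be scored against the original task.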
Planning
When an agent receives a query, it must first determine what to do. One common approach is to ask the LLM at every step: given its inputs and memory, the model decides which action to take next.

To evaluate a planner, we need to check whether it is selecting the correct next nodes. Agents can call tools, invoke other agents, or respond directly, so different branching paths should be accounted for.
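A simple offline check is to replay queries whose correct next step is known and measure how often the planner picks it. Below is a minimal sketch; `plan_next_step` is a hypothetical stand-in for however your agent chooses its next node, and the case data is invented for illustration:

```python
# Hypothetical test cases: each query has a known-correct next node
# ("tool:<name>", "agent:<name>", or "respond").
planner_cases = [
    {"query": "What's the weather in Paris tomorrow?", "expected": "tool:weather_api"},
    {"query": "Summarize our conversation so far.", "expected": "respond"},
    {"query": "Book the cheapest flight to Tokyo.", "expected": "agent:booking_agent"},
]

def evaluate_planner(plan_next_step, cases) -> float:
    """Fraction of cases where the planner picks the expected next node."""
    correct = 0
    for case in cases:
        choice = plan_next_step(case["query"])  # the planner's decision for this query
        if choice == case["expected"]:
            correct += 1
    return correct / len(cases)
```

Exact-match scoring works when the branching options are a small, fixed set; for free-form plans you would need a fuzzier comparison, such as an LLM judge.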
Tool Calling
Tool calling forms the core of agentic behavior, enabling LLMs to interact with the world through external APIs and processes, call self-written functions, and invoke other agents. However, the flexibility of tool calling introduces failure points in tool selection, parameter choice, and tool execution itself.

To evaluate tool calling, we need to check whether the agent is selecting the correct tools and parameters, as well as whether the tool is executed successfully.
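Assuming your agent logs each tool call as a record with its name, arguments, and any error, one approach is to score the three failure points separately. The record layout and field names below are assumptions for illustration:

```python
def score_tool_call(call: dict, expected: dict) -> dict:
    """Score one logged tool call on selection, parameters, and execution.

    `call` is assumed to look like {"name": ..., "arguments": {...}, "error": None or str};
    adapt the keys to however your agent logs tool use.
    """
    correct_tool = call["name"] == expected["name"]
    # Parameter matching here checks only the expected fields; relax or tighten as needed.
    correct_params = correct_tool and all(
        call["arguments"].get(k) == v for k, v in expected["arguments"].items()
    )
    executed_ok = call.get("error") is None
    return {
        "tool_selection": correct_tool,
        "parameter_match": correct_params,
        "execution_success": executed_ok,
    }

# Example:
# score_tool_call(
#     {"name": "weather_api", "arguments": {"city": "Paris", "days": 1}, "error": None},
#     {"name": "weather_api", "arguments": {"city": "Paris"}},
# )
# -> {"tool_selection": True, "parameter_match": True, "execution_success": True}
```

Keeping the three scores separate matters: an agent that picks the right tool but passes malformed parameters needs a different fix than one that never reaches for the tool at all.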
Abilities
Abilities are specialized capabilities or modules that extend the agent's base functionality. They can be implemented as internal functions, scripts, or even as wrappers around tool-calls, but are often more tightly integrated into the agent's architecture.

Flowchart of abilities for a travel agent's itinerary generation trajectory.
You can evaluate abilities by checking the quality of the output, assuming they were invoked correctly (i.e. selected properly and called with the right input). Some examples of metrics you can use are:
- Hallucination detection for agent responses
- Instruction following for agent responses
- Comparison with ground truth responses for any task (a minimal sketch follows this list)
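For the ground-truth comparison in particular, a very rough sketch is below; the string-similarity check is only a stand-in, and in practice you would likely swap it for an LLM judge or embedding similarity on free-form outputs:

```python
from difflib import SequenceMatcher

def score_against_ground_truth(output: str, reference: str, threshold: float = 0.8) -> dict:
    """Crude ground-truth comparison for an ability's output.

    SequenceMatcher gives a character-level similarity in [0, 1]; the threshold
    is an arbitrary example value, not a recommendation.
    """
    similarity = SequenceMatcher(None, output.strip().lower(), reference.strip().lower()).ratio()
    return {"similarity": similarity, "passed": similarity >= threshold}
```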
Memory
Agent memory enables agents to retain and recall information during an interaction or across multiple trajectories. This can include user preferences, task-specific tips, or past successful runs that can help performance.
Agents can perform CRUD operations on memory during the course of an interaction. Each of these operations influences the agent's behavior and should be monitored and evaluated independently.

Tracking memory read/write operations can help you understand how your agent uses memory in response to edge cases and familiar tasks.
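One lightweight way to get that visibility is to wrap your memory store so every operation is logged alongside the rest of the trajectory. The class below is an illustrative sketch around a plain dictionary, not a specific memory backend:

```python
class InstrumentedMemory:
    """Key-value memory store that logs every CRUD operation for later evaluation."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.operations: list[dict] = []   # inspect this next to the agent's trajectory

    def _log(self, op: str, key: str, value: str | None = None) -> None:
        self.operations.append({"op": op, "key": key, "value": value})

    def create(self, key: str, value: str) -> None:
        self._store[key] = value
        self._log("create", key, value)

    def read(self, key: str) -> str | None:
        self._log("read", key)
        return self._store.get(key)

    def update(self, key: str, value: str) -> None:
        self._store[key] = value
        self._log("update", key, value)

    def delete(self, key: str) -> None:
        self._store.pop(key, None)
        self._log("delete", key)
```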
Reflection
After a subtask is complete or a response is generated, it can be helpful to query the agent to reflect on the output and whether it accomplished its goal. If it failed, the agent can re-attempt the task using new context informed by its original mistakes.
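A reflection loop of that kind can be sketched as follows; `attempt_task` and `critique` are hypothetical hooks into your agent, with `critique` assumed to return a pass/fail verdict plus notes:

```python
def run_with_reflection(attempt_task, critique, task: str, max_retries: int = 2) -> str:
    """Generate a response, ask the agent to critique it, and retry with that
    feedback if the critique says the goal was not met."""
    feedback = None
    response = attempt_task(task, feedback)
    for _ in range(max_retries):
        passed, notes = critique(task, response)
        if passed:
            break
        feedback = notes  # feed the agent's own critique into the next attempt
        response = attempt_task(task, feedback)
    return response
```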

For example, you can use our faithfulness scorer to check if the agent's response is factually aligned with the retrieval context/memory. Another use case is leveraging the instruction adherence scorer to check if the agent's output aligns with the original task rules/instructions.
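As a rough illustration of what a faithfulness check does conceptually (this is not the scorer's actual API), you can frame it as asking a judge model whether every claim in the response is supported by the context the agent had; `judge_llm` below is a hypothetical callable that takes a prompt string and returns the judge model's text output:

```python
FAITHFULNESS_PROMPT = """You are grading an agent's response for faithfulness.
Context the agent had access to:
{context}

Agent response:
{response}

Answer "yes" if every factual claim in the response is supported by the context,
otherwise answer "no" and list the unsupported claims."""

def check_faithfulness(judge_llm, context: str, response: str) -> bool:
    """Return True if the judge model says the response is grounded in the context."""
    verdict = judge_llm(FAITHFULNESS_PROMPT.format(context=context, response=response))
    return verdict.strip().lower().startswith("yes")
```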