Using Judgeval with AI Agents
Learn how to monitor your AI agents, evaluate them online, and test them offline with Judgeval.
Judgeval provides powerful tools to understand, test, and improve your multi-step AI agents. Whether you're building with LangChain, LangGraph, or a custom agentic framework, Judgeval helps you gain visibility and iterate faster. The platform is purpose-built for developers iterating on agents that rely on tool calling.
This guide outlines how to apply Judgeval's core capabilities—Monitoring, Online Evaluation, and Offline Evaluation—specifically to AI agents.
Monitoring Agents
Gain end-to-end observability into every step of your agent's execution, from initial input processing to tool calls, LLM interactions, and final output generation.
Key Benefits:
- Visualize complex agent flows.
- Identify bottlenecks and errors in specific agent tools or reasoning steps.
- Capture rich context for debugging and analysis.
Setup:
Use Judgeval's tracing to instrument your agent's components: the @observe decorator for Python functions/tools and wrap for LLM clients. Traces are automatically sent to the Judgment Labs platform for visualization.
from judgeval.tracer import Tracer

judgment = Tracer(project_name="my_agent_project")

@judgment.observe(span_type="agent_tool")
def my_agent_search_tool(query: str):
    # ... actual tool logic ...
    results = f"Results for {query}"
    return results

@judgment.observe(span_type="agent_chain")
def run_agent(user_input: str):
    # ... agent logic calling tools and LLMs ...
    tool_output = my_agent_search_tool(user_input)
    # ... further processing ...
    return "Final agent response"
Learn more about Monitoring & Tracing
Online Evaluation of Agents
Embed evaluations directly within your agent's workflow to get real-time feedback on the quality and safety of its outputs or intermediate steps.
Key Benefits:
- Perform real-time safety checks (e.g., for toxicity, PII).
- Assess the relevancy or faithfulness of tool outputs or LLM responses as they happen.
- Trigger alerts or fallback mechanisms based on live evaluation scores.
Setup:
Within your agent's tools or logic chains, use judgment.async_evaluate() with appropriate scorers.
from judgeval.data import Example
from judgeval.scorers import AnswerRelevancyScorer

@judgment.observe(span_type="agent_tool_with_eval")
def another_agent_tool(input_data: str, user_query: str):
    tool_result = f"Some output from tool based on {input_data}"

    eval_example = Example(input=user_query, actual_output=tool_result)
    judgment.async_evaluate(
        scorers=[AnswerRelevancyScorer(threshold=0.6)],
        example=eval_example,
        model="gpt-4.1"
    )
    return tool_result
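To cover the faithfulness checks mentioned above, you can evaluate a retrieval-backed tool's output against its retrieval context in the same way. A minimal sketch, reusing the judgment tracer and Example from the previous snippet and assuming a FaithfulnessScorer is available in judgeval.scorers; the span_type, threshold, and function names are illustrative.

from judgeval.scorers import FaithfulnessScorer  # assumption: available in your judgeval version

@judgment.observe(span_type="agent_rag_tool")
def retrieval_grounded_tool(user_query: str, retrieved_docs: list[str]):
    tool_result = f"Answer synthesized from {len(retrieved_docs)} documents"

    eval_example = Example(
        input=user_query,
        actual_output=tool_result,
        retrieval_context=retrieved_docs,
    )
    # Checks that the tool's output stays grounded in the retrieved context
    judgment.async_evaluate(
        scorers=[FaithfulnessScorer(threshold=0.7)],
        example=eval_example,
        model="gpt-4.1",
    )
    return tool_result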
Offline Evaluation of Agents
Systematically test your agent's performance across a dataset of inputs and expected outcomes. The ExecutionOrderScorer can be particularly useful here to verify that an agent follows an expected sequence of tool calls or actions.
Key Benefits:
- Benchmark agent versions and configurations.
- Catch regressions in agent behavior or specific tool performance.
- Validate improvements from prompt engineering or logic changes.
Setup:
Prepare a dataset of test cases (inputs, expected outputs, context). Create a task function that runs your agent for each test case. Then, use judgment_client.run_evaluation() with a suite of scorers to assess performance.
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import ExecutionOrderScorer, YourCustomAgentScorer  # YourCustomAgentScorer is a placeholder for your own custom scorer

judgment_client = JudgmentClient()

def agent_task_function(dataset_row):
    agent_input = dataset_row["input"]
    retrieval_context = dataset_row.get("retrieval_context")

    # Run your agent for this test case, e.g.:
    # agent_response = my_full_agent_execution(agent_input, retrieval_context)
    agent_response = "Placeholder agent response for testing"

    return {
        "input": agent_input,
        "actual_output": agent_response,
        "expected_output": dataset_row.get("expected_output"),
        "retrieval_context": retrieval_context,
    }
examples_data = [
    {"input": "test query 1", "expected_output": "expected answer 1"},
    {"input": "test query 2", "retrieval_context": ["context A"]}
]
examples = [Example(**row) for row in examples_data]

evaluation_results = judgment_client.run_evaluation(
    project_name="my_agent_offline_evals",
    eval_run_name="Agent v1.2 Benchmark",
    examples=examples,
    task=agent_task_function,
    scorers=[
        ExecutionOrderScorer(expected_sequence=["tool_call_1", "tool_call_2", "final_answer"]),
        YourCustomAgentScorer()
    ],
    model="gpt-4o"
)
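After the run completes, the returned results can be inspected programmatically or viewed in the Judgment Labs platform under the eval run name. A minimal sketch, assuming run_evaluation returns an iterable of per-example results:

# The exact result schema depends on your judgeval version
for result in evaluation_results:
    print(result)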