
Using Judgeval with AI Agents

Learn how to monitor your AI agents, evaluate them online, and test them offline using Judgeval.

Judgeval provides powerful tools to understand, test, and improve your multi-step AI agents. Whether you're building with LangChain, LangGraph, or a custom agentic framework, Judgeval helps you gain visibility and iterate faster. The platform is purpose-built for developers iterating on agents that use tool calling.

This guide outlines how to apply Judgeval's core capabilities—Monitoring, Online Evaluation, and Offline Evaluation—specifically to AI agents.


Monitoring Agents

Gain end-to-end observability into every step of your agent's execution, from initial input processing to tool calls, LLM interactions, and final output generation.

Key Benefits:

  • Visualize complex agent flows.
  • Identify bottlenecks and errors in specific agent tools or reasoning steps.
  • Capture rich context for debugging and analysis.

Setup: Instrument your agent's components with Judgeval's tracing utilities: the @observe decorator for Python functions/tools and wrap() for LLM clients. Traces are automatically sent to the Judgment Labs platform for visualization.

from judgeval.tracer import Tracer

judgment = Tracer(project_name="my_agent_project")

@judgment.observe(span_type="agent_tool")
def my_agent_search_tool(query: str):
    # ... actual tool logic ...
    results = f"Results for {query}"
    return results

@judgment.observe(span_type="agent_chain")
def run_agent(user_input: str):
    #... agent logic calling tools and LLMs ...
    tool_output = my_agent_search_tool(user_input)
    # ... further processing ...
    return "Final agent response"

Learn more about Monitoring & Tracing


Online Evaluation of Agents

Embed evaluations directly within your agent's workflow to get real-time feedback on the quality and safety of its outputs or intermediate steps.

Key Benefits:

  • Perform real-time safety checks (e.g., for toxicity, PII).
  • Assess the relevancy or faithfulness of tool outputs or LLM responses as they happen.
  • Trigger alerts or fallback mechanisms based on live evaluation scores.

Setup: Within your agent's tools or logic chains, use judgment.async_evaluate() with appropriate scorers.

from judgeval.data import Example
from judgeval.scorers import AnswerRelevancyScorer

@judgment.observe(span_type="agent_tool_with_eval")
def another_agent_tool(input_data: str, user_query: str):
    tool_result = f"Some output from tool based on {input_data}"
     
    eval_example = Example(input=user_query, actual_output=tool_result)
    judgment.async_evaluate(
        scorers=[AnswerRelevancyScorer(threshold=0.6)],
        example=eval_example,
        model="gpt-4.1"
    )
    return tool_result
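
To assess the faithfulness of a retrieval-backed tool's output against the documents it retrieved (the second bullet above), the same pattern applies with a retrieval_context on the Example. A minimal sketch, assuming a FaithfulnessScorer is available from judgeval.scorers; retrieval_tool and its retrieved documents are placeholders:

from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

@judgment.observe(span_type="agent_tool_with_eval")
def retrieval_tool(user_query: str):
    # Placeholder retrieval step; swap in your own retriever.
    retrieved_docs = ["Doc snippet A", "Doc snippet B"]
    answer = f"Answer to '{user_query}' grounded in the retrieved docs"

    judgment.async_evaluate(
        scorers=[FaithfulnessScorer(threshold=0.7)],
        example=Example(
            input=user_query,
            actual_output=answer,
            retrieval_context=retrieved_docs,
        ),
        model="gpt-4.1",
    )
    return answer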

Explore Evaluation Scorers


Offline Evaluation of Agents

Systematically test your agent's performance across a dataset of inputs and expected outcomes. The ExecutionOrderScorer can be particularly useful here to verify that an agent follows an expected sequence of tool calls or actions.

Key Benefits:

  • Benchmark agent versions and configurations.
  • Catch regressions in agent behavior or specific tool performance.
  • Validate improvements from prompt engineering or logic changes.

Setup: Prepare a dataset of test cases (inputs, expected outputs, context). Create a task function that runs your agent for each test case. Then, use judgment_client.run_evaluation() with a suite of scorers to assess performance.

from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import ExecutionOrderScorer, YourCustomAgentScorer  # YourCustomAgentScorer is an illustrative placeholder for your own scorer

judgment_client = JudgmentClient()

def agent_task_function(dataset_row):
    agent_input = dataset_row["input"]
    retrieval_context = dataset_row.get("retrieval_context")
    # Replace with a call to your agent's entry point, e.g.:
    # agent_response = my_full_agent_execution(agent_input, retrieval_context)
    agent_response = "Placeholder agent response for testing"
    return {
        "input": agent_input,
        "actual_output": agent_response,
        "expected_output": dataset_row.get("expected_output"),
        "retrieval_context": retrieval_context
    }

examples_data = [
    {"input": "test query 1", "expected_output": "expected answer 1"},
    {"input": "test query 2", "retrieval_context": ["context A"]}
]
examples = [Example(**row) for row in examples_data]

evaluation_results = judgment_client.run_evaluation(
    project_name="my_agent_offline_evals",
    eval_run_name="Agent v1.2 Benchmark",
    examples=examples,
    task=agent_task_function,
    scorers=[ExecutionOrderScorer(expected_sequence=['tool_call_1', 'tool_call_2', 'final_answer']), YourCustomAgentScorer()],
    model="gpt-4o"
)
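
If your test cases live in a file rather than inline, they can be loaded into Example objects before calling run_evaluation(). A minimal sketch, assuming a JSON Lines file; the examples.jsonl path and its field names are hypothetical:

import json
from judgeval.data import Example

def load_examples(path: str) -> list[Example]:
    # Each line is a JSON object with at least "input"; optional fields such
    # as "expected_output" and "retrieval_context" pass through to Example.
    examples = []
    with open(path) as f:
        for line in f:
            examples.append(Example(**json.loads(line)))
    return examples

examples = load_examples("examples.jsonl")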

Read more about Offline Evaluation