Agent Behavior Monitoring

Run real-time checks on your agents' behavior in production.

Agent behavior monitoring provides comprehensive observability and evaluation of your production agents. By combining tracing, scoring, alerting, and bucketing, you can track what your agents do, measure how well they perform, detect when they misbehave, and analyze behavior patterns - all in real time.


Tracing

Tracing captures all agent execution data using OpenTelemetry.

  • Auto-records LLM calls, tool usage, and metrics
  • Supports multi-agent and distributed systems

Scoring

Evaluate agent quality with built-in scorers, custom logic, or LLM-as-judge.

  • Server-hosted execution with zero latency impact
  • Scores individual interactions or full traces

Alerts

Alerts notify your team when behavior meets specific conditions.

  • Rules with frequency thresholds and cooldowns
  • Multi-channel notifications and automatic dataset routing

Bucketing

Bucketing automatically groups similar traces into analyzable patterns.

  • Guided rubric generation for classification
  • Real-time routing with TracePromptScorers

Complete Monitoring Setup

This quickstart demonstrates a complete agent behavior monitoring implementation.

Initialize Tracing

Set up tracing to capture all agent behavior. Tracing is the foundation - it records what your agent does so you can evaluate it.

agent.py
from judgeval.tracer import Tracer, wrap
from openai import OpenAI

# Initialize tracer for your project
judgment = Tracer(project_name="customer_service")

# Wrap OpenAI client to auto-trace all LLM calls
client = wrap(OpenAI())

Set your API credentials using environment variables: JUDGMENT_API_KEY, JUDGMENT_ORG_ID, and OPENAI_API_KEY.
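For example, you can export them in your shell before running the agent (the values shown are placeholders):

export JUDGMENT_API_KEY=your_judgment_api_key
export JUDGMENT_ORG_ID=your_judgment_org_id
export OPENAI_API_KEY=your_openai_api_key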

Create a Custom Scorer

Build scoring logic to evaluate agent behavior. This scorer checks if a customer service agent properly addresses package inquiries.

customer_service_scorer.py
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer
from openai import OpenAI

class CustomerRequest(Example):
    request: str
    response: str

class PackageInquiryScorer(ExampleScorer):
    name: str = "Package Inquiry Scorer"
    server_hosted: bool = True

    async def a_score_example(self, example: CustomerRequest):
        client = OpenAI()

        evaluation_prompt = f"""
        Evaluate if the customer service response adequately addresses a package inquiry.

        Customer request: {example.request}
        Agent response: {example.response}

        Does the response address package-related concerns? Answer only "YES" or "NO".
        """

        completion = client.chat.completions.create(
            model="gpt-5-mini",
            messages=[{"role": "user", "content": evaluation_prompt}]
        )

        evaluation = completion.choices[0].message.content.strip().upper()

        if evaluation == "YES":
            self.reason = "Response appropriately addresses package inquiry"
            return 1.0
        else:
            self.reason = "Response doesn't adequately address package inquiry"
            return 0.0

Deploy Server-Hosted Scorer

Upload your scorer to run in secure infrastructure with zero latency impact on your agent.

echo -e "pydantic\nopenai" > requirements.txt
judgeval upload_scorer customer_service_scorer.py requirements.txt

Server-hosted scorers run in Firecracker microVMs. Re-upload anytime with the --overwrite flag.
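For example, to replace a previously uploaded scorer (appending the flag at the end of the same command is an assumption about the CLI, based on the flag named above):

judgeval upload_scorer customer_service_scorer.py requirements.txt --overwrite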

Instrument Your Agent

Add tracing decorators and online evaluation to your agent code.

monitored_agent.py
from judgeval.tracer import Tracer, wrap
from openai import OpenAI
from customer_service_scorer import PackageInquiryScorer, CustomerRequest

judgment = Tracer(project_name="customer_service")
client = wrap(OpenAI())

class CustomerServiceAgent:
    @judgment.observe(span_type="tool")
    def handle_request(self, request: str) -> str:
        completion = client.chat.completions.create(
            model="gpt-5-mini",
            messages=[
                {"role": "system", "content": "You are a helpful customer service agent."},
                {"role": "user", "content": request}
            ]
        )

        response = completion.choices[0].message.content

        # Evaluate behavior with server-hosted scorer
        judgment.async_evaluate(
            scorer=PackageInquiryScorer(),
            example=CustomerRequest(request=request, response=response),
            sampling_rate=0.95
        )

        return response

    @judgment.agent()
    @judgment.observe(span_type="function")
    def run(self, request: str) -> str:
        return self.handle_request(request)

agent = CustomerServiceAgent()
result = agent.run("Where is my package? I ordered it last week.")
print(result)

What's happening:

  • @judgment.observe() captures execution traces
  • wrap(OpenAI()) auto-tracks LLM calls
  • judgment.async_evaluate() scores behavior in real-time
  • sampling_rate=0.95 evaluates 95% of requests

View traces with scoring results in the Judgment platform:

Trace with online evaluation

Configure Alerts

Set up a rule with alerts to notify your team when the scorer detects problematic behavior.

Navigate to the Rules page in your project dashboard and create a rule:

Basic Configuration:

  • Rule Name: Package Inquiry Failures
  • Description: Alert when agent fails to address package inquiries

Conditions:

  • Metric: Package Inquiry Scorer
  • Operator: fails (or < 0.5)
  • Alert Frequency: 3 times in 5 minutes
  • Cooldown: 30 minutes

Actions:

  • Add to dataset: package-inquiry-failures
  • Send Slack notification
  • Email: your email address

Set Up Bucketing

Configure bucketing to automatically group similar behavioral patterns.

Navigate to Bucketing in the sidebar and create a configuration:

  1. Describe your agent's purpose to improve bucketing accuracy
  2. Create a bucketing configuration with a rubric-based classifier
  3. Review sample traces and mark them as Accept/Reject
  4. Refine the rubric based on suggested improvements
  5. Finalize and name your configuration

Bucketing automatically routes matching traces to datasets for pattern analysis.

Bucketing configurations

Your agent now has complete behavior monitoring: tracing captures what it does, scoring evaluates quality, alerts notify about issues, and bucketing groups patterns for analysis.


Advanced Features

Multi-Agent System Monitoring

When monitoring multi-agent systems, behavior tracking becomes more complex as multiple agents coordinate and interact. The @judgment.agent() decorator identifies which agent is responsible for each action, enabling you to:

  • Track agent-specific behavior patterns
  • Identify which agent caused failures
  • Measure individual agent performance
  • Analyze inter-agent communication

Only decorate the entry point method of each agent with @judgment.agent() and @judgment.observe(). Other methods within the same agent only need @judgment.observe().

Here's a complete multi-agent monitoring example:

main.py
from planning_agent import PlanningAgent

if __name__ == "__main__":
    planning_agent = PlanningAgent("planner-1")
    goal = "Build a multi-agent system"
    result = planning_agent.invoke_agent(goal)
    print(result)

utils.py
from judgeval.tracer import Tracer

judgment = Tracer(project_name="multi-agent-system")

planning_agent.py
from utils import judgment
from research_agent import ResearchAgent
from task_agent import TaskAgent

class PlanningAgent:
    def __init__(self, id):
        self.id = id

    @judgment.agent() # Only add @judgment.agent() to the entry point function of the agent
    @judgment.observe()
    def invoke_agent(self, goal):
        print(f"Agent {self.id} is planning for goal: {goal}")

        research_agent = ResearchAgent("Researcher1")
        task_agent = TaskAgent("Tasker1")

        research_results = research_agent.invoke_agent(goal)
        task_result = task_agent.invoke_agent(research_results)

        return f"Results from planning and executing for goal '{goal}': {task_result}"

    @judgment.observe() # No need to add @judgment.agent() here
    def random_tool(self):
        pass

research_agent.py
from utils import judgment

class ResearchAgent:
    def __init__(self, id):
        self.id = id

    @judgment.agent()
    @judgment.observe()
    def invoke_agent(self, topic):
        return f"Research notes for topic: {topic}: Findings and insights include..."
task_agent.py
from utils import judgment

class TaskAgent:
    def __init__(self, id):
        self.id = id

    @judgment.agent()
    @judgment.observe()
    def invoke_agent(self, task):
        result = f"Performed task: {task}, here are the results: Results include..."
        return result

The trace clearly shows which agent performed each action:

Trace with agent names

Each agent's actions are clearly labeled, making it easy to track execution flow and identify which agent caused issues. You can then:

  • Create agent-specific scorers to evaluate individual agent behavior (see the sketch after this list)
  • Set up alerts targeting specific agents
  • Use bucketing to group traces by agent behavior patterns
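As an illustration, an agent-specific scorer for the ResearchAgent could reuse the ExampleScorer pattern from the quickstart. This is a minimal sketch; the class, field names, and heuristic below are illustrative and not part of the library:

research_scorer.py
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer

class ResearchOutput(Example):
    topic: str
    notes: str

class ResearchQualityScorer(ExampleScorer):
    name: str = "Research Quality Scorer"
    server_hosted: bool = True

    async def a_score_example(self, example: ResearchOutput):
        # Simple illustrative heuristic: the notes should mention the topic
        # and contain more than a trivial amount of content.
        if example.topic.lower() in example.notes.lower() and len(example.notes) > 50:
            self.reason = "Research notes reference the topic and have substance"
            return 1.0
        self.reason = "Research notes are missing the topic or are too short"
        return 0.0

You would then call judgment.async_evaluate() with this scorer inside ResearchAgent.invoke_agent, just as the quickstart calls it inside handle_request.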

Toggling Monitoring

Control monitoring at runtime using environment variables:

Disable all tracing:

export JUDGMENT_MONITORING=false

Disable scoring only (traces still captured):

export JUDGMENT_EVALUATIONS=false

This is useful for:

  • Development environments where monitoring isn't needed
  • High-volume periods where you want to reduce overhead
  • Debugging scenarios where you need clean execution
  • Cost optimization during testing
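
For example, to run the quickstart agent with scoring disabled for a single session while still capturing traces:

JUDGMENT_EVALUATIONS=false python monitored_agent.py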

Next Steps

Explore each component of agent behavior monitoring in detail:

  • Tracing - Deep dive into OpenTelemetry-based tracing, distributed tracing, and advanced instrumentation
  • Rules and Alerts - Configure sophisticated alerting rules with frequency thresholds, cooldowns, and multi-channel notifications
  • Bucketing - Set up automated behavior classification and pattern analysis
  • Custom Scorers - Build advanced scoring logic for your specific use cases
  • Prompt Scorers - Use LLM-as-judge patterns for flexible evaluation