Online Behavioral Monitoring
Run real-time checks on your agents' behavior in production.
Online behavioral monitoring lets you run custom scorers directly on your live agents in production, alerting engineers as soon as agents begin to misbehave so they can ship fixes before customers are affected.
Quickstart
Get your agents monitored in production with server-hosted scorers: zero latency impact and secure execution.
Create your Custom Scorer
Build scoring logic to evaluate your agent's behavior. This example monitors a customer service agent to ensure it addresses package inquiries.
We've defined the scoring logic in customer_service_scorer.py:
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer
from openai import OpenAI

# Define your data structure
class CustomerRequest(Example):
    request: str
    response: str

# Create your custom scorer
class PackageInquiryScorer(ExampleScorer):
    name: str = "Package Inquiry Scorer"
    server_hosted: bool = True  # Enable server hosting

    async def a_score_example(self, example: CustomerRequest):
        client = OpenAI()

        # Use an LLM to evaluate whether the response addresses the package inquiry
        evaluation_prompt = f"""
        Evaluate if the customer service response adequately addresses a package inquiry.

        Customer request: {example.request}
        Agent response: {example.response}

        Does the response address package-related concerns? Answer only "YES" or "NO".
        """

        completion = client.chat.completions.create(
            model="gpt-5-mini",
            messages=[{"role": "user", "content": evaluation_prompt}]
        )
        evaluation = completion.choices[0].message.content.strip().upper()

        if evaluation == "YES":
            self.reason = "LLM evaluation: Response appropriately addresses package inquiry"
            return 1.0
        else:
            self.reason = "LLM evaluation: Response doesn't adequately address package inquiry"
            return 0.0
Upload your Scorer
Create a requirements.txt listing your scorer's dependencies, then deploy the scorer to our secure infrastructure with a single command:
echo -e "pydantic\nopenai" > requirements.txt
uv run judgeval upload_scorer customer_service_scorer.py requirements.txt
Monitor your Agent in Production
Instrument your agent with tracing and online evaluation:
from judgeval.tracer import Tracer, wrap
from openai import OpenAI
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer
from customer_service_scorer import PackageInquiryScorer, CustomerRequest

judgment = Tracer(project_name="customer_service")
client = wrap(OpenAI())  # Auto-tracks all LLM calls

class CustomerServiceAgent:
    @judgment.observe(span_type="tool")
    def handle_request(self, request: str) -> str:
        # Generate a response using OpenAI
        completion = client.chat.completions.create(
            model="gpt-5-mini",
            messages=[
                {"role": "system", "content": "You are a helpful customer service agent. Address customer inquiries professionally and helpfully."},
                {"role": "user", "content": request}
            ]
        )
        response = completion.choices[0].message.content

        # Online evaluation with the server-hosted scorer
        judgment.async_evaluate(
            scorer=PackageInquiryScorer(),
            example=CustomerRequest(request=request, response=response),
            sampling_rate=0.95  # Scores 95% of agent runs
        )

        return response

    @judgment.agent()
    @judgment.observe(span_type="function")
    def run(self, request: str) -> str:
        return self.handle_request(request)

# Example usage
agent = CustomerServiceAgent()
result = agent.run("Where is my package? I ordered it last week.")
print(result)
Key Components:
- wrap(OpenAI()) automatically tracks all LLM API calls
- @judgment.observe() captures all agent interactions
- judgment.async_evaluate() runs hosted scorers with zero latency impact
- sampling_rate controls behavior scoring frequency (0.95 = 95% of requests; see the sketch below)
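If scoring every request is unnecessary or costly at high traffic, you can lower the sampling rate. Here is a minimal sketch reusing the async_evaluate call from the agent above; the 0.10 value is purely illustrative:

judgment.async_evaluate(
    scorer=PackageInquiryScorer(),
    example=CustomerRequest(request=request, response=response),
    sampling_rate=0.10  # Illustrative: scores roughly 10% of agent runs
)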
Scorers can take time to execute, so their results may appear slightly delayed in the UI.
You should see the online scoring results attached to the relevant trace span on the Judgment platform:

Why monitor agents in production?
Your agents evolve continuously through prompt updates, fine-tuning, and model changes, but without real-time monitoring, behavioral drift goes unnoticed until users complain. Online monitoring solves this by:
Detecting misbehavior as it happens:
- Communication issues that introduce friction into customer interactions
- Context engineering failures during live interactions leading to incorrect answers, negatively affecting user trust
- Tool misuse with real customer data, decreasing the business value of your agents
Acting before impact spreads:
- Real-time alerts when scoring thresholds are breached
- Behavior grouping and discovery: Bucket similar failures for systematic debugging while surfacing new emergent behaviors from novel inputs, unseen context, or changing usage patterns
- Immediate visibility into which specific interactions are failing
- Production data automatically fed into your improvement loops
Advanced Features
Multi-Agent System Tracing
When working with multi-agent systems, use the @judgment.agent() decorator to clearly identify which agent is calling which method throughout a trace.
Here's a complete multi-agent system example with a flat folder structure:
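For orientation, the example is split across five files whose module names follow from the imports; main.py is an assumed name for the entry point:

multi-agent-system/
├── main.py             # entry point
├── utils.py            # shared Tracer instance
├── planning_agent.py
├── research_agent.py
└── task_agent.py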
# main.py (entry point; file name assumed)
from planning_agent import PlanningAgent

if __name__ == "__main__":
    planning_agent = PlanningAgent("planner-1")
    goal = "Build a multi-agent system"
    result = planning_agent.plan(goal)
    print(result)

# utils.py (shared Tracer instance)
from judgeval.tracer import Tracer

judgment = Tracer(project_name="multi-agent-system")

# planning_agent.py
from utils import judgment
from research_agent import ResearchAgent
from task_agent import TaskAgent

class PlanningAgent:
    def __init__(self, id):
        self.id = id

    @judgment.agent()
    @judgment.observe()
    def plan(self, goal):
        print(f"Agent {self.id} is planning for goal: {goal}")
        research_agent = ResearchAgent("Researcher1")
        task_agent = TaskAgent("Tasker1")
        research_results = research_agent.research(goal)
        task_result = task_agent.perform_task(research_results)
        return f"Results from planning and executing for goal '{goal}': {task_result}"

# research_agent.py
from utils import judgment

class ResearchAgent:
    def __init__(self, id):
        self.id = id

    @judgment.agent()
    @judgment.observe()
    def research(self, topic):
        return f"Research notes for topic: {topic}: Findings and insights include..."

# task_agent.py
from utils import judgment

class TaskAgent:
    def __init__(self, id):
        self.id = id

    @judgment.agent()
    @judgment.observe()
    def perform_task(self, task):
        result = f"Performed task: {task}, here are the results: Results include..."
        return result
The trace will show up in the Judgment platform, clearly indicating which agent called which method:

Each agent's methods are clearly associated with their respective classes, making it easy to follow the execution flow across your multi-agent system.
Toggling Monitoring
If your setup requires you to toggle monitoring in production-level environments, you can disable monitoring by:
- Setting the JUDGMENT_MONITORING environment variable to false (disables tracing):
  export JUDGMENT_MONITORING=false
- Setting the JUDGMENT_EVALUATIONS environment variable to false (disables scoring on traces):
  export JUDGMENT_EVALUATIONS=false
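If you prefer to toggle these per environment at process startup, a minimal sketch is shown below. It assumes both variables are read when the Tracer is created, and APP_ENV is a hypothetical variable used only for illustration:

import os

# Hypothetical toggle: disable tracing and trace scoring outside production.
# Assumes JUDGMENT_MONITORING and JUDGMENT_EVALUATIONS are read when the Tracer is created.
if os.getenv("APP_ENV") != "production":
    os.environ["JUDGMENT_MONITORING"] = "false"    # Disables tracing
    os.environ["JUDGMENT_EVALUATIONS"] = "false"   # Disables scoring on traces

from judgeval.tracer import Tracer
judgment = Tracer(project_name="customer_service")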
Next steps
Alerts
Take action on your agent failures by configuring alerts triggered by your agents' behavior in production.