Online Behavioral Monitoring
Run real-time checks on your agents' behavior in production.
Online behavioral monitoring lets you run custom scorers directly on your live agents in production, alerting engineers as soon as agents begin to misbehave so they can ship fixes before customers are affected.
Quickstart
Get your agents monitored in production with server-hosted scorers: zero latency impact and secure execution.
Create your Custom Scorer
Build scoring logic to evaluate your agent's behavior. This example monitors a customer service agent to ensure it addresses package inquiries.
We've defined the scoring logic in customer_service_scorer.py:
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer
from openai import OpenAI

# Define your data structure
class CustomerRequest(Example):
    request: str
    response: str

# Create your custom scorer
class PackageInquiryScorer(ExampleScorer):
    name: str = "Package Inquiry Scorer"
    server_hosted: bool = True  # Enable server hosting

    async def a_score_example(self, example: CustomerRequest):
        client = OpenAI()

        # Use an LLM to evaluate whether the response addresses the package inquiry
        evaluation_prompt = f"""
        Evaluate if the customer service response adequately addresses a package inquiry.

        Customer request: {example.request}
        Agent response: {example.response}

        Does the response address package-related concerns? Answer only "YES" or "NO".
        """

        completion = client.chat.completions.create(
            model="gpt-5-mini",
            messages=[{"role": "user", "content": evaluation_prompt}]
        )
        evaluation = completion.choices[0].message.content.strip().upper()

        if evaluation == "YES":
            self.reason = "LLM evaluation: Response appropriately addresses package inquiry"
            return 1.0
        else:
            self.reason = "LLM evaluation: Response doesn't adequately address package inquiry"
            return 0.0
Upload your Scorer
Create a requirements.txt listing your scorer's dependencies, then deploy the scorer to our secure infrastructure with a single command:
echo -e "pydantic\nopenai" > requirements.txt
uv run judgeval upload_scorer customer_service_scorer.py requirements.txt
Monitor your Agent in Production
Instrument your agent with tracing and online evaluation:
from judgeval.tracer import Tracer, wrap
from openai import OpenAI
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer
from customer_service_scorer import PackageInquiryScorer, CustomerRequest

judgment = Tracer(project_name="customer_service")
client = wrap(OpenAI())  # Auto-tracks all LLM calls

class CustomerServiceAgent:
    @judgment.observe(span_type="tool")
    def handle_request(self, request: str) -> str:
        # Generate a response using OpenAI
        completion = client.chat.completions.create(
            model="gpt-5-mini",
            messages=[
                {"role": "system", "content": "You are a helpful customer service agent. Address customer inquiries professionally and helpfully."},
                {"role": "user", "content": request}
            ]
        )
        response = completion.choices[0].message.content

        # Online evaluation with the server-hosted scorer
        judgment.async_evaluate(
            scorer=PackageInquiryScorer(),
            example=CustomerRequest(request=request, response=response),
            sampling_rate=0.95  # Scores 95% of agent runs
        )

        return response

    @judgment.agent()
    @judgment.observe(span_type="function")
    def run(self, request: str) -> str:
        return self.handle_request(request)

# Example usage
agent = CustomerServiceAgent()
result = agent.run("Where is my package? I ordered it last week.")
print(result)
Key Components:
- wrap(OpenAI()) automatically tracks all LLM API calls
- @judgment.observe() captures all agent interactions
- judgment.async_evaluate() runs hosted scorers with zero latency impact
- sampling_rate controls behavior scoring frequency (0.95 = 95% of requests; see the sketch below)
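If scoring every request is unnecessary or costly at high traffic, you can lower the sampling rate. Here is a minimal sketch reusing the async_evaluate call from the agent above; the 0.10 value is purely illustrative:

judgment.async_evaluate(
    scorer=PackageInquiryScorer(),
    example=CustomerRequest(request=request, response=response),
    sampling_rate=0.10  # Illustrative: scores roughly 10% of agent runs
)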
Scorers can take time to execute, so their results may appear slightly delayed in the UI.
You should see the online scoring results attached to the relevant trace span on the Judgment platform:

Why monitor agents in production?
Your agents evolve continuously through prompt updates, fine-tuning, and model changes, but without real-time monitoring, behavioral drift goes unnoticed until users complain. Online monitoring solves this by:
Detecting misbehavior as it happens:
- Communication issues that introduce friction into customer interactions
- Context engineering failures during live interactions leading to incorrect answers, negatively affecting user trust
- Tool misuse with real customer data, decreasing the business value of your agents
Acting before impact spreads:
- Real-time alerts when scoring thresholds are breached
- Behavior grouping and discovery: Bucket similar failures for systematic debugging while surfacing new emergent behaviors from novel inputs, unseen context, or changing usage patterns
- Immediate visibility into which specific interactions are failing
- Production data automatically fed into your improvement loops
Advanced Features
Multi-Agent System Tracing
When working with multi-agent systems, use the @judgment.agent() decorator to clearly identify which agent is calling which method throughout a trace.
Here's a complete multi-agent system example with a flat folder structure:
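For orientation, the example is split across five files whose module names follow from the imports; main.py is an assumed name for the entry point:

multi-agent-system/
├── main.py             # entry point
├── utils.py            # shared Tracer instance
├── planning_agent.py
├── research_agent.py
└── task_agent.py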
# main.py (entry point; file name assumed)
from planning_agent import PlanningAgent

if __name__ == "__main__":
    planning_agent = PlanningAgent("planner-1")
    goal = "Build a multi-agent system"
    result = planning_agent.plan(goal)
    print(result)

# utils.py (shared Tracer instance)
from judgeval.tracer import Tracer

judgment = Tracer(project_name="multi-agent-system")

# planning_agent.py
from utils import judgment
from research_agent import ResearchAgent
from task_agent import TaskAgent

class PlanningAgent:
    def __init__(self, id):
        self.id = id

    @judgment.agent()
    @judgment.observe()
    def plan(self, goal):
        print(f"Agent {self.id} is planning for goal: {goal}")
        research_agent = ResearchAgent("Researcher1")
        task_agent = TaskAgent("Tasker1")
        research_results = research_agent.research(goal)
        task_result = task_agent.perform_task(research_results)
        return f"Results from planning and executing for goal '{goal}': {task_result}"

# research_agent.py
from utils import judgment

class ResearchAgent:
    def __init__(self, id):
        self.id = id

    @judgment.agent()
    @judgment.observe()
    def research(self, topic):
        return f"Research notes for topic: {topic}: Findings and insights include..."

# task_agent.py
from utils import judgment

class TaskAgent:
    def __init__(self, id):
        self.id = id

    @judgment.agent()
    @judgment.observe()
    def perform_task(self, task):
        result = f"Performed task: {task}, here are the results: Results include..."
        return result
The trace will show up in the Judgment platform, clearly indicating which agent called which method:

Each agent's methods are clearly associated with their respective classes, making it easy to follow the execution flow across your multi-agent system.
Toggling Monitoring
If your setup requires you to toggle monitoring in production-level environments, you can disable monitoring by:
- Setting the JUDGMENT_MONITORING environment variable to false (disables tracing):
  export JUDGMENT_MONITORING=false
- Setting the JUDGMENT_EVALUATIONS environment variable to false (disables scoring on traces):
  export JUDGMENT_EVALUATIONS=false
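If you prefer to toggle these per environment at process startup, a minimal sketch is shown below. It assumes both variables are read when the Tracer is created, and APP_ENV is a hypothetical variable used only for illustration:

import os

# Hypothetical toggle: disable tracing and trace scoring outside production.
# Assumes JUDGMENT_MONITORING and JUDGMENT_EVALUATIONS are read when the Tracer is created.
if os.getenv("APP_ENV") != "production":
    os.environ["JUDGMENT_MONITORING"] = "false"    # Disables tracing
    os.environ["JUDGMENT_EVALUATIONS"] = "false"   # Disables scoring on traces

from judgeval.tracer import Tracer
judgment = Tracer(project_name="customer_service")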
Next steps
Alerts
Take action on your agent failures by configuring alerts triggered by your agents' behavior in production.