Agent Behavioral Monitoring
Run real-time checks on your agents' behavior in production.
Agent behavioral monitoring (ABM) lets you run custom scorers directly against your live agents in production, alerting engineers the moment an agent begins to misbehave so they can push hotfixes before customers are affected.
Quickstart
Get your agents monitored in production with server-hosted scorers: zero latency impact and secure execution.
Create your Custom Scorer
Build scoring logic to evaluate your agent's behavior. This example monitors a customer service agent to ensure it addresses package inquiries.
We've defined the scoring logic in customer_service_scorer.py:
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer
from openai import OpenAI

# Define your data structure
class CustomerRequest(Example):
    request: str
    response: str

# Create your custom scorer
class PackageInquiryScorer(ExampleScorer):
    name: str = "Package Inquiry Scorer"
    server_hosted: bool = True  # Enable server hosting

    async def a_score_example(self, example: CustomerRequest):
        client = OpenAI()

        # Use an LLM to evaluate whether the response addresses the package inquiry
        evaluation_prompt = f"""
        Evaluate if the customer service response adequately addresses a package inquiry.

        Customer request: {example.request}
        Agent response: {example.response}

        Does the response address package-related concerns? Answer only "YES" or "NO".
        """

        completion = client.chat.completions.create(
            model="gpt-5-mini",
            messages=[{"role": "user", "content": evaluation_prompt}]
        )
        evaluation = completion.choices[0].message.content.strip().upper()

        if evaluation == "YES":
            self.reason = "LLM evaluation: Response appropriately addresses package inquiry"
            return 1.0
        else:
            self.reason = "LLM evaluation: Response doesn't adequately address package inquiry"
            return 0.0
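Before uploading, you can sanity-check the scoring logic locally by calling a_score_example directly. A minimal sketch, assuming customer_service_scorer.py is importable and OPENAI_API_KEY is set in your environment:

import asyncio

from customer_service_scorer import PackageInquiryScorer, CustomerRequest

async def main():
    example = CustomerRequest(
        request="Where is my package? I ordered it last week.",
        response="I'm sorry for the delay. Your package shipped yesterday and should arrive within two business days.",
    )
    scorer = PackageInquiryScorer()
    score = await scorer.a_score_example(example)  # Runs the LLM judge defined above
    print(score, scorer.reason)

if __name__ == "__main__":
    asyncio.run(main())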
Upload your Scorer
Deploy your scorer to our secure infrastructure with a single command:
echo -e "pydantic\nopenai" > requirements.txt
uv run judgeval upload_scorer customer_service_scorer.py requirements.txt
echo -e "pydantic\nopenai" > requirements.txt
judgeval upload_scorer customer_service_scorer.py requirements.txt
Monitor your Agent in Production
Instrument your agent with tracing and online evaluation:
from judgeval.tracer import Tracer, wrap
from openai import OpenAI
from customer_service_scorer import PackageInquiryScorer, CustomerRequest

judgment = Tracer(project_name="customer_service")
client = wrap(OpenAI())  # Auto-tracks all LLM calls

class CustomerServiceAgent:
    @judgment.observe(span_type="tool")
    def handle_request(self, request: str) -> str:
        # Generate response using OpenAI
        completion = client.chat.completions.create(
            model="gpt-5-mini",
            messages=[
                {"role": "system", "content": "You are a helpful customer service agent. Address customer inquiries professionally and helpfully."},
                {"role": "user", "content": request}
            ]
        )
        response = completion.choices[0].message.content

        # Online evaluation with server-hosted scorer
        judgment.async_evaluate(
            scorer=PackageInquiryScorer(),
            example=CustomerRequest(request=request, response=response),
            sampling_rate=0.95  # Scores 95% of agent runs
        )

        return response

    @judgment.agent()
    @judgment.observe(span_type="function")
    def run(self, request: str) -> str:
        return self.handle_request(request)

# Example usage
agent = CustomerServiceAgent()
result = agent.run("Where is my package? I ordered it last week.")
print(result)
Key Components:
- wrap(OpenAI()) automatically tracks all LLM API calls
- @judgment.observe() captures all agent interactions
- judgment.async_evaluate() runs hosted scorers with zero latency impact
- sampling_rate controls behavior scoring frequency (0.95 = 95% of requests); see the sketch below
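If you want to tune how often behavior is scored without redeploying, one option is to read the sampling rate from configuration. A minimal sketch, where the AGENT_SCORING_SAMPLE_RATE environment variable is our own naming convention, not something judgeval defines:

import os

from judgeval.tracer import Tracer
from customer_service_scorer import PackageInquiryScorer, CustomerRequest

judgment = Tracer(project_name="customer_service")

# Hypothetical env var (our own convention): lets you raise or lower
# scoring frequency per environment without changing code.
SAMPLING_RATE = float(os.getenv("AGENT_SCORING_SAMPLE_RATE", "0.95"))

def evaluate_response(request: str, response: str) -> None:
    # Same async_evaluate call as in the agent above, with the rate injected from config
    judgment.async_evaluate(
        scorer=PackageInquiryScorer(),
        example=CustomerRequest(request=request, response=response),
        sampling_rate=SAMPLING_RATE,
    )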
Scorers can take time to execute, so they may appear slightly delayed on the UI.
You should see the online scoring results attached to the relevant trace span on the Judgment platform:

Advanced Features
Multi-Agent System Tracing
When working with multi-agent systems, use the @judgment.agent()
decorator to track which agent is responsible for each tool call in your trace.
Here's a complete multi-agent system example with a flat folder structure:
# Entry point
from planning_agent import PlanningAgent

if __name__ == "__main__":
    planning_agent = PlanningAgent("planner-1")
    goal = "Build a multi-agent system"
    result = planning_agent.invoke_agent(goal)
    print(result)

# utils.py
from judgeval.tracer import Tracer

judgment = Tracer(project_name="multi-agent-system")

# planning_agent.py
from utils import judgment
from research_agent import ResearchAgent
from task_agent import TaskAgent

class PlanningAgent:
    def __init__(self, id):
        self.id = id

    @judgment.agent()  # Only add @judgment.agent() to the entry point function of the agent
    @judgment.observe()
    def invoke_agent(self, goal):
        print(f"Agent {self.id} is planning for goal: {goal}")
        research_agent = ResearchAgent("Researcher1")
        task_agent = TaskAgent("Tasker1")
        research_results = research_agent.invoke_agent(goal)
        task_result = task_agent.invoke_agent(research_results)
        return f"Results from planning and executing for goal '{goal}': {task_result}"

    @judgment.observe()  # No need to add @judgment.agent() here
    def random_tool(self):
        pass

# research_agent.py
from utils import judgment

class ResearchAgent:
    def __init__(self, id):
        self.id = id

    @judgment.agent()
    @judgment.observe()
    def invoke_agent(self, topic):
        return f"Research notes for topic: {topic}: Findings and insights include..."

# task_agent.py
from utils import judgment

class TaskAgent:
    def __init__(self, id):
        self.id = id

    @judgment.agent()
    @judgment.observe()
    def invoke_agent(self, task):
        result = f"Performed task: {task}, here are the results: Results include..."
        return result
The trace will show up in the Judgment platform clearly indicating which agent called which method:

Each agent's tool calls are clearly associated with their respective classes, making it easy to follow the execution flow across your multi-agent system.
Toggling Monitoring
If your setup requires you to toggle monitoring intermittently, you can disable monitoring by:
- Setting the JUDGMENT_MONITORING environment variable to false (disables tracing):
export JUDGMENT_MONITORING=false
- Setting the JUDGMENT_EVALUATIONS environment variable to false (disables scoring on traces):
export JUDGMENT_EVALUATIONS=false
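If you'd rather control this from code (for example, in a test harness), one approach is to set the variables before constructing the Tracer. A minimal sketch, assuming the SDK reads these environment variables at initialization:

import os

# Disable tracing and trace scoring before judgeval is initialized
# (assumption: the SDK picks these variables up at startup).
os.environ["JUDGMENT_MONITORING"] = "false"
os.environ["JUDGMENT_EVALUATIONS"] = "false"

from judgeval.tracer import Tracer

judgment = Tracer(project_name="customer_service")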
Next steps
Alerts
Take action on agent failures by configuring alerts that trigger on your agents' behavior in production.