Get Started
judgeval is an Agent Behavior Monitoring (ABM) library that helps track and judge any agent behavior in online and offline environments. judgeval also enables error analysis on agent trajectories and groups trajectories by behavior and topic for deeper analysis.
judgeval is built and maintained by Judgment Labs. You can follow our latest updates via GitHub.
Get Running in Under 2 Minutes
Install Judgeval
uv add judgeval
# or with pip
pip install judgeval
Get your API keys
Head to the Judgment Platform and create an account. Then, copy your API key and Organization ID and set them as environment variables.
Get your free API keys
You get 50,000 free trace spans and 1,000 free evals each month. No credit card required.
# Set in your shell
export JUDGMENT_API_KEY="your_key_here"
export JUDGMENT_ORG_ID="your_org_id_here"

# Or add to your .env file
JUDGMENT_API_KEY="your_key_here"
JUDGMENT_ORG_ID="your_org_id_here"
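If you keep the keys in a .env file, they still need to be loaded into the process environment before your agent runs. A minimal sketch using python-dotenv (the use of python-dotenv is an assumption about your setup, not a judgeval requirement):

import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads JUDGMENT_API_KEY and JUDGMENT_ORG_ID from .env into the environment
assert os.getenv("JUDGMENT_API_KEY") and os.getenv("JUDGMENT_ORG_ID"), "Judgment credentials not set"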
Monitor your Agents' Behavior in Production
Online behavioral monitoring lets you run scorers directly on your agents in production. The instant an agent misbehaves, engineers can be alerted to push a hotfix before customers are affected.
Create a Behavior Scorer
First, create a hosted behavior scorer that runs securely in the cloud:
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer

# Define custom example class with any fields you want to expose to the scorer
class QuestionAnswer(Example):
    question: str
    answer: str

# Define a server-hosted custom scorer
class HelpfulnessScorer(ExampleScorer):
    name: str = "Helpfulness Scorer"
    server_hosted: bool = True  # Enable server hosting

    async def a_score_example(self, example: QuestionAnswer):
        # Custom scoring logic for agent behavior
        # Can be an arbitrary combination of code and LLM calls
        if len(example.answer) > 10 and "?" not in example.answer:
            self.reason = "Answer is detailed and provides helpful information"
            return 1.0
        else:
            self.reason = "Answer is too brief or unclear"
            return 0.0
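Before uploading, you can sanity-check the scoring logic locally by calling the async scoring method directly. A minimal sketch, assuming the classes above are saved in helpfulness_scorer.py:

import asyncio

from helpfulness_scorer import HelpfulnessScorer, QuestionAnswer

scorer = HelpfulnessScorer()
example = QuestionAnswer(
    question="What is the capital of France?",
    answer="The capital of France is Paris.",
)
# Calls the scoring method in-process, so you can iterate on the logic quickly
score = asyncio.run(scorer.a_score_example(example))
print(score, scorer.reason)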
Upload your Scorer
Deploy your scorer to our secure infrastructure:
echo "pydantic" > requirements.txt
uv run judgeval upload_scorer helpfulness_scorer.py requirements.txt
echo "pydantic" > requirements.txt
judgeval upload_scorer helpfulness_scorer.py requirements.txt
2025-09-27 17:54:06 - judgeval - INFO - Auto-detected scorer name: 'Helpfulness Scorer'
2025-09-27 17:54:08 - judgeval - INFO - Successfully uploaded custom scorer: Helpfulness Scorer
Monitor Your Agent Using Custom Scorers
Now instrument your agent with tracing and online evaluation:
from openai import OpenAI

from judgeval.tracer import Tracer, wrap
from helpfulness_scorer import HelpfulnessScorer, QuestionAnswer

judgment = Tracer(project_name="default_project")  # organizes traces
client = wrap(OpenAI())  # tracks all LLM calls

@judgment.observe(span_type="tool")
def format_task(question: str) -> str:
    return f"Please answer the following question: {question}"

@judgment.observe(span_type="tool")
def answer_question(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

@judgment.observe(span_type="function")
def run_agent(question: str) -> str:
    task = format_task(question)
    answer = answer_question(task)
    # Add online evaluation with server-hosted scorer
    judgment.async_evaluate(
        scorer=HelpfulnessScorer(),
        example=QuestionAnswer(question=question, answer=answer),
        sampling_rate=0.9  # Evaluate 90% of agent runs
    )
    return answer

if __name__ == "__main__":
    result = run_agent("What is the capital of the United States?")
    print(result)
Congratulations! You've just created your first trace with production monitoring. It should look like this:

Key Benefits:
@judgment.observe() captures all agent interactions
judgment.async_evaluate() runs hosted scorers with zero latency impact
sampling_rate controls behavior scoring frequency (0.9 = 90% of agent runs)
Regression test your Agents
Judgeval lets you use agent-specific behavior rubrics as regression tests in your CI pipelines, stress-testing agent behavior before it reaches production.
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer

client = JudgmentClient()

class CorrectnessExample(Example):
    question: str
    answer: str

class CorrectnessScorer(ExampleScorer):
    name: str = "Correctness Scorer"

    async def a_score_example(self, example: CorrectnessExample) -> float:
        # Replace this logic with your own scoring logic
        if "Washington, D.C." in example.answer:
            self.reason = "The answer is correct because it contains 'Washington, D.C.'."
            return 1.0
        self.reason = "The answer is incorrect because it does not contain 'Washington, D.C.'."
        return 0.0

example = CorrectnessExample(
    question="What is the capital of the United States?",  # Question to your agent (input to your agent!)
    answer="The capital of the U.S. is Washington, D.C.",  # Output from your agent (invoke your agent here!)
)

client.run_evaluation(
    examples=[example],
    scorers=[CorrectnessScorer()],
    project_name="default_project",
)
Your test should have passed! Let's break down what happened.
question and answer represent the question from the user and the answer from the agent.
CorrectnessScorer() is a custom-defined scorer that statically checks if the output contains the correct answer. This scorer can be arbitrarily defined in code, including LLM-as-a-judge and any dependencies you'd like! See examples here.
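To gate a CI pipeline on a behavior check, one option is to wrap the scorer in an ordinary test so a failing score fails the build. A minimal sketch using pytest (the test framework choice and the correctness_scorer module name are assumptions; CorrectnessScorer and CorrectnessExample are the classes defined above):

import asyncio

import pytest

from correctness_scorer import CorrectnessScorer, CorrectnessExample  # hypothetical module name

@pytest.mark.parametrize("answer,expected", [
    ("The capital of the U.S. is Washington, D.C.", 1.0),
    ("I'm not sure.", 0.0),
])
def test_capital_answer_scoring(answer, expected):
    scorer = CorrectnessScorer()
    example = CorrectnessExample(
        question="What is the capital of the United States?",
        answer=answer,
    )
    # Run the async scorer synchronously inside the test
    score = asyncio.run(scorer.a_score_example(example))
    assert score == expected, scorer.reason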
Next Steps
Congratulations! You've just finished getting started with judgeval and the Judgment Platform.
Explore our features in more detail below:

Agentic Behavior Rubrics
Measure and optimize your agent along any behavioral rubric, using techniques such as LLM-as-a-judge and human-aligned rubrics.
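The scorers above use static string checks, but the same ExampleScorer pattern supports LLM-as-a-judge. A minimal sketch, assuming an OpenAI key is configured; the prompt, model, and verdict parsing here are illustrative choices, not a prescribed judgeval API:

from openai import OpenAI

from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer

class QuestionAnswer(Example):
    question: str
    answer: str

class LLMJudgeScorer(ExampleScorer):
    name: str = "LLM Judge Scorer"

    async def a_score_example(self, example: QuestionAnswer) -> float:
        # Ask a model to grade the answer and parse a YES/NO verdict
        client = OpenAI()
        response = client.chat.completions.create(
            model="gpt-4.1",
            messages=[{
                "role": "user",
                "content": (
                    "Does the answer below helpfully and correctly address the question? "
                    "Reply with exactly YES or NO.\n"
                    f"Question: {example.question}\nAnswer: {example.answer}"
                ),
            }],
        )
        verdict = response.choices[0].message.content.strip().upper()
        self.reason = f"LLM judge verdict: {verdict}"
        return 1.0 if verdict.startswith("YES") else 0.0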