Get Started
judgeval is an Agent Behavior Monitoring (ABM) platform that catches agents misbehaving in production, enables error analysis on agent trajectories by grouping them into distinct topics, and captures the data needed for reinforcement learning.
judgeval is built and maintained by Judgment Labs.
You can follow our latest updates via GitHub.
Get Running in Under 2 Minutes
Install Judgeval
uv add judgeval
pip install judgeval
Get your API keys
Head to the Judgment Platform and create an account. Then, copy your API key and Organization ID and set them as environment variables.
Get your free API keys
You get 50,000 free trace spans and 1,000 free evals each month. No credit card required.
export JUDGMENT_API_KEY="your_key_here"
export JUDGMENT_ORG_ID="your_org_id_here"
# Add to your .env file
JUDGMENT_API_KEY="your_key_here"
JUDGMENT_ORG_ID="your_org_id_here"
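If you go the .env route, load the file at startup before initializing judgeval. A minimal sketch using python-dotenv (an assumption on our part; any environment loader works):

# Load JUDGMENT_API_KEY and JUDGMENT_ORG_ID from .env before using judgeval
# (assumes the python-dotenv package is installed)
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory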
Monitor your Agents' Behavior in Production
Online behavioral monitoring lets you run behavior scorers directly on your live agents in production, alerting engineers the moment an agent begins to misbehave so they can push hotfixes before customers are affected.
Create a Behavior Scorer
For production monitoring with zero latency impact, create a hosted behavior scorer that runs securely in the cloud:
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer

# Define custom example class
class QuestionAnswer(Example):
    question: str
    answer: str

# Define a server-hosted custom scorer
class HelpfulnessScorer(ExampleScorer):
    name: str = "Helpfulness Scorer"
    server_hosted: bool = True  # Enable server hosting

    async def a_score_example(self, example: QuestionAnswer):
        # Custom scoring logic for agent behavior
        # Can be an arbitrary combination of code and LLM calls
        if len(example.answer) > 10 and "?" not in example.answer:
            self.reason = "Answer is detailed and provides helpful information"
            return 1.0
        else:
            self.reason = "Answer is too brief or unclear"
            return 0.0
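Before uploading, you can sanity-check the scorer locally by calling its scoring method directly. A minimal sketch (the asyncio driver below is our own, not part of judgeval; it assumes the classes above live in helpfulness_scorer.py):

# Quick local check of HelpfulnessScorer before deploying it
import asyncio

from helpfulness_scorer import HelpfulnessScorer, QuestionAnswer

example = QuestionAnswer(
    question="What is the capital of the United States?",
    answer="The capital of the United States is Washington, D.C.",
)
scorer = HelpfulnessScorer()
score = asyncio.run(scorer.a_score_example(example))
print(score, scorer.reason)  # expected: 1.0 and the "detailed and helpful" reason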
Upload your Scorer
Deploy your scorer to our secure infrastructure:
echo "pydantic" > requirements.txt
uv run judgeval upload_scorer helpfulness_scorer.py requirements.txt
echo "pydantic" > requirements.txt
judgeval upload_scorer helpfulness_scorer.py requirements.txt
Monitor your Agent
Now instrument your agent with tracing and online evaluation:
from judgeval.tracer import Tracer, wrap
from helpfulness_scorer import HelpfulnessScorer, QuestionAnswer

judgment = Tracer(project_name="default_project")

@judgment.observe(span_type="tool")
def format_task(question: str) -> str:
    return f"Please answer the following question: {question}"

@judgment.observe(span_type="tool")
def answer_question(prompt: str) -> str:
    return "The capital of the United States is Washington, D.C.! It's located on the Potomac River and serves as the political center of the country."

@judgment.observe(span_type="function")
def run_agent(question: str) -> str:
    task = format_task(question)
    answer = answer_question(task)
    # Add online evaluation with server-hosted scorer
    judgment.async_evaluate(
        scorer=HelpfulnessScorer(),
        example=QuestionAnswer(question=question, answer=answer),
        sampling_rate=0.9,  # Evaluate 90% of agent runs
    )
    return answer

if __name__ == "__main__":
    result = run_agent("What is the capital of the United States?")
    print(result)
Congratulations! You've just created your first trace with production monitoring. It should look like this:

Key Benefits:
- @judgment.observe() captures all agent interactions
- judgment.async_evaluate() runs hosted scorers with zero latency impact
- sampling_rate controls behavior scoring frequency (0.9 = 90% of agent runs)
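The example above imports wrap but never calls it; wrap is typically used to wrap an LLM client so that model calls show up as their own spans in the trace. A minimal sketch, assuming an OpenAI client (the model name and prompt here are placeholders, not prescribed by this guide):

# Sketch: wrap an LLM client so its calls are captured in the trace
# (assumes the openai package is installed and OPENAI_API_KEY is set)
from openai import OpenAI
from judgeval.tracer import Tracer, wrap

judgment = Tracer(project_name="default_project")
client = wrap(OpenAI())  # calls made through `client` are recorded by the tracer

@judgment.observe(span_type="tool")
def answer_question(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content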
Regression test your Agents
Judgeval enables you to use agent-specific behavior rubrics as regression tests in your CI pipelines to stress-test agent behavior before your agent deploys into production.
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer

client = JudgmentClient()

class CorrectnessExample(Example):
    question: str
    answer: str

class CorrectnessScorer(ExampleScorer):
    name: str = "Correctness Scorer"

    async def a_score_example(self, example: CorrectnessExample) -> float:
        # Replace this logic with your own scoring logic
        if "Washington, D.C." in example.answer:
            self.reason = "The answer is correct because it contains 'Washington, D.C.'."
            return 1.0
        self.reason = "The answer is incorrect because it does not contain 'Washington, D.C.'."
        return 0.0

example = CorrectnessExample(
    question="What is the capital of the United States?",  # Question to your agent (input to your agent!)
    answer="The capital of the U.S. is Washington, D.C.",  # Output from your agent (invoke your agent here!)
)

client.run_evaluation(
    examples=[example],
    scorers=[CorrectnessScorer()],
    project_name="default_project",
)
Your test should have passed! Let's break down what happened.
- question and answer represent the question from the user and the answer from the agent.
- CorrectnessScorer() is a custom-defined scorer that statically checks whether the output contains the correct answer. Scorers can be arbitrarily defined in code, including LLM-as-a-judge and any dependencies you'd like (a sketch of an LLM-as-a-judge variant follows below). See examples here.
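As an illustration of the LLM-as-a-judge option, the sketch below swaps the static string check for a model call. The OpenAI client, model name, and judge prompt are our assumptions, not part of judgeval:

# Sketch: an LLM-as-a-judge variant of the correctness scorer
# (assumes the openai package is installed and OPENAI_API_KEY is set)
from openai import AsyncOpenAI
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer

llm = AsyncOpenAI()

class CorrectnessExample(Example):
    question: str
    answer: str

class LLMCorrectnessScorer(ExampleScorer):
    name: str = "LLM Correctness Scorer"

    async def a_score_example(self, example: CorrectnessExample) -> float:
        response = await llm.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{
                "role": "user",
                "content": (
                    f"Question: {example.question}\n"
                    f"Answer: {example.answer}\n"
                    "Reply with exactly YES if the answer is correct, otherwise NO."
                ),
            }],
        )
        verdict = response.choices[0].message.content.strip().upper()
        self.reason = f"LLM judge verdict: {verdict}"
        return 1.0 if verdict.startswith("YES") else 0.0

You could pass LLMCorrectnessScorer() to client.run_evaluation exactly like the static scorer above.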
Next Steps
Congratulations! You've just finished getting started with judgeval and the Judgment Platform.
Explore our features in more detail below:

Agentic Behavior Rubrics
Measure and optimize your agent along any behavioral rubric, using techniques such as LLM-as-a-judge and human-aligned rubrics.