Get Started
judgeval is an Agent Behavior Monitoring (ABM) platform that catches agents misbehaving in production, enables error analysis on agent trajectories by grouping them into distinct topics, and captures the data needed for reinforcement learning.
judgeval is built and maintained by Judgment Labs.
You can follow our latest updates via GitHub.
Get Running in Under 2 Minutes
Install Judgeval
uv add judgeval
pip install judgeval
Get your API keys
Head to the Judgment Platform and create an account. Then, copy your API key and Organization ID and set them as environment variables.
Get your free API keys
You get 50,000 free trace spans and 1,000 free evals each month. No credit card required.
export JUDGMENT_API_KEY="your_key_here"
export JUDGMENT_ORG_ID="your_org_id_here"
# Add to your .env file
JUDGMENT_API_KEY="your_key_here"
JUDGMENT_ORG_ID="your_org_id_here"
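If you go the .env route, load the file at startup before initializing judgeval. A minimal sketch using python-dotenv (an assumption on our part; any environment loader works):

# Load JUDGMENT_API_KEY and JUDGMENT_ORG_ID from .env before using judgeval
# (assumes the python-dotenv package is installed)
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory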
Monitor your Agents' Behavior in Production
Online behavioral monitoring lets you run behavior scorers directly on your live agents in production, alerting engineers the moment an agent begins to misbehave so they can push hotfixes before customers are affected.
Create a Behavior Scorer
For production monitoring with zero latency impact, create a hosted behavior scorer that runs securely in the cloud:
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer

# Define custom example class
class QuestionAnswer(Example):
    question: str
    answer: str

# Define a server-hosted custom scorer
class HelpfulnessScorer(ExampleScorer):
    name: str = "Helpfulness Scorer"
    server_hosted: bool = True  # Enable server hosting

    async def a_score_example(self, example: QuestionAnswer):
        # Custom scoring logic for agent behavior
        # Can be an arbitrary combination of code and LLM calls
        if len(example.answer) > 10 and "?" not in example.answer:
            self.reason = "Answer is detailed and provides helpful information"
            return 1.0
        else:
            self.reason = "Answer is too brief or unclear"
            return 0.0
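Before uploading, you can sanity-check the scorer locally by calling its scoring method directly. A minimal sketch (the asyncio driver below is our own, not part of judgeval; it assumes the classes above live in helpfulness_scorer.py):

# Quick local check of HelpfulnessScorer before deploying it
import asyncio

from helpfulness_scorer import HelpfulnessScorer, QuestionAnswer

example = QuestionAnswer(
    question="What is the capital of the United States?",
    answer="The capital of the United States is Washington, D.C.",
)
scorer = HelpfulnessScorer()
score = asyncio.run(scorer.a_score_example(example))
print(score, scorer.reason)  # expected: 1.0 and the "detailed and helpful" reason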
Upload your Scorer
Deploy your scorer to our secure infrastructure:
echo "pydantic" > requirements.txt
uv run judgeval upload_scorer helpfulness_scorer.py requirements.txt
echo "pydantic" > requirements.txt
judgeval upload_scorer helpfulness_scorer.py requirements.txt
Monitor your Agent
Now instrument your agent with tracing and online evaluation:
from judgeval.tracer import Tracer, wrap
from helpfulness_scorer import HelpfulnessScorer, QuestionAnswer

judgment = Tracer(project_name="default_project")

@judgment.observe(span_type="tool")
def format_task(question: str) -> str:
    return f"Please answer the following question: {question}"

@judgment.observe(span_type="tool")
def answer_question(prompt: str) -> str:
    return "The capital of the United States is Washington, D.C.! It's located on the Potomac River and serves as the political center of the country."

@judgment.observe(span_type="function")
def run_agent(question: str) -> str:
    task = format_task(question)
    answer = answer_question(task)
    # Add online evaluation with server-hosted scorer
    judgment.async_evaluate(
        scorer=HelpfulnessScorer(),
        example=QuestionAnswer(question=question, answer=answer),
        sampling_rate=0.9,  # Evaluate 90% of agent runs
    )
    return answer

if __name__ == "__main__":
    result = run_agent("What is the capital of the United States?")
    print(result)
Congratulations! You've just created your first trace with production monitoring. It should look like this:

Key Benefits:
- @judgment.observe() captures all agent interactions
- judgment.async_evaluate() runs hosted scorers with zero latency impact
- sampling_rate controls behavior scoring frequency (0.9 = 90% of agent runs)
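The example above imports wrap but never calls it; wrap is typically used to wrap an LLM client so that model calls show up as their own spans in the trace. A minimal sketch, assuming an OpenAI client (the model name and prompt here are placeholders, not prescribed by this guide):

# Sketch: wrap an LLM client so its calls are captured in the trace
# (assumes the openai package is installed and OPENAI_API_KEY is set)
from openai import OpenAI
from judgeval.tracer import Tracer, wrap

judgment = Tracer(project_name="default_project")
client = wrap(OpenAI())  # calls made through `client` are recorded by the tracer

@judgment.observe(span_type="tool")
def answer_question(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content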
Regression test your Agents
Judgeval enables you to use agent-specific behavior rubrics as regression tests in your CI pipelines to stress-test agent behavior before your agent deploys into production.
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer

client = JudgmentClient()

class CorrectnessExample(Example):
    question: str
    answer: str

class CorrectnessScorer(ExampleScorer):
    name: str = "Correctness Scorer"

    async def a_score_example(self, example: CorrectnessExample) -> float:
        # Replace this logic with your own scoring logic
        if "Washington, D.C." in example.answer:
            self.reason = "The answer is correct because it contains 'Washington, D.C.'."
            return 1.0
        self.reason = "The answer is incorrect because it does not contain 'Washington, D.C.'."
        return 0.0

example = CorrectnessExample(
    question="What is the capital of the United States?",  # Question to your agent (input to your agent!)
    answer="The capital of the U.S. is Washington, D.C.",  # Output from your agent (invoke your agent here!)
)

client.run_evaluation(
    examples=[example],
    scorers=[CorrectnessScorer()],
    project_name="default_project",
)
Your test should have passed! Let's break down what happened.
- question and answer represent the question from the user and the answer from the agent.
- CorrectnessScorer() is a custom-defined scorer that statically checks whether the output contains the correct answer. Scorers can be arbitrarily defined in code, including LLM-as-a-judge and any dependencies you'd like (a sketch of an LLM-as-a-judge variant follows below). See examples here.
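As an illustration of the LLM-as-a-judge option, the sketch below swaps the static string check for a model call. The OpenAI client, model name, and judge prompt are our assumptions, not part of judgeval:

# Sketch: an LLM-as-a-judge variant of the correctness scorer
# (assumes the openai package is installed and OPENAI_API_KEY is set)
from openai import AsyncOpenAI
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer

llm = AsyncOpenAI()

class CorrectnessExample(Example):
    question: str
    answer: str

class LLMCorrectnessScorer(ExampleScorer):
    name: str = "LLM Correctness Scorer"

    async def a_score_example(self, example: CorrectnessExample) -> float:
        response = await llm.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{
                "role": "user",
                "content": (
                    f"Question: {example.question}\n"
                    f"Answer: {example.answer}\n"
                    "Reply with exactly YES if the answer is correct, otherwise NO."
                ),
            }],
        )
        verdict = response.choices[0].message.content.strip().upper()
        self.reason = f"LLM judge verdict: {verdict}"
        return 1.0 if verdict.startswith("YES") else 0.0

You could pass LLMCorrectnessScorer() to client.run_evaluation exactly like the static scorer above.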
Next Steps
Congratulations! You've just finished getting started with judgeval and the Judgment Platform.
Explore our features in more detail below:

Agentic Behavior Rubrics
Measure and optimize your agent along any behavioral rubric, using techniques such as LLM-as-a-judge and human-aligned rubrics.