Get Started
judgeval is an Agent Behavior Monitoring (ABM) library that helps track and judge any agent behavior in online and offline environments. judgeval also enables error analysis on agent trajectories and groups trajectories by behavior and topic for deeper analysis.
judgeval is built and maintained by Judgment Labs. You can follow our latest updates via GitHub.
Get Running in Under 2 Minutes
Install Judgeval
uv add judgeval
# or with pip
pip install judgeval
Get your API keys
Head to the Judgment Platform and create an account. Then, copy your API key and Organization ID and set them as environment variables.
Get your free API keys
You get 50,000 free trace spans and 1,000 free evals each month. No credit card required.
# Set in your shell
export JUDGMENT_API_KEY="your_key_here"
export JUDGMENT_ORG_ID="your_org_id_here"

# Or add to your .env file
JUDGMENT_API_KEY="your_key_here"
JUDGMENT_ORG_ID="your_org_id_here"
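If you keep the keys in a .env file, they still need to be loaded into the process environment before your agent runs. A minimal sketch using python-dotenv (the use of python-dotenv is an assumption about your setup, not a judgeval requirement):

import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads JUDGMENT_API_KEY and JUDGMENT_ORG_ID from .env into the environment
assert os.getenv("JUDGMENT_API_KEY") and os.getenv("JUDGMENT_ORG_ID"), "Judgment credentials not set"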
Monitor your Agents' Behavior in Production
Online behavioral monitoring lets you run scorers directly on your agents in production. The instant an agent misbehaves, engineers can be alerted to push a hotfix before customers are affected.
Create a Behavior Scorer
First, create a hosted behavior scorer that runs securely in the cloud:
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer

# Define custom example class with any fields you want to expose to the scorer
class QuestionAnswer(Example):
    question: str
    answer: str

# Define a server-hosted custom scorer
class HelpfulnessScorer(ExampleScorer):
    name: str = "Helpfulness Scorer"
    server_hosted: bool = True  # Enable server hosting

    async def a_score_example(self, example: QuestionAnswer):
        # Custom scoring logic for agent behavior
        # Can be an arbitrary combination of code and LLM calls
        if len(example.answer) > 10 and "?" not in example.answer:
            self.reason = "Answer is detailed and provides helpful information"
            return 1.0
        else:
            self.reason = "Answer is too brief or unclear"
            return 0.0
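Before uploading, you can sanity-check the scoring logic locally by calling the async scoring method directly. A minimal sketch, assuming the classes above are saved in helpfulness_scorer.py:

import asyncio

from helpfulness_scorer import HelpfulnessScorer, QuestionAnswer

scorer = HelpfulnessScorer()
example = QuestionAnswer(
    question="What is the capital of France?",
    answer="The capital of France is Paris.",
)
# Calls the scoring method in-process, so you can iterate on the logic quickly
score = asyncio.run(scorer.a_score_example(example))
print(score, scorer.reason)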
Upload your Scorer
Deploy your scorer to our secure infrastructure:
echo "pydantic" > requirements.txt
uv run judgeval upload_scorer helpfulness_scorer.py requirements.txt
echo "pydantic" > requirements.txt
judgeval upload_scorer helpfulness_scorer.py requirements.txt
2025-09-27 17:54:06 - judgeval - INFO - Auto-detected scorer name: 'Helpfulness Scorer'
2025-09-27 17:54:08 - judgeval - INFO - Successfully uploaded custom scorer: Helpfulness Scorer
Monitor Your Agent Using Custom Scorers
Now instrument your agent with tracing and online evaluation:
from openai import OpenAI

from judgeval.tracer import Tracer, wrap
from helpfulness_scorer import HelpfulnessScorer, QuestionAnswer

judgment = Tracer(project_name="default_project")  # organizes traces
client = wrap(OpenAI())  # tracks all LLM calls

@judgment.observe(span_type="tool")
def format_task(question: str) -> str:
    return f"Please answer the following question: {question}"

@judgment.observe(span_type="tool")
def answer_question(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

@judgment.observe(span_type="function")
def run_agent(question: str) -> str:
    task = format_task(question)
    answer = answer_question(task)
    # Add online evaluation with server-hosted scorer
    judgment.async_evaluate(
        scorer=HelpfulnessScorer(),
        example=QuestionAnswer(question=question, answer=answer),
        sampling_rate=0.9  # Evaluate 90% of agent runs
    )
    return answer

if __name__ == "__main__":
    result = run_agent("What is the capital of the United States?")
    print(result)
Congratulations! You've just created your first trace with production monitoring. It should look like this:

Key Benefits:
@judgment.observe() captures all agent interactions
judgment.async_evaluate() runs hosted scorers with zero latency impact
sampling_rate controls behavior scoring frequency (0.9 = 90% of agent runs)
Regression test your Agents
Judgeval lets you use agent-specific behavior rubrics as regression tests in your CI pipelines, stress-testing agent behavior before it reaches production.
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer

client = JudgmentClient()

class CorrectnessExample(Example):
    question: str
    answer: str

class CorrectnessScorer(ExampleScorer):
    name: str = "Correctness Scorer"

    async def a_score_example(self, example: CorrectnessExample) -> float:
        # Replace this logic with your own scoring logic
        if "Washington, D.C." in example.answer:
            self.reason = "The answer is correct because it contains 'Washington, D.C.'."
            return 1.0
        self.reason = "The answer is incorrect because it does not contain 'Washington, D.C.'."
        return 0.0

example = CorrectnessExample(
    question="What is the capital of the United States?",  # Question to your agent (input to your agent!)
    answer="The capital of the U.S. is Washington, D.C.",  # Output from your agent (invoke your agent here!)
)

client.run_evaluation(
    examples=[example],
    scorers=[CorrectnessScorer()],
    project_name="default_project",
)
Your test should have passed! Let's break down what happened.
question and answer represent the question from the user and the answer from the agent.
CorrectnessScorer() is a custom-defined scorer that statically checks if the output contains the correct answer. This scorer can be arbitrarily defined in code, including LLM-as-a-judge and any dependencies you'd like! See examples here.
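To gate a CI pipeline on a behavior check, one option is to wrap the scorer in an ordinary test so a failing score fails the build. A minimal sketch using pytest (the test framework choice and the correctness_scorer module name are assumptions; CorrectnessScorer and CorrectnessExample are the classes defined above):

import asyncio

import pytest

from correctness_scorer import CorrectnessScorer, CorrectnessExample  # hypothetical module name

@pytest.mark.parametrize("answer,expected", [
    ("The capital of the U.S. is Washington, D.C.", 1.0),
    ("I'm not sure.", 0.0),
])
def test_capital_answer_scoring(answer, expected):
    scorer = CorrectnessScorer()
    example = CorrectnessExample(
        question="What is the capital of the United States?",
        answer=answer,
    )
    # Run the async scorer synchronously inside the test
    score = asyncio.run(scorer.a_score_example(example))
    assert score == expected, scorer.reason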
Next Steps
Congratulations! You've just finished getting started with judgeval and the Judgment Platform.
Explore our features in more detail below:

Agentic Behavior Rubrics
Measure and optimize your agent along any behavioral rubric, using techniques such as LLM-as-a-judge and human-aligned rubrics.
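The scorers above use static string checks, but the same ExampleScorer pattern supports LLM-as-a-judge. A minimal sketch, assuming an OpenAI key is configured; the prompt, model, and verdict parsing here are illustrative choices, not a prescribed judgeval API:

from openai import OpenAI

from judgeval.data import Example
from judgeval.scorers.example_scorer import ExampleScorer

class QuestionAnswer(Example):
    question: str
    answer: str

class LLMJudgeScorer(ExampleScorer):
    name: str = "LLM Judge Scorer"

    async def a_score_example(self, example: QuestionAnswer) -> float:
        # Ask a model to grade the answer and parse a YES/NO verdict
        client = OpenAI()
        response = client.chat.completions.create(
            model="gpt-4.1",
            messages=[{
                "role": "user",
                "content": (
                    "Does the answer below helpfully and correctly address the question? "
                    "Reply with exactly YES or NO.\n"
                    f"Question: {example.question}\nAnswer: {example.answer}"
                ),
            }],
        )
        verdict = response.choices[0].message.content.strip().upper()
        self.reason = f"LLM judge verdict: {verdict}"
        return 1.0 if verdict.startswith("YES") else 0.0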