Get Started
judgeval is a post-building library for AI agents that comes with everything you need to trace your agents and test them with evals.
judgeval is built and maintained by Judgment Labs.
Get your free Judgment API keys
Head to the Judgment Platform and create an account. Then, copy your API key and Organization ID into your environment variables.
export JUDGMENT_API_KEY="your_key_here"
export JUDGMENT_ORG_ID="your_org_id_here"
You get 50,000 free trace spans and 1,000 free evals each month. No credit card required.
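judgeval reads these credentials from your environment, so a quick sanity check can save a confusing first run. This sketch uses only the Python standard library; nothing in it is judgeval-specific:

import os

# Fail fast if the Judgment credentials are missing from the environment.
for var in ("JUDGMENT_API_KEY", "JUDGMENT_ORG_ID"):
    if not os.environ.get(var):
        raise RuntimeError(f"{var} is not set; see the export commands above")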
Install Judgeval
pip install judgeval
You can follow our latest updates via GitHub! If you enjoy using Judgeval, consider giving us a star ⭐!
Start Tracing with Judgeval
Tracing starts with wrapping your LLM client with wrap() and observing your tools with @judgment.observe.
from judgeval.tracer import Tracer, wrap
from openai import OpenAI

client = wrap(OpenAI())  # tracks all LLM calls
judgment = Tracer(project_name="my_project")

@judgment.observe(span_type="tool")
def format_question(question: str) -> str:
    # dummy tool
    return f"Question : {question}"

@judgment.observe(span_type="function")
def run_agent(prompt: str) -> str:
    task = format_question(prompt)
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": task}]
    )
    return response.choices[0].message.content

run_agent("What is the capital of the United States?")
Congratulations! You've just created your first trace. You can view it in your project on the Judgment Platform.
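Observed functions also compose: when one decorated function calls another, the inner call should appear as a nested span in the trace. Here is a minimal sketch that reuses the client and judgment objects from above; lookup_population and its canned answer are hypothetical stand-ins for a real tool:

@judgment.observe(span_type="tool")
def lookup_population(city: str) -> str:
    # Hypothetical stand-in for a real data source.
    return f"{city} has roughly 700,000 residents."

@judgment.observe(span_type="function")
def run_research_agent(prompt: str) -> str:
    # Calling one observed function from another should nest the spans.
    context = lookup_population("Washington, D.C.")
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": f"{prompt}\n\nContext: {context}"}]
    )
    return response.choices[0].message.content

run_research_agent("How many people live in the capital of the United States?")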
To learn more about using judgeval's tracing module, click here.
Start Testing with Judgeval
Judgeval enables you to use evals as unit tests in your CI pipelines.
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

client = JudgmentClient()

task = "What is the capital of the United States?"
example = Example(
    input=task,
    actual_output=run_agent(task),  # e.g. "The capital of the U.S. is Washington, D.C."
    retrieval_context=["Washington D.C. was founded in 1790 and became the capital of the U.S."],
)

scorer = FaithfulnessScorer(threshold=0.5)
client.assert_test(
    examples=[example],
    scorers=[scorer],
    model="gpt-4.1",
)
Your test should have passed! Let's break down what happened.
- input and actual_output represent what your agent system begins with and returns.
- retrieval_context represents the retrieved context from your knowledge sources.
- FaithfulnessScorer(threshold=0.5) is a scorer that checks if the output is hallucinated relative to the retrieved context. In a unit test, the threshold is the minimum score an example must reach to pass.
- We chose gpt-4.1 as our judge model to measure faithfulness. Judgment Labs offers any judge model for your evaluation needs.
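Because assert_test fails when an example scores below a scorer's threshold, the same call drops straight into a pytest suite, which is how you would wire these evals into CI. A sketch, assuming run_agent from the tracing example above is importable (the test name here is hypothetical):

from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

client = JudgmentClient()

def test_capital_question():
    task = "What is the capital of the United States?"
    example = Example(
        input=task,
        actual_output=run_agent(task),
        retrieval_context=["Washington D.C. was founded in 1790 and became the capital of the U.S."],
    )
    # assert_test is expected to raise if any example falls below its
    # scorer's threshold, so a plain `pytest` run catches regressions.
    client.assert_test(
        examples=[example],
        scorers=[FaithfulnessScorer(threshold=0.5)],
        model="gpt-4.1",
    )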
Next Steps
Congratulations! You've just finished getting started with judgeval and the Judgment Platform.
For a deeper dive into using judgeval, learn more about our offerings!

Tracing
Observe your agent's inputs/outputs, tool calls, and LLM calls to debug your agent runs.
Unit Testing
Test your agent against predefined examples to catch regressions before they reach production.
Evaluation
Measure and optimize your agent along any quality metric, from hallucinations to tool-calling accuracy.