Quickstarts
This guide will help you get started with the essential components of Judgeval.
Installation
pip install judgeval
Our team is always making new releases of the judgeval package! To get the latest version, run pip install judgeval --upgrade.
You can follow our latest updates via our GitHub! If you enjoy using Judgeval, consider giving us a star ⭐!
Judgment API Keys
Our API keys give you access to the JudgmentClient and Tracer objects, which enable you to export your agent traces and evals, manage datasets, and visualize your agent's performance on the Judgment Platform.
To get your account and organization API keys, create an account for free on the Judgment Platform.
export JUDGMENT_API_KEY="your_key_here"
export JUDGMENT_ORG_ID="your_org_id_here"
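judgeval picks these credentials up from your environment when you create a JudgmentClient or Tracer. If you want a quick, purely optional sanity check before running anything:
import os

# Fail fast if the Judgment credentials are missing from the environment
assert os.getenv("JUDGMENT_API_KEY"), "JUDGMENT_API_KEY is not set"
assert os.getenv("JUDGMENT_ORG_ID"), "JUDGMENT_ORG_ID is not set"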
The Judgeval package natively integrates tracing, tests, and evals with the Judgment Platform. You can sign up for free and view your data on our dashboards in seconds!
Tracing
Traces enable you to monitor your agent system's execution and measure telemetry such as cost and latency, as well as quality metrics such as tool error rates, hallucination, and task completion.
from judgeval.common.tracer import Tracer, wrap
from openai import OpenAI

client = wrap(OpenAI())
judgment = Tracer(project_name="my_project")

@judgment.observe(span_type="tool")
def my_tool():
    return "Hello world!"

@judgment.observe(span_type="function")
def main():
    task_input = my_tool()
    res = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": f"{task_input}"}]
    )
    return res.choices[0].message.content

# Calling the observed function implicitly starts and saves the trace
main()
Congratulations! You've just created your first trace. You can view it on the Judgment Platform dashboard.
There are many benefits of monitoring your agent systems with judgeval tracing, including:
- Debugging agent workflows in seconds with full observability
- Using production workflow data to create experimental datasets for future improvement/optimization
- Tracking and creating Slack/Email alerts on any metric (e.g. latency, cost, tool error rates, etc.)
To learn more about judgeval's tracing module, click here.
Evaluation
You can run evals on predetermined examples with any of the judgeval scorers or your own custom scorers. Evals produce a score for each example, and you can run multiple scorers on the same examples.
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

client = JudgmentClient()

example = Example(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
)

scorer = FaithfulnessScorer(threshold=0.5)
results = client.run_evaluation(
    examples=[example],
    scorers=[scorer],
    model="gpt-4.1",
)
print(results)
Congratulations! Your evaluation should have passed. Let's break down what happened.
- The variable input mimics an input, and actual_output is a placeholder for what your agent system returns based on that input.
- The variable retrieval_context (Python) or context (Typescript) represents the retrieved context from your RAG knowledge base.
- FaithfulnessScorer(threshold=0.5) is a scorer that checks if the output is hallucinated relative to the retrieved context. The threshold is used in the context of unit testing.
- We chose gpt-4.1 as our judge model to measure faithfulness. Judgment Labs offers ANY judge model for your evaluation needs. Consider trying out our state-of-the-art Osiris judge models for your next evaluation!
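As noted above, you can attach multiple scorers to the same examples in a single call. Here is a minimal sketch, assuming AnswerRelevancyScorer is also available in judgeval.scorers alongside FaithfulnessScorer (swap in whichever scorers fit your use case):
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer, AnswerRelevancyScorer

client = JudgmentClient()

example = Example(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
)

# Each scorer produces its own score for the same example
results = client.run_evaluation(
    examples=[example],
    scorers=[FaithfulnessScorer(threshold=0.5), AnswerRelevancyScorer(threshold=0.5)],
    model="gpt-4.1",
)
print(results)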
To learn more about using the Judgment Client to run evaluations, click here.
Did you know that you can also run online evals on your traces in production? Click here to learn more.
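As a rough sketch of what that can look like, you attach a scorer inside an observed span so the score lands on the production trace. The async_evaluate call and its parameters below are assumptions for illustration, not an API confirmed by this guide; check the tracing docs for the exact interface in your judgeval version:
from judgeval.common.tracer import Tracer, wrap
from judgeval.scorers import AnswerRelevancyScorer
from openai import OpenAI

client = wrap(OpenAI())
judgment = Tracer(project_name="my_project")

@judgment.observe(span_type="function")
def answer_question(question: str) -> str:
    res = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": question}],
    )
    answer = res.choices[0].message.content
    # Assumed API: scores this span in the background and attaches
    # the result to the trace on the Judgment Platform
    judgment.async_evaluate(
        scorers=[AnswerRelevancyScorer(threshold=0.5)],
        input=question,
        actual_output=answer,
        model="gpt-4.1",
    )
    return answer

answer_question("What if these shoes don't fit?")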
Unit Testing
Judgeval natively integrates with pytest to enable you to use evals as unit tests in your CI pipelines. Here's an example of a simple tool calling agent that we can use to test our tool calling accuracy:
from judgeval.tracer import Tracer

judgment = Tracer(project_name="my_agent")

class MyAgent:  # sample agent, replace with your own
    @judgment.observe(span_type="tool")
    def get_attractions(self, destination: str) -> str:
        """Get attractions for a destination"""
        pass

    @judgment.observe(span_type="tool")
    def get_weather(self, destination: str, start_date: str, end_date: str) -> str:
        """Get weather forecast for a destination"""
        pass

    @judgment.observe(span_type="function")
    def run_agent(self, prompt: str) -> str:
        """Run the agent with the given prompt"""
        pass
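The methods above are stubs. For the unit test below to exercise real tool calls, run_agent has to actually invoke the tools, so here is a deliberately hard-coded stand-in (the class name and return strings are illustrative only; a real agent would use an LLM to choose tools and parameters):
class MyHardcodedAgent:  # illustrative stand-in, not a real agent
    @judgment.observe(span_type="tool")
    def get_attractions(self, destination: str) -> str:
        return f"Top attractions in {destination}: ..."

    @judgment.observe(span_type="tool")
    def get_weather(self, destination: str, start_date: str, end_date: str) -> str:
        return f"Forecast for {destination}, {start_date} to {end_date}: ..."

    @judgment.observe(span_type="function")
    def run_agent(self, prompt: str) -> str:
        # Parameters are hard-coded here; a real agent would parse them from the prompt
        attractions = self.get_attractions("Paris")
        weather = self.get_weather("Paris", "2025-06-01", "2025-06-02")
        return f"{attractions}\n{weather}"
If you pass MyHardcodedAgent() into the test below in place of MyAgent(), the produced tool-call sequence will match the expected one.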
The code below will invoke your agent to test whether the right tools are called given the input prompt.
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import ToolOrderScorer

client = JudgmentClient()

# Define example with expected tool-calling sequence
example = Example(
    input={"prompt": "What's the attraction and weather in Paris for early June?"},
    expected_tools=[
        {
            "tool_name": "get_attractions",
            "parameters": {
                "destination": "Paris"
            }
        },
        {
            "tool_name": "get_weather",
            "parameters": {
                "destination": "Paris",
                "start_date": "2025-06-01",
                "end_date": "2025-06-02"
            }
        }
    ]
)

scorer = ToolOrderScorer(exact_match=True)
agent = MyAgent()  # replace with your agent

results = client.assert_test(
    examples=[example],
    scorers=[scorer],
    function=agent.run_agent  # replace with your agent's method you want to test
)
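Since assert_test is designed to fail the run when a scorer does not pass, you can drop it straight into a pytest test function and let your CI surface the failure. A minimal sketch (the file and test names are illustrative, and the MyAgent import assumes it lives in a module of your own):
# test_agent_tools.py -- run with `pytest`
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import ToolOrderScorer

from my_agent import MyAgent  # hypothetical module containing your agent

def test_tool_calling_order():
    client = JudgmentClient()
    example = Example(
        input={"prompt": "What's the attraction and weather in Paris for early June?"},
        expected_tools=[
            {"tool_name": "get_attractions", "parameters": {"destination": "Paris"}},
            {
                "tool_name": "get_weather",
                "parameters": {
                    "destination": "Paris",
                    "start_date": "2025-06-01",
                    "end_date": "2025-06-02",
                },
            },
        ],
    )
    # assert_test fails the test if the scorer does not pass,
    # so pytest reports it like any other failing assertion
    client.assert_test(
        examples=[example],
        scorers=[ToolOrderScorer(exact_match=True)],
        function=MyAgent().run_agent,
    )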
Next Steps
Congratulations! You've just finished getting started with judgeval and the Judgment Platform.
For a deeper dive into using judgeval, learn more about evals, unit testing, and tracing!