Getting Started
This guide will help you learn the essential components of Judgeval.
Installation
pip install judgeval
Judgeval runs evaluations that you can manage inside the library. Additionally, you can analyze and manage your evaluations, datasets, and metrics on the natively-integrated Judgment Platform, an all-in-one suite for LLM system evaluation.
We frequently update the judgeval package! To get the latest Python version, run pip install --upgrade judgeval. To get the latest TypeScript version, run npm update judgeval. You can follow our latest updates via our GitHub.
Judgment API Keys
Our API keys give you access to the JudgmentClient and Tracer, which enable you to track your agents, run evaluations on Judgment Labs' infrastructure, access our state-of-the-art judge models, and manage your evaluations and datasets on the Judgment Platform.
To get your account and organization API keys, create an account on the Judgment Platform.
export JUDGMENT_API_KEY="your_key_here"
export JUDGMENT_ORG_ID="your_org_id_here"
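If you want to confirm the keys are visible to your process before running anything, a quick sanity check like the following works; it relies only on the variable names from the export commands above.
import os

# Optional sanity check: make sure both variables from the export commands
# above are set before constructing the JudgmentClient or Tracer.
for var in ("JUDGMENT_API_KEY", "JUDGMENT_ORG_ID"):
    if not os.environ.get(var):
        raise RuntimeError(f"{var} is not set")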
Create Your First Experiment
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer
client = JudgmentClient()
example = Example(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
)
scorer = FaithfulnessScorer(threshold=0.5)
results = client.run_evaluation(
    examples=[example],
    scorers=[scorer],
    model="gpt-4o",
)
print(results)
Congratulations! Your evaluation should have passed. Let's break down what happened.
- The variable input mimics a user input and actual_output is a placeholder for what your LLM system returns based on the input.
- The variable retrieval_context (Python) or context (TypeScript) represents the retrieved context from your RAG knowledge base.
- FaithfulnessScorer(threshold=0.5) is a scorer that checks if the output is hallucinated relative to the retrieved context. The threshold is used in the context of unit testing (see the sketch after this list).
- We chose gpt-4o as our judge model to measure faithfulness. Judgment Labs offers any judge model for your evaluation needs. Consider trying out our state-of-the-art Osiris judge models for your next evaluation!
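Because every scorer carries a threshold, the same run can double as a unit test in CI. Here is a minimal sketch, assuming run_evaluation returns a list of per-example results and that each result exposes a success flag (an assumption; check the result objects in your installed version for the exact field name):
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

def test_faithfulness():
    client = JudgmentClient()
    example = Example(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra cost.",
        retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
    )
    results = client.run_evaluation(
        examples=[example],
        scorers=[FaithfulnessScorer(threshold=0.5)],
        model="gpt-4o",
    )
    # Fail the test if any example scored below the 0.5 threshold.
    # `success` is an assumed attribute name on the result objects.
    assert all(getattr(r, "success", False) for r in results)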
Create Your First Trace
judgeval traces enable you to monitor your LLM systems in online development and production stages.
Traces enable you to track your LLM system's flow end-to-end and measure:
- LLM costs
- Workflow latency
- Quality metrics, such as hallucination, retrieval quality, and more.
from judgeval.common.tracer import Tracer, wrap
from openai import OpenAI
client = wrap(OpenAI())
judgment = Tracer(project_name="my_project")
@judgment.observe(span_type="tool")
def my_tool():
return "Hello world!"
@judgment.observe(span_type="function")
def main():
task_input = my_tool()
res = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": f"{task_input}"}]
)
return res.choices[0].message.content
# Calling the observed function implicitly starts and saves the trace
main()
Congratulations! You've just created your first trace. It should look like this:
There are many benefits of monitoring your LLM systems with judgeval tracing, including:
- Debugging LLM workflows in seconds with full observability
- Using production workflow data to create experimental datasets for future improvement/optimization
- Tracking and creating Slack/Email alerts on any metric (e.g. latency, cost, hallucination, etc.)
To learn more about judgeval's tracing module, click here.
Automatic Deep Tracing
Judgeval supports automatic deep tracing, which significantly reduces the amount of instrumentation needed in your code. With deep tracing enabled (which is the default), you only need to observe top-level functions, and all nested function calls will be automatically traced.
from judgeval.tracer import Tracer, wrap
from openai import OpenAI
client = wrap(OpenAI())
judgment = Tracer(project_name="my_project")
# Define a function that will be automatically traced when called from main
def helper_function():
    return "This will be traced automatically"

# Only need to observe the top-level function
@judgment.observe(span_type="function")
def main():
    # helper_function will be automatically traced without @observe
    result = helper_function()
    res = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": result}]
    )
    return res.choices[0].message.content
main()
To disable deep tracing, initialize the tracer with deep_tracing=False. You can still name and declare span types for each function using judgment.observe().
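For example, here is the earlier snippet rewritten with deep tracing disabled; every function you want to appear in the trace now needs its own @judgment.observe decorator. This is a minimal sketch reusing the imports and span types shown above.
from judgeval.tracer import Tracer, wrap
from openai import OpenAI

client = wrap(OpenAI())
# Deep tracing disabled: only explicitly observed functions create spans.
judgment = Tracer(project_name="my_project", deep_tracing=False)

@judgment.observe(span_type="tool")
def helper_function():
    return "Observed explicitly because deep tracing is off"

@judgment.observe(span_type="function")
def main():
    result = helper_function()
    res = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": result}]
    )
    return res.choices[0].message.content

main()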
Create Your First Online Evaluation
In addition to tracing, judgeval allows you to run online evaluations on your LLM systems. This enables you to:
- Catch real-time quality regressions to take action before customers are impacted
- Gain insights into your agent performance in real-world scenarios
To run an online evaluation, you can simply add one line of code to your existing trace:
from judgeval.common.tracer import Tracer, wrap
from judgeval.scorers import AnswerRelevancyScorer
from judgeval.data import Example
from openai import OpenAI
client = wrap(OpenAI())
judgment = Tracer(project_name="my_project")
@judgment.observe(span_type="tool")
def my_tool():
return "Hello world!"
@judgment.observe(span_type="function")
def main():
task_input = my_tool()
res = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": f"{task_input}"}]
).choices[0].message.content
example = Example(
input=task_input,
actual_output=res
)
# In Python, this likely operates on the implicit trace context
judgment.async_evaluate(
scorers=[AnswerRelevancyScorer(threshold=0.5)],
example=example,
model="gpt-4o"
)
return res
main()
Online evaluations are automatically logged to the Judgment Platform as part of your traces. You can view them by navigating to your trace and clicking on the trace span that contains the online evaluation. If there is a quality regression, the UI will display an alert, like this:
Optimizing Your LLM System
Evaluation and monitoring are the building blocks for optimizing LLM systems. Measuring the quality of your LLM workflows allows you to compare design iterations and ultimately find the optimal set of prompts, models, RAG architectures, etc. that make your LLM excel in your production use cases.
A typical experimental setup might look like this:
- Create a new Project in the Judgment Platform, either by running an evaluation from the SDK or via the platform UI. This will help you keep track of all evaluations and traces for different iterations of your LLM system.
- Create separate Experiments for different iterations of your LLM system, allowing you to independently test each component.
- Try different models (e.g. gpt-4o, claude-3-5-sonnet, etc.) and prompt templates in each Experiment to find the optimal setup for your LLM system (see the sketch below).
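As a minimal sketch of that comparison loop, assuming you generate answers with two candidate models for the system under test and score each with the same scorer; the candidate model names and the system prompt here are illustrative only.
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer
from openai import OpenAI

openai_client = OpenAI()
judgment_client = JudgmentClient()

question = "What if these shoes don't fit?"
retrieval_context = ["All customers are eligible for a 30 day full refund at no extra cost."]

# Candidate setups for the system under test (illustrative model names).
for candidate_model in ["gpt-4o", "gpt-4o-mini"]:
    answer = openai_client.chat.completions.create(
        model=candidate_model,
        messages=[
            {"role": "system", "content": f"Answer using only this context: {retrieval_context[0]}"},
            {"role": "user", "content": question},
        ],
    ).choices[0].message.content

    results = judgment_client.run_evaluation(
        examples=[Example(input=question, actual_output=answer, retrieval_context=retrieval_context)],
        scorers=[FaithfulnessScorer(threshold=0.5)],
        model="gpt-4o",  # judge model, as in the first experiment above
    )
    print(candidate_model, results)
Comparing the results across candidates, either in the printed output or side by side on the Judgment Platform, shows which setup holds up best against the same scorer and data.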
Next Steps
Congratulations! You've just finished getting started with judgeval and the Judgment Platform.
For a deeper dive into using judgeval, learn more about experiments, unit testing, and monitoring!