Experiment Comparisons
Learn how to A/B test changes in your LLM workflows using experiment comparisons.
Introduction
Experiment comparisons allow you to systematically A/B test changes in your LLM workflows. Whether you're testing different prompts, models, or architectures, Judgment helps you compare results across experiments to make data-driven decisions about your LLM systems.
Creating Your First Comparison
Let's walk through how to create and run experiment comparisons:
```python
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import AnswerCorrectnessScorer

client = JudgmentClient()

# Define your test examples
examples = [
    Example(
        input="What is the capital of France?",
        actual_output="Paris is the capital of France.",
        expected_output="Paris"
    ),
    Example(
        input="What is the capital of Japan?",
        actual_output="Tokyo is the capital of Japan.",
        expected_output="Tokyo"
    )
]

# Define your scorer
scorer = AnswerCorrectnessScorer(threshold=0.7)

# Run first experiment with GPT-4
experiment_1 = client.run_evaluation(
    examples=examples,
    scorers=[scorer],
    model="gpt-4",
    project_name="capital_cities",
    eval_name="gpt4_experiment"
)

# Run second experiment with a different model
experiment_2 = client.run_evaluation(
    examples=examples,
    scorers=[scorer],
    model="gpt-3.5-turbo",
    project_name="capital_cities",
    eval_name="gpt35_experiment"
)
```
After running the code above, click the View Results link to open your experiment run on the Judgment Platform.
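You can also inspect the results directly in code. The sketch below assumes `run_evaluation` returns one result per example and that each result exposes a `success` flag; check your judgeval version for the exact return type and attribute names.

```python
# A minimal sketch: pair each example with its result and print the outcome.
# (`result.success` is an assumption; verify the field name in your version.)
for example, result in zip(examples, experiment_1):
    print(f"Input:  {example.input}")
    print(f"Passed: {result.success}")
```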
Analyzing Results
Once your experiments are complete, you can compare them on the Judgment Platform:
- You'll be automatically directed to your Experiment page. Here you'll see your latest experiment results and a "Compare" button.
- Click the "Compare" button to navigate to the Experiments page. Here you can select a previous experiment to compare against your current results.
- After selecting an experiment, you'll return to the Experiment page with both experiments' results displayed side by side.
- For detailed insights, click on any row in the comparison table to see specific metrics and analysis.
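If you prefer to compare experiments outside the UI, you can do a quick side-by-side in code. This is a minimal sketch, assuming each returned result carries a `success` flag (an assumption; adjust to your judgeval version):

```python
# Compare pass rates between the two experiments.
# Assumes run_evaluation returned a list of per-example results,
# each with a boolean `success` attribute (hypothetical; verify locally).
def pass_rate(results):
    return sum(1 for r in results if r.success) / len(results)

print(f"gpt-4 pass rate:         {pass_rate(experiment_1):.0%}")
print(f"gpt-3.5-turbo pass rate: {pass_rate(experiment_2):.0%}")
```

Pass rate is just one summary; the platform's comparison view breaks results down per example and per scorer.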
Next Steps
- To learn more about creating datasets to run your experiments on, check out our Datasets section