Datasets
Overview
In most scenarios, you will have multiple Examples that you want to evaluate together. Both judgeval (Python) and judgeval-js (TypeScript) provide an EvalDataset class to manage collections of Examples. These classes let you scale evaluations and offer the same functionality for saving, loading, and synchronizing datasets with the Judgment platform.
Creating a Dataset
Creating an EvalDataset is straightforward in both languages. You can initialize it with a list (Python) or array (TypeScript) of Examples.
from judgeval.data import Example
from judgeval.data.datasets import EvalDataset

examples = [
    Example(input="Question 1?", actual_output="Answer 1."),
    Example(input="Question 2?", actual_output="Answer 2."),
    # ... more examples
]

dataset = EvalDataset(
    examples=examples
)
You can also add Examples to an existing EvalDataset.
from judgeval.data import Example
# Assume dataset = EvalDataset([...]) exists
dataset.add_example(Example(input="Question 3?", actual_output="Answer 3."))
Saving/Loading Datasets
Both libraries support saving and loading EvalDataset objects locally and interacting with the Judgment Platform.
Local Formats:
- JSON
- CSV
- YAML
Remote:
- Judgment Platform
From Judgment Platform
You can push your local EvalDataset to the Judgment platform or pull an existing one.
# Saving (Pushing)
from judgeval import JudgmentClient
from judgeval.data.datasets import EvalDataset

# Assume dataset = EvalDataset(...) exists
client = JudgmentClient()
client.push_dataset(alias="my_dataset", dataset=dataset, project_name="my_project")

# Loading (Pulling)
pulled_dataset = client.pull_dataset(alias="my_dataset", project_name="my_project")
From JSON
Your JSON file should have a top-level examples key containing an array of example objects (using snake_case keys).
{
  "examples": [
    {
      "input": "...",
      "actual_output": "..."
    },
    ...
  ]
}
Here's how to save and load a dataset as JSON.
from judgeval.data.datasets import EvalDataset
# saving
dataset = EvalDataset(...) # filled with examples
dataset.save_as("json", "/path/to/save/dir", "save_name")
# loading
new_dataset = EvalDataset()
new_dataset.add_from_json("/path/to/your/json/file.json")
From CSV
Your CSV should contain rows that can be mapped to Examples via column names (typically snake_case). When loading, you may need to provide a mapping from the Example field names (camelCase in TypeScript) to your CSV header names; see the sketch after the code below for a manual alternative.
from judgeval.data.datasets import EvalDataset
# saving
dataset = EvalDataset(...) # filled with examples
dataset.save_as("csv", "/path/to/save/dir", "save_name")
# loading
new_dataset = EvalDataset()
new_dataset.add_from_csv("/path/to/your/csv/file.csv")
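If your CSV headers don't line up with the Example field names, you can also build the dataset yourself with Python's standard csv module and map the columns explicitly. This is a minimal sketch; the file path and the question/answer column names are hypothetical.
import csv

from judgeval.data import Example
from judgeval.data.datasets import EvalDataset

dataset = EvalDataset()
with open("/path/to/your/csv/file.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Hypothetical CSV columns "question" and "answer" mapped onto Example fields
        dataset.add_example(
            Example(input=row["question"], actual_output=row["answer"])
        )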
From YAML
Your YAML file should have a top-level examples key containing a list of example objects (using snake_case keys).
examples:
  - input: ...
    actual_output: ...
    expected_output: ...
from judgeval.data.datasets import EvalDataset
# saving
dataset = EvalDataset(...) # filled with examples
dataset.save_as("yaml", "/path/to/save/dir", "save_name")
# loading
new_dataset = EvalDataset()
new_dataset.add_from_yaml("/path/to/your/yaml/file.yaml")
Evaluate On Your Dataset / Examples
You can use the JudgmentClient in either language to evaluate a collection of Examples using scorers. You can pass either an EvalDataset object (Python) or an array of Example objects (TypeScript) to the respective evaluation methods.
from judgeval import JudgmentClient
from judgeval.scorers import FaithfulnessScorer

# Assume client = JudgmentClient() exists
# Assume dataset = client.pull_dataset(alias="my_dataset", project_name="my_project") exists
res = client.run_evaluation(
    examples=dataset.examples,
    scorers=[FaithfulnessScorer(threshold=0.9)],
    model="gpt-4o",
)
Exporting Datasets
You can export your datasets from the Judgment Platform UI for backup purposes or sharing with team members.
Export from Platform UI
- Navigate to your project in the Judgment Platform
- Select the dataset you want to export
- Click the "Download Dataset" button in the top right
- The dataset will be downloaded as a JSON file
The exported JSON file contains the complete dataset information, including metadata and examples:
{
  "dataset_id": "f852eeee-87fa-4430-9571-5784e693326e",
  "organization_id": "0fbb0aa8-a7b3-4108-b92a-cc6c6800d825",
  "dataset_alias": "QA-Pairs",
  "comments": null,
  "source_file": null,
  "created_at": "2025-04-23T22:38:11.709763+00:00",
  "is_sequence": false,
  "examples": [
    {
      "example_id": "119ee1f6-1046-41bc-bb89-d9fc704829dd",
      "input": "How can I start meditating?",
      "actual_output": null,
      "expected_output": "Meditation is a wonderful way to relax and focus...",
      "context": null,
      "retrieval_context": null,
      "additional_metadata": {
        "synthetic": true
      },
      "tools_called": null,
      "expected_tools": null,
      "name": null,
      "created_at": "2025-04-23T23:34:33.117479+00:00",
      "dataset_id": "f852eeee-87fa-4430-9571-5784e693326e",
      "eval_results_id": null,
      "sequence_id": null,
      "sequence_order": 0
    },
    // more examples...
  ]
}
Each example in the dataset contains:
- example_id: Unique identifier for the example
- input: The input query or prompt
- actual_output: The response from your agent (if any)
- expected_output: The expected response or ground truth
- context: Additional context for the example
- retrieval_context: Retrieved context used for RAG systems
- additional_metadata: Custom metadata (e.g., whether the example is synthetic)
- tools_called: Record of tools used in the response
- expected_tools: Expected tool calls for the example
- created_at: Timestamp of example creation
- sequence_order: Order in sequence (if part of a sequence)
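If you want to work with an exported file outside the platform, you can parse it and rebuild an EvalDataset from its examples array. This is a minimal sketch assuming the export format shown above; the file path is hypothetical and only the fields confirmed earlier are mapped.
import json

from judgeval.data import Example
from judgeval.data.datasets import EvalDataset

with open("/path/to/exported_dataset.json") as f:
    exported = json.load(f)

dataset = EvalDataset()
for item in exported["examples"]:
    # Map the exported fields onto the Example constructor used earlier;
    # other fields from the list above can be added the same way.
    dataset.add_example(
        Example(input=item["input"], actual_output=item["actual_output"])
    )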
Conclusion
Congratulations! 🎉
You've now learned how to create, save, load, and evaluate datasets using both the Python (judgeval) and TypeScript (judgeval-js) libraries. You can also view and manage your datasets via the Judgment platform.