
Datasets

Overview

In most scenarios, you will have multiple Examples that you want to evaluate together. Both judgeval (Python) and judgeval-js (TypeScript) provide an EvalDataset class to manage collections of Examples, letting you scale evaluations and offering equivalent functionality for saving, loading, and synchronizing datasets with the Judgment platform.

Creating a Dataset

Creating an EvalDataset is straightforward in both languages. You can initialize it with a list (Python) or array (TypeScript) of Examples.

from judgeval.data import Example
from judgeval.data.datasets import EvalDataset

examples = [
    Example(input="Question 1?", actual_output="Answer 1."), 
    Example(input="Question 2?", actual_output="Answer 2."), 
    # ... more examples
]

dataset = EvalDataset(
    examples=examples
)

You can also add Examples to an existing EvalDataset.

from judgeval.data import Example
# Assume dataset = EvalDataset([...]) exists

dataset.add_example(Example(input="Question 3?", actual_output="Answer 3."))
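
If your examples originate in another data structure, you can build a dataset in a loop. A minimal sketch, assuming your records are plain dicts with hypothetical question and answer keys:

from judgeval.data import Example
from judgeval.data.datasets import EvalDataset

# Hypothetical source records; replace with your own data.
records = [
    {"question": "Question 3?", "answer": "Answer 3."},
    {"question": "Question 4?", "answer": "Answer 4."},
]

dataset = EvalDataset()
for record in records:
    # Map each record onto an Example before adding it.
    dataset.add_example(Example(input=record["question"], actual_output=record["answer"]))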

Saving/Loading Datasets

Both libraries support saving and loading EvalDataset objects locally, as well as pushing them to and pulling them from the Judgment Platform.

Local Formats:

  • JSON
  • CSV
  • YAML

Remote:

  • Judgment Platform

To/From Judgment Platform

You can push your local EvalDataset to the Judgment platform or pull an existing one.

# Saving (Pushing)
from judgeval import JudgmentClient
from judgeval.data.datasets import EvalDataset
# Assume dataset = EvalDataset(...) exists and is filled with examples

client = JudgmentClient()
client.push_dataset(alias="my_dataset", dataset=dataset, project_name="my_project")

# Loading (Pulling)
pulled_dataset = client.pull_dataset(alias="my_dataset", project_name="my_project")

From JSON

Your JSON file should have a top-level examples key containing an array of example objects (using snake_case keys).

{
    "examples": [
        {
            "input": "...", 
            "actual_output": "..."
        }, 
        ...
    ]
}

Here's how to save/load from JSON in both languages.

from judgeval.data.datasets import EvalDataset

# saving
dataset = EvalDataset(...)  # filled with examples
dataset.save_as("json", "/path/to/save/dir", "save_name")

# loading
new_dataset = EvalDataset()
new_dataset.add_from_json("/path/to/your/json/file.json")
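
A quick round trip is an easy way to sanity-check a saved file. A minimal sketch, assuming save_as writes a file named save_name.json inside the target directory (the exact naming convention is an assumption):

import os

from judgeval.data import Example
from judgeval.data.datasets import EvalDataset

dataset = EvalDataset(examples=[Example(input="Question 1?", actual_output="Answer 1.")])
dataset.save_as("json", "/tmp/judgeval_datasets", "qa_pairs")

# Assumption: save_as produces /tmp/judgeval_datasets/qa_pairs.json
reloaded = EvalDataset()
reloaded.add_from_json(os.path.join("/tmp/judgeval_datasets", "qa_pairs.json"))
assert len(reloaded.examples) == len(dataset.examples)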

From CSV

Your CSV should contain rows that can be mapped to Examples via column headers matching the Example's snake_case field names (for example, input and actual_output).

from judgeval.data.datasets import EvalDataset

# saving
dataset = EvalDataset(...)  # filled with examples
dataset.save_as("csv", "/path/to/save/dir", "save_name")

# loading
new_dataset = EvalDataset()
new_dataset.add_from_csv("/path/to/your/csv/file.csv")
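
For reference, a compatible CSV file might look like this, with a header row matching the Example's snake_case field names (which columns you include depends on the fields your Examples use):

input,actual_output,expected_output
Question 1?,Answer 1.,Expected answer 1.
Question 2?,Answer 2.,Expected answer 2.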

From YAML

Your YAML file should have a top-level examples key containing a list of example objects (using snake_case keys).

examples:
  - input: ...
    actual_output: ...
    expected_output: ...

from judgeval.data.datasets import EvalDataset

# saving
dataset = EvalDataset(...)  # filled with examples
dataset.save_as("yaml", "/path/to/save/dir", "save_name")

# loading
new_dataset = EvalDataset()
new_dataset.add_from_yaml("/path/to/your/yaml/file.yaml")

Evaluate On Your Dataset / Examples

You can use the JudgmentClient in either library to evaluate a collection of Examples using scorers. Pass either an EvalDataset object (Python) or an array of Example objects (TypeScript) to the respective evaluation methods.

from judgeval import JudgmentClient
from judgeval.scorers import FaithfulnessScorer

client = JudgmentClient()
dataset = client.pull_dataset(alias="my_dataset", project_name="my_project")

res = client.run_evaluation(
    examples=dataset.examples,
    scorers=[FaithfulnessScorer(threshold=0.9)],
    model="gpt-4o",
)
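
The same call works with a dataset loaded from a local file instead of the platform. A minimal sketch reusing the YAML loading shown above (the file path is a placeholder):

from judgeval import JudgmentClient
from judgeval.data.datasets import EvalDataset
from judgeval.scorers import FaithfulnessScorer

client = JudgmentClient()

# Load a dataset saved earlier (see "From YAML" above).
local_dataset = EvalDataset()
local_dataset.add_from_yaml("/path/to/your/yaml/file.yaml")

res = client.run_evaluation(
    examples=local_dataset.examples,
    scorers=[FaithfulnessScorer(threshold=0.9)],
    model="gpt-4o",
)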

Exporting Datasets

You can export your datasets from the Judgment Platform UI for backup purposes or sharing with team members.

Export from Platform UI

  1. Navigate to your project in the Judgment Platform
  2. Select the dataset you want to export
  3. Click the "Download Dataset" button in the top right
  4. The dataset will be downloaded as a JSON file

The exported JSON file contains the complete dataset information, including metadata and examples:

{
  "dataset_id": "f852eeee-87fa-4430-9571-5784e693326e",
  "organization_id": "0fbb0aa8-a7b3-4108-b92a-cc6c6800d825",
  "dataset_alias": "QA-Pairs",
  "comments": null,
  "source_file": null,
  "created_at": "2025-04-23T22:38:11.709763+00:00",
  "is_sequence": false,
  "examples": [
    {
      "example_id": "119ee1f6-1046-41bc-bb89-d9fc704829dd",
      "input": "How can I start meditating?",
      "actual_output": null,
      "expected_output": "Meditation is a wonderful way to relax and focus...",
      "context": null,
      "retrieval_context": null,
      "additional_metadata": {
        "synthetic": true
      },
      "tools_called": null,
      "expected_tools": null,
      "name": null,
      "created_at": "2025-04-23T23:34:33.117479+00:00",
      "dataset_id": "f852eeee-87fa-4430-9571-5784e693326e",
      "eval_results_id": null,
      "sequence_id": null,
      "sequence_order": 0
    },
    // more examples...
  ]
}

Each example in the dataset contains:

  • example_id: Unique identifier for the example
  • input: The input query or prompt
  • actual_output: The response from your agent (if any)
  • expected_output: The expected response or ground truth
  • context: Additional context for the example
  • retrieval_context: Retrieved context used for RAG systems
  • additional_metadata: Custom metadata (e.g., whether the example is synthetic)
  • tools_called: Record of tools used in the response
  • expected_tools: Expected tool calls for the example
  • created_at: Timestamp of example creation
  • sequence_order: Order in sequence (if part of a sequence)

When downloading datasets that contain sensitive information, make sure to follow your organization's data handling policies and store the exported files in secure locations.
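
Because the export keeps all examples under a top-level examples key, you can rebuild a local EvalDataset from an exported file. A minimal sketch that copies only the core fields and drops the platform-specific metadata (it assumes Example accepts the snake_case fields shown above, including None values):

import json

from judgeval.data import Example
from judgeval.data.datasets import EvalDataset

with open("/path/to/exported/dataset.json") as f:
    exported = json.load(f)

dataset = EvalDataset()
for ex in exported["examples"]:
    # Keep only the core Example fields; platform metadata
    # (example_id, dataset_id, timestamps, ...) is dropped.
    dataset.add_example(
        Example(
            input=ex["input"],
            actual_output=ex["actual_output"],
            expected_output=ex["expected_output"],
        )
    )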

Conclusion

Congratulations! 🎉

You've now learned how to create, save, load, and evaluate datasets using both the Python (judgeval) and TypeScript (judgeval-js) libraries.

You can also view and manage your datasets via the Judgment platform.