Datasets

Group examples and traces for scalable evaluation workflows.

Use datasets to manage collections of examples or traces, run batch evaluations, and sync your test data with the Judgment platform so your team can collaborate on it.

Manage using the SDK

You can create and manage datasets with the Python SDK, which supports creating and retrieving datasets, adding examples to them, and exporting them.

Create using the Judgment Platform

Go to the Datasets tab in the sidebar.

Create your dataset

Click the New Dataset button and select the data type to store:

  • Example datasets store key-value data pairs (e.g. input and output)
  • Trace datasets store full trace data
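To make the "key-value data pairs" shape concrete, here is a minimal sketch of example records as plain Python dictionaries, written to and read back from a local JSONL file. This uses only the standard library and is not the Judgment SDK: the field names `input` and `output` come from the description above, while the sample values and the helper names `save_jsonl` and `load_jsonl` are hypothetical and exist only for illustration.

```python
import json
import tempfile
from pathlib import Path

# Illustrative records only: each example is a flat set of key-value pairs.
examples = [
    {"input": "What is the capital of France?", "output": "Paris"},
    {"input": "What is 2 + 2?", "output": "4"},
]


def save_jsonl(records, path):
    """Write one JSON object per line (hypothetical helper, not an SDK call)."""
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_jsonl(path):
    """Read records back, skipping blank lines (hypothetical helper)."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Round-trip the records through a temporary file to check nothing is lost.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "examples.jsonl"
    save_jsonl(examples, path)
    restored = load_jsonl(path)
```

Keeping examples in a flat, line-oriented file like this can make them easy to review and version before they are added to a dataset.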

Add data to the dataset

  • You can add data to Example datasets from:
    • Test page (Example Test type)
  • You can add data to Trace datasets from:
    • Test page (Trace Test type)
    • Traces page in Monitoring

Exporting Datasets

You can export your datasets from the Judgment Platform UI for backup purposes, sharing with team members, or publishing to HuggingFace Hub.

Export to HuggingFace

You can export your datasets directly to HuggingFace Hub by configuring the HUGGINGFACE_ACCESS_TOKEN secret in your organization settings.

Steps to set up HuggingFace export:

  1. Navigate to Settings > Secrets for your organization
  2. Find the HUGGINGFACE_ACCESS_TOKEN secret and click the edit icon

  3. Enter your HuggingFace access token
  4. Once configured, navigate to your dataset in the platform
  5. Click the "Export Dataset to HF" button in the top right to export your dataset to HuggingFace Hub

You can generate a HuggingFace access token from your HuggingFace settings. Make sure the token has write permissions to create and update datasets.
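Before saving the token as a secret, you may want to confirm that it has write access. The sketch below is one possible check using the `huggingface_hub` package, which is separate from the Judgment platform. It assumes the `whoami()` response reports the token role under `auth.accessToken.role`; the helper names `token_role`, `can_write`, and `check_token` are hypothetical. Fine-grained tokens report a different role value and would need their per-repository permissions checked separately.

```python
def token_role(whoami_info):
    """Return the role recorded for an access token, or None if it is absent.

    Assumes the response nests the role under auth -> accessToken -> role.
    """
    auth = whoami_info.get("auth", {})
    return auth.get("accessToken", {}).get("role")


def can_write(whoami_info):
    """True only when the token reports the classic "write" role."""
    return token_role(whoami_info) == "write"


def check_token(token):
    """Query HuggingFace Hub for the token's identity (needs network access)."""
    # Imported lazily so the helpers above work without the package installed.
    from huggingface_hub import HfApi

    return can_write(HfApi(token=token).whoami())
```

For example, `check_token(os.environ["HUGGINGFACE_ACCESS_TOKEN"])` (after `import os`) would return `True` for a classic write token, assuming the variable is set in your local environment.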

Next Steps

  • SDK Reference - Complete API documentation for managing datasets programmatically
  • Behaviors - Automatically tag traces based on agent behavior
  • Custom Scorers - Create custom evaluation logic for your datasets