Datasets
Group examples and traces for scalable evaluation workflows.
Use datasets to manage example collections, drive offline tests, and sync test data with the Judgment platform for team collaboration.
Every dataset is schema-enforced: it has a JSON Schema, and every example is validated against it. Fields are typed (string, number, boolean, …). A column may also be declared {"type": "trace"} to hold a trace ID rather than literal data; this is how a dataset references the traces a judge should score.
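To make the idea of typed fields and a trace column concrete, here is a minimal sketch in plain Python. It is not the platform's validator: the field names (question, expected_answer, max_steps, agent_trace), the validate_example helper, and the choice to represent a trace ID as a string are all assumptions made for illustration.

```python
# Illustrative sketch only. The schema layout mirrors JSON Schema, and the
# "trace" type follows the description above; everything else is hypothetical.

SCHEMA = {
    "type": "object",
    "properties": {
        "question": {"type": "string"},
        "expected_answer": {"type": "string"},
        "max_steps": {"type": "number"},
        "agent_trace": {"type": "trace"},  # holds a trace ID, not literal data
    },
    "required": ["question", "agent_trace"],
}

# Python types accepted for each schema type name.
PYTHON_TYPES = {
    "string": (str,),
    "number": (int, float),
    "boolean": (bool,),
    "trace": (str,),  # assumption: a trace ID is stored as a string
}


def validate_example(example: dict, schema: dict) -> list[str]:
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    properties = schema.get("properties", {})

    # Every required field must be present.
    for name in schema.get("required", []):
        if name not in example:
            errors.append(f"missing required field: {name}")

    # Every declared field that is present must have the declared type.
    for name, spec in properties.items():
        if name not in example:
            continue
        expected = spec["type"]
        value = example[name]
        # bool is a subclass of int in Python, so reject it for "number".
        if expected == "number" and isinstance(value, bool):
            ok = False
        else:
            ok = isinstance(value, PYTHON_TYPES[expected])
        if not ok:
            errors.append(f"field {name!r} should be of type {expected}")
    return errors
```

An example such as {"question": "What is 2 + 2?", "agent_trace": "trace-123"} passes this check, while one that omits agent_trace or puts a number in question is reported as invalid. The trace ID shown here is a made-up placeholder.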

Create using the Judgment Platform
Navigate to Datasets
Go to the Datasets tab in the sidebar.
Create your dataset
Click the New Dataset button, give the dataset a name, and define its schema: the typed fields every example must have. Add a column of type trace if your examples should reference a trace for judges to score.


Add examples
Add examples that conform to the schema in any of three ways: from the Datasets page, from the Traces page in Monitoring (to capture a trace into a trace-typed column), or programmatically via the SDK.
Use datasets in offline tests
A dataset is the input to an offline test: pair it with judges in a test config, then run the test to score every example, optionally running your agent fresh on each one. A trace-typed column lets the judges score an existing trace for each example.
See Offline Testing for the full workflow.
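The shape of an offline test described above can be pictured with a small, self-contained sketch. This is a conceptual model in plain Python, not the Judgment SDK or its test config format: run_offline_test, the two toy judges, and the row fields (output, expected_answer, agent_trace) are hypothetical names chosen for illustration.

```python
from typing import Callable, Optional

# Hypothetical shapes: a judge scores one example row with a float,
# and an agent produces an output string for one example row.
Judge = Callable[[dict], float]
Agent = Callable[[dict], str]


def run_offline_test(
    dataset: list[dict],
    judges: dict[str, Judge],
    agent: Optional[Agent] = None,
) -> list[dict]:
    """Score every example with every judge, optionally running the agent first."""
    results = []
    for example in dataset:
        row = dict(example)  # copy so the dataset itself is not mutated
        if agent is not None:
            # Run the agent fresh on this example and let judges see its output.
            row["output"] = agent(row)
        scores = {name: judge(row) for name, judge in judges.items()}
        results.append({"example": example, "scores": scores})
    return results


def exact_match(row: dict) -> float:
    """Toy judge: 1.0 when the agent output equals the expected answer."""
    return 1.0 if row.get("output") == row.get("expected_answer") else 0.0


def has_trace(row: dict) -> float:
    """Toy judge: 1.0 when the row references a trace ID to score."""
    return 1.0 if row.get("agent_trace") else 0.0
```

Calling run_offline_test(dataset, {"exact_match": exact_match, "has_trace": has_trace}) scores the existing rows as they are, while passing agent=some_function first produces a fresh output for each row. In the real workflow the judges and the pairing live in a test config on the platform; the loop here only shows how the pieces relate.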
Next Steps
- SDK Reference - Complete API documentation for managing datasets programmatically
- Behaviors - Automatically tag traces based on agent behavior
- Code Judges (Custom Scorers) - Create custom evaluation logic for your datasets