Datasets¶

Open Datasets in the sidebar to list, create, and configure data for Advanced Training projects. Classic RL projects use gym environments instead and do not use this area.

Categories and types¶

Click New dataset to open a modal with two steps, Dataset and Files. Non-tabular datasets add two more steps after Files: Targets, then Feature Mapping. On the first step you choose Category, then Type:

Category	Types	Typical use
Language	Reasoning, Preference, SFT	LLM-style training with column-mapped text
Other	Tabular, Non-tabular	Supervised learning on tables or file trees (images, folders)

Language types accept a CSV upload or a Hugging Face import. Tabular datasets use the same CSV or Hugging Face sources. Non-tabular datasets can also be uploaded as a directory whose contents you then map to model features.

When the category is Language, a note under the type switcher points out that Preference datasets can also be used for SFT.

Tabular and non-tabular access¶

Tabular and Non-tabular sit under category Other. Both require an Enterprise plan on the organization.

Create — On New dataset, category Other is disabled unless the organization is on an Enterprise plan. Language types stay available on plans that include Advanced Training.
Dataset detail — The Preprocessing tab is shown for tabular and non-tabular datasets only on Enterprise.
Experiments — Advanced Training experiments that use tabular or non-tabular datasets require Enterprise. Reasoning, preference, and SFT datasets do not.

See Plan permissions for the full matrix.
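
The gating rules above can be summarized as a small check. This is only an illustrative sketch, not Arena code: the names `DatasetType`, `ENTERPRISE_ONLY`, and `can_use` are hypothetical, and it assumes an Enterprise plan already includes Advanced Training.

```python
from enum import Enum


class DatasetType(Enum):
    REASONING = "reasoning"
    PREFERENCE = "preference"
    SFT = "sft"
    TABULAR = "tabular"
    NON_TABULAR = "non-tabular"


# Types under category Other, which require an Enterprise plan.
ENTERPRISE_ONLY = {DatasetType.TABULAR, DatasetType.NON_TABULAR}


def can_use(dataset_type, *, has_advanced_training, is_enterprise):
    """Return True if an organization may create or train on this dataset type.

    Assumption: Enterprise implies Advanced Training, so Enterprise-only
    types check the Enterprise flag alone.
    """
    if dataset_type in ENTERPRISE_ONLY:
        return is_enterprise
    return has_advanced_training


# A non-Enterprise organization with Advanced Training can use SFT but not Tabular.
print(can_use(DatasetType.SFT, has_advanced_training=True, is_enterprise=False))
print(can_use(DatasetType.TABULAR, has_advanced_training=True, is_enterprise=False))
```

The Plan permissions page remains the authoritative source; the sketch only restates the three rules listed in this section.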

Dataset detail tabs¶

After you create a dataset, its page has:

Tab	When it appears
Data	Always
Preprocessing	Tabular or non-tabular (Enterprise plan only)

The default tab is Data.

Create flow¶

Dataset — Name (required), description, Category, Type. For Other types, pick Task Type (regression, binary classification, multiclass classification; object detection only for non-tabular).
Files — Language and tabular: upload a CSV or search Hugging Face, pick a config, and confirm. Non-tabular: upload your whole directory in one go — feature folders and targets together.
Targets (non-tabular only) — Select which of the directories you uploaded on the Files step holds your labels/targets: a folder of .tiff masks for object detection, or a single .parquet file (id plus a target/label column) for the other task types. The folder/file can have any name — preprocessing uses your selection, not the name. Arena pre-selects the most likely one, and you confirm the label column here.
Feature Mapping (non-tabular only) — Choose a supported model and map the remaining directories to model features (the targets you selected are excluded automatically), plus optional encoder options.

Non-tabular directory uploads can continue in the background; progress shows in the global upload indicator while you work elsewhere.
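
Before uploading a non-tabular directory, it can help to check locally which top-level entries could serve as targets. The sketch below is a hypothetical pre-flight helper, not how Arena chooses its pre-selection: the function names and the rule "a folder whose files are all .tiff" are assumptions made for illustration.

```python
from pathlib import Path


def expected_target_kind(task_type):
    """Object detection expects a folder of .tiff masks; other task types expect a .parquet file."""
    return "tiff-mask-folder" if task_type == "object detection" else "parquet-targets"


def target_candidates(upload_root):
    """List top-level entries of an upload directory that could hold targets.

    Assumed heuristic: a folder qualifies if it contains files and every
    file ends in .tiff; a top-level file qualifies if it ends in .parquet.
    """
    candidates = []
    for entry in sorted(Path(upload_root).iterdir()):
        if entry.is_dir():
            files = [p for p in entry.iterdir() if p.is_file()]
            if files and all(p.suffix.lower() == ".tiff" for p in files):
                candidates.append((entry.name, "tiff-mask-folder"))
        elif entry.suffix.lower() == ".parquet":
            candidates.append((entry.name, "parquet-targets"))
    return candidates
```

For example, filtering `target_candidates("my_upload")` by `expected_target_kind("object detection")` leaves only mask folders. Because the folder or file can have any name, the check looks at contents and extensions rather than names.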

Data tab by type¶

Each type has its own column or file UI on the Data tab:

Reasoning — Question and Answer
Preference — Prompt, Chosen, Rejected
SFT — Prompt and Target
Tabular — Inputs (one or more) and Target, plus Task Type
Non-tabular — file browser, feature mapping, supported model

Save your mappings before you start an experiment that depends on them.
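
For the Language types, a mapping is complete when every role listed above points at a column that exists in the CSV. The following is a minimal local check under that reading. `REQUIRED_ROLES` and `mapping_problems` are hypothetical names, and the CSV column names in the usage lines are made up for the example.

```python
import csv
import io

# Mapping roles per Language type, taken from the list above.
REQUIRED_ROLES = {
    "reasoning": ("Question", "Answer"),
    "preference": ("Prompt", "Chosen", "Rejected"),
    "sft": ("Prompt", "Target"),
}


def mapping_problems(dataset_type, mapping, csv_text):
    """Return a list of problems; an empty list means every role maps to a real column."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    problems = []
    for role in REQUIRED_ROLES[dataset_type]:
        column = mapping.get(role)
        if not column:
            problems.append(f"{role} is not mapped")
        elif column not in header:
            problems.append(f"{role} maps to missing column {column!r}")
    return problems


# Example CSV with arbitrary column names; the mapping ties them to roles.
sample = "prompt,good,bad\nHi,Hello,Go away\n"
print(mapping_problems("preference", {"Prompt": "prompt", "Chosen": "good"}, sample))
```

Running a check like this before upload catches a missing or misspelled column early, before an experiment depends on the mapping.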

Preprocessing tab¶

For tabular and non-tabular datasets on an Enterprise plan, open the Preprocessing tab and click Preprocess dataset. In the modal, write Encoder Code, pick an Encoder class, and submit with Run Preprocessing. Tabular runs need input and target columns chosen on the Data tab first; Preprocess dataset stays disabled until they are set.

Job status and charts update on the tab while a job runs.

Experiments¶

In an Advanced Training project, the Environment step of the experiment wizard branches on the dataset you attach. Pick a dataset that already has data uploaded and columns (or files) mapped. Type-specific behavior is in the pages linked above.