Environment step¶
Environment is the second step. Here you define what the agent learns against: a gym in classic projects, or a dataset or RL Environment row in Advanced Training projects.
Classic projects¶
Pick an environment from a searchable grid. Columns include Environment, Version, Category, Observation Type, Action Type, and Algorithms (the subset of trainers Arena lists for that env). For custom environments, the row shows Select version, which opens the Select version dialog.
After you confirm, the step shows:
Selected environment: followed by the name and version.
Change environment, which takes you back to the grid.
Custom environments show validation and profiling panels when data exists:
Rendered environment (or a message to re-validate if nothing is shown).
Resource Usage with CPU and RAM per worker, once profiling has finished.
Random episode score distribution for rollout returns.
Stale banners tell you to re-validate when the code has changed since the charts were generated.
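To make the Random episode score distribution panel concrete, here is a minimal sketch of how returns from random rollouts can be collected and summarized. It assumes a Gymnasium-style `reset`/`step` interface; `CoinFlipEnv` and `random_episode_returns` are hypothetical names for illustration and are not part of Arena, and this page does not describe how Arena itself computes the chart.

```python
import random
import statistics


class CoinFlipEnv:
    """Toy stand-in for a custom environment (hypothetical, not an Arena class)."""

    def __init__(self, horizon: int = 10):
        self.horizon = horizon
        self.t = 0
        self.rng = random.Random()

    def reset(self, seed=None):
        # Re-seed the environment's own randomness and restart the step counter.
        self.rng = random.Random(seed)
        self.t = 0
        return 0, {}

    def step(self, action: int):
        # Reward 1.0 when the action matches a hidden coin flip, else 0.0.
        self.t += 1
        reward = 1.0 if action == self.rng.randint(0, 1) else 0.0
        terminated = False
        truncated = self.t >= self.horizon  # episode ends after `horizon` steps
        return 0, reward, terminated, truncated, {}


def random_episode_returns(env, n_episodes: int, seed: int = 0) -> list:
    """Run a uniformly random policy and return the total reward of each episode."""
    policy_rng = random.Random(seed)
    returns = []
    for episode in range(n_episodes):
        env.reset(seed=seed + episode)
        total, done = 0.0, False
        while not done:
            action = policy_rng.randint(0, 1)  # random action, no learning involved
            _, reward, terminated, truncated, _ = env.step(action)
            total += reward
            done = terminated or truncated
        returns.append(total)
    return returns


returns = random_episode_returns(CoinFlipEnv(), n_episodes=200)
print(
    f"mean={statistics.mean(returns):.2f} "
    f"stdev={statistics.stdev(returns):.2f} "
    f"min={min(returns)} max={max(returns)}"
)
```

A distribution like this gives a baseline: a trained agent should score clearly above what random actions achieve.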
Clicking Next without a selection shows the message: Please select an environment before moving to the next step. (Wording varies slightly between classic and advanced flows.)
Advanced Training projects¶
There is no separate Gym / Dataset toggle. You choose a row from one grid:
Dataset rows: types such as reasoning, preference, SFT, tabular, or non-tabular (labels appear in sentence case in the Type column).
RL Environment rows: simulation-style gyms; use Select version, as with custom environments in classic projects.
After a dataset is selected you see Selected dataset: and Change dataset.
Reasoning datasets¶
A two-step sub-flow appears: Required steps: complete both steps before proceeding
Prompts: User prompt, Python f-string, Rendered string, optional Advanced prompting settings, and a sample table with Question and Answer.
Reward Function: tabs Editor and Validation Results; reward.py in the editor; Validate to run checks. On a draft, you need a Calculated reward and passing validation before you can continue with Next.
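The two steps above can be sketched together as follows. This is an illustration under stated assumptions, not Arena's documented interface: the placeholder name `question`, the helper `render_prompt`, and the `reward(completion, answer)` signature are all hypothetical, and the template is filled with `str.format`, which handles f-string-style placeholders stored as text.

```python
import re

# Hypothetical user prompt template; the {question} placeholder name is an assumption.
USER_PROMPT = (
    "Solve the problem and give the final result after 'Answer:'.\n\n"
    "Question: {question}"
)


def render_prompt(template: str, row: dict) -> str:
    """Fill f-string-style placeholders from one dataset row (extra keys are ignored)."""
    return template.format(**row)


def reward(completion: str, answer: str) -> float:
    """Return 1.0 if the text after the last 'Answer:' marker equals the reference, else 0.0.

    The signature is an assumption for illustration only.
    """
    matches = re.findall(r"Answer:\s*(.+)", completion)
    if not matches:
        return 0.0  # no final answer given
    predicted = matches[-1].strip()
    return 1.0 if predicted == answer.strip() else 0.0


# One sample row, mirroring the Question and Answer columns of the sample table.
row = {"question": "What is 6 * 7?", "answer": "42"}

print(render_prompt(USER_PROMPT, row))
print(reward("6 * 7 = 42\nAnswer: 42", row["answer"]))
print(reward("I think it is 41", row["answer"]))
```

An exact-match reward like this is deliberately strict. A real reward.py may also need to normalize formatting (whitespace, units, trailing punctuation) before comparing.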
Other dataset types¶
Preference, SFT, tabular, and non-tabular flows use dataset-specific forms on this step. Tabular and non-tabular types require an Enterprise plan.
View-only (non-draft)¶
Revisiting Environment on a run that has already started skips reward validation. You can read prompts and reward settings without re-running Validate.
Next step¶
Continue to Agent for algorithm and model. See Agent step.
Dataset background: Datasets.