Training step

Step four is Training. Here you set how long training runs and the core hyperparameters (batch size, discount factor, learning rates, and algorithm-specific options).

Classic projects

Accordion sections group fields:

  • Training

  • Environment

  • Replay Buffer (hidden for the on-policy algorithms PPO, IPPO, and Recurrent PPO; MADDPG and MATD3 show a multi-agent replay buffer section instead)

  • Epsilon (DQN only)
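
The visibility rules in the list above can be sketched as a small lookup. This is only an illustration of the rules as described here, not the product's implementation: the function name, the algorithm strings, and the "Multi-Agent Replay Buffer" label are assumptions.

```python
# Hypothetical sketch: names and labels are assumptions, only the
# visibility rules themselves come from the list above.
ON_POLICY = {"PPO", "IPPO", "Recurrent PPO"}
MULTI_AGENT_BUFFER = {"MADDPG", "MATD3"}


def visible_sections(algorithm: str) -> list[str]:
    """Return the accordion sections shown for a Classic project."""
    # Training and Environment are always listed.
    sections = ["Training", "Environment"]
    if algorithm in MULTI_AGENT_BUFFER:
        # MADDPG and MATD3 get a multi-agent buffer section instead.
        sections.append("Multi-Agent Replay Buffer")
    elif algorithm not in ON_POLICY:
        # On-policy algorithms have no replay buffer section.
        sections.append("Replay Buffer")
    if algorithm == "DQN":
        # Epsilon settings apply to DQN only.
        sections.append("Epsilon")
    return sections
```

For example, under these assumptions `visible_sections("PPO")` yields only Training and Environment, while `visible_sections("DQN")` also includes Replay Buffer and Epsilon.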

The parameters shown depend on the chosen algorithm. Common ones include Batch size, Discount factor (Gamma), Learn step, Learning rate, Surrogate clipping coefficient, Entropy coefficient, Actor learning rate, Critic learning rate, and Double DQN.

Leaving the step runs validation on the form. Next is blocked while any field still shows an error.

Advanced Training projects

The heading is Training parameters. The same kinds of schema-driven fields appear, tuned for the LLM or dataset paths you selected earlier.

Validation and drafts

On a Draft, unresolved form errors stop Next. Reasoning reward errors on this step are ignored if you already passed Validate on Environment.
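
The gating rule for a Draft can be sketched as follows. This is a minimal illustration under stated assumptions, not the product's code: the function name, the error dictionary shape, and the field name `reasoning_reward` are hypothetical.

```python
# Hypothetical sketch of the Next gating rule on a Draft.
# Field names and the error-dict shape are assumptions for illustration.
REASONING_REWARD_FIELDS = frozenset({"reasoning_reward"})


def next_is_blocked(errors: dict[str, str], environment_validated: bool) -> bool:
    """Return True if unresolved form errors should stop Next."""
    blocking = {
        field: message
        for field, message in errors.items()
        # Reasoning reward errors are ignored once Validate has
        # already passed on the Environment step.
        if not (environment_validated and field in REASONING_REWARD_FIELDS)
    }
    return bool(blocking)
```

With this sketch, a lone reasoning reward error blocks Next only when Validate has not yet passed on Environment; any other field error blocks Next either way.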

Both idle autosave and Next persist draft changes. Training does not start until you choose Train on Summary or Train in the experiments table.

Next step

Continue to HPO for search space and mutation. See HPO step.