Agent step

Agent is the third step of the wizard. Here you pick the learning algorithm and configure the network or language model.

Classic flow

Algorithm is required. Open the catalog and choose a card. The names on the cards, such as PPO, TD3, and DQN, come from the catalog.

Below that, Neural network holds layer and activation settings. Toggles include Use MLP encoder for vector observations, SimBa architecture, and per-observation Image / Vector when the form needs them.

SimBa architecture (when shown) switches to a simplicity-biased MLP design intended to reduce overfitting on vector observations. Hover the info icon next to the label for the full description.

Arena picks MLP or CNN defaults from observation type (vector vs image). You can add or remove layers in the form. Info icons on fields explain individual hyperparameters.
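
The default choice can be pictured as a lookup on the observation's shape. The sketch below is illustrative only: the function name `default_encoder`, the rank-based rule, and the returned labels are assumptions made for this example, not Arena's implementation.

```python
from typing import Sequence


def default_encoder(obs_shape: Sequence[int]) -> str:
    """Return a hypothetical default encoder label for an observation shape.

    Assumption: a rank-1 observation is a vector, and a rank-3 observation
    (for example channels x height x width) is an image.
    """
    if len(obs_shape) == 1:
        return "mlp"
    if len(obs_shape) == 3:
        return "cnn"
    raise ValueError(f"unsupported observation shape: {tuple(obs_shape)}")


print(default_encoder((8,)))         # vector observation
print(default_encoder((3, 84, 84)))  # image observation
```

Whatever default is picked, the layers remain editable in the form.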

Algorithm families	Typical action space or setting
DQN, Rainbow DQN, PPO, Recurrent PPO	Discrete actions
DDPG, TD3, PPO, Recurrent PPO	Continuous actions
MADDPG, MATD3, IPPO	Multi-agent environments

The catalog only offers algorithms compatible with the environment you chose on the previous step. Recurrent PPO appears only when Observation Type is Vector (not Image or Mixed).
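
The compatibility rule can be sketched as a filter over the families in the table above. The `CATALOG` dictionary, its keys, and `compatible_algorithms` are hypothetical names for illustration. The real catalog may apply further rules that this page does not describe.

```python
from typing import Dict, List

# Families copied from the table above; the data structure itself is hypothetical.
CATALOG: Dict[str, List[str]] = {
    "discrete": ["DQN", "Rainbow DQN", "PPO", "Recurrent PPO"],
    "continuous": ["DDPG", "TD3", "PPO", "Recurrent PPO"],
    "multi_agent": ["MADDPG", "MATD3", "IPPO"],
}


def compatible_algorithms(family: str, observation_type: str) -> List[str]:
    """List algorithms for an environment family.

    Recurrent PPO is hidden unless the observation type is Vector.
    """
    names = CATALOG[family]
    if observation_type != "Vector":
        names = [name for name in names if name != "Recurrent PPO"]
    return list(names)


print(compatible_algorithms("discrete", "Vector"))
print(compatible_algorithms("continuous", "Image"))
```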

Advanced Training flow

What you see depends on what you picked on Environment:

You selected	Agent UI
RL Environment or reasoning / preference / SFT dataset	Language model, Algorithm parameters, Target modules, model picker (Select a model…)
Tabular or non-tabular dataset	Classic-style Algorithm and Neural network, plus supervised fields
Custom algorithm	Custom Algorithm {name} placeholder

LLM path highlights:

Target modules: the info text reads "Select which modules should receive LoRA adapters."
Validation: if no target module is selected, the wizard shows "Please select at least one target module before moving to {step}."
GPU memory error: "Model exceeds available GPU memory. Please reduce number of selected LoRA modules, LoRA rank, or increase resources." Add resources on the Resources step, or shrink the model on this step.
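
To see why fewer target modules or a lower rank reduces the footprint: for a weight matrix of shape d_out x d_in, a LoRA adapter of rank r adds two low-rank factors with r * (d_in + d_out) trainable parameters in total. The sketch below counts adapter parameters under that formula. The module names, shapes, and layer count are hypothetical values chosen for illustration, not taken from any model in the picker. Adapter parameters are also only one part of GPU memory use, since base-model weights and activations count too.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical per-layer module shapes (d_out, d_in), for illustration only.
MODULE_SHAPES: Dict[str, Tuple[int, int]] = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 4096),
    "v_proj": (4096, 4096),
    "o_proj": (4096, 4096),
}


def lora_param_count(target_modules: Iterable[str], rank: int, num_layers: int) -> int:
    """Count trainable LoRA parameters: rank * (d_in + d_out) per adapted matrix."""
    targets = list(target_modules)
    if not targets:
        # Mirrors the wizard's rule that at least one target module is required.
        raise ValueError("select at least one target module")
    per_layer = sum(
        rank * (MODULE_SHAPES[name][0] + MODULE_SHAPES[name][1]) for name in targets
    )
    return per_layer * num_layers


full = lora_param_count(["q_proj", "k_proj", "v_proj", "o_proj"], rank=16, num_layers=32)
small = lora_param_count(["q_proj", "v_proj"], rank=8, num_layers=32)
print(full, small)
```

In this example, halving both the rank and the number of target modules cuts the adapter parameter count by a factor of four.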

LatentPPO is available only for non-tabular object detection, and only after a Supervised saved model exists. It adds Training mode (Joint / Frozen Decoder), Trainable latent adapter, and Saved decoder checkpoint. The checkpoint picker shows Select saved model, or No saved models available when there are none. Tabular Supervised uses vector or encoder network fields without those LatentPPO-only controls.

Saving as you leave

When you click Next, the wizard saves the draft changes for this step together with your algorithm and environment choices.

Next step

Continue to Training for timesteps, batching, and LLM training fields where applicable. See Training step.

Algorithm reference: Algorithms.