Agent step

Step three is Agent. Pick the learning algorithm and configure the network or language model.

Classic flow

Algorithm is required. Open the catalog and choose a card; names such as PPO, TD3, and DQN come from the catalog.

Below that, Neural network holds layer and activation settings. Toggles include Use MLP encoder for vector observations, SimBa architecture, and per-observation Image / Vector when the form needs them.

SimBa architecture (when shown) switches to a simplicity-biased MLP design intended to reduce overfitting on vector observations. Hover the info icon next to the label for the full description.

Arena picks MLP or CNN defaults from observation type (vector vs image). You can add or remove layers in the form. Info icons on fields explain individual hyperparameters.
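The default selection described above can be pictured with a short sketch. This is not Arena's implementation: the function name `default_encoder` and the rule "1-D shape means vector, 3-D shape means image" are assumptions made for illustration only.

```python
from typing import Sequence


def default_encoder(obs_shape: Sequence[int]) -> str:
    """Return "mlp" for flat vector observations and "cnn" for image-like ones.

    Assumption (hypothetical rule): a 1-D shape is a vector, and a 3-D shape
    such as (channels, height, width) is an image.
    """
    if len(obs_shape) == 1:
        return "mlp"
    if len(obs_shape) == 3:
        return "cnn"
    raise ValueError(f"Unsupported observation shape: {tuple(obs_shape)}")


print(default_encoder((8,)))         # mlp
print(default_encoder((3, 84, 84)))  # cnn
```

Whatever default is chosen, the layers remain editable in the form.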

Algorithm families | Typical action spaces
DQN, Rainbow DQN, PPO, Recurrent PPO (discrete actions) | Discrete actions
DDPG, TD3, PPO, Recurrent PPO (continuous actions) | Continuous actions
MADDPG, MATD3, IPPO | Multi-agent envs

The catalog only offers algorithms compatible with the environment you chose on the previous step. Recurrent PPO appears only when Observation Type is Vector (not Image or Mixed).
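The compatibility rule can be sketched as a filter over the families listed above. The names `AlgoSpec`, `CATALOG`, and `compatible_algorithms` are hypothetical, and two behaviours are assumptions rather than documented facts: multi-agent algorithms are matched on the multi-agent flag alone, and single-agent algorithms are hidden for multi-agent environments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlgoSpec:
    name: str
    action_spaces: frozenset = frozenset()  # "discrete" and/or "continuous"
    multi_agent: bool = False
    vector_only: bool = False  # True for Recurrent PPO


# Hypothetical catalog mirroring the families listed in this section.
CATALOG = [
    AlgoSpec("DQN", frozenset({"discrete"})),
    AlgoSpec("Rainbow DQN", frozenset({"discrete"})),
    AlgoSpec("DDPG", frozenset({"continuous"})),
    AlgoSpec("TD3", frozenset({"continuous"})),
    AlgoSpec("PPO", frozenset({"discrete", "continuous"})),
    AlgoSpec("Recurrent PPO", frozenset({"discrete", "continuous"}), vector_only=True),
    AlgoSpec("MADDPG", multi_agent=True),
    AlgoSpec("MATD3", multi_agent=True),
    AlgoSpec("IPPO", multi_agent=True),
]


def compatible_algorithms(action_space, observation_type, multi_agent=False):
    """Return catalog names that fit the chosen environment."""
    names = []
    for algo in CATALOG:
        # Assumption: single-agent and multi-agent algorithms never mix.
        if algo.multi_agent != multi_agent:
            continue
        # Single-agent algorithms must support the action space.
        if not algo.multi_agent and action_space not in algo.action_spaces:
            continue
        # Recurrent PPO is offered only for Vector observations.
        if algo.vector_only and observation_type != "vector":
            continue
        names.append(algo.name)
    return names


print(compatible_algorithms("discrete", "vector"))
# ['DQN', 'Rainbow DQN', 'PPO', 'Recurrent PPO']
print(compatible_algorithms("continuous", "image"))
# ['DDPG', 'TD3', 'PPO']
```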

Advanced Training flow

What you see depends on what you picked on Environment:

You selected | Agent UI
RL Environment or reasoning / preference / SFT dataset | Language model, Algorithm parameters, Target modules, model picker (Select a model…)
Tabular or non-tabular dataset | Classic-style Algorithm and Neural network, plus supervised fields
Custom algorithm | Custom Algorithm {name} placeholder

LLM path highlights:

  • Target modules: the info text reads "Select which modules should receive LoRA adapters."

  • If no target module is chosen, this message appears: "Please select at least one target module before moving to {step}."

  • GPU memory error: "Model exceeds available GPU memory. Please reduce number of selected LoRA modules, LoRA rank, or increase resources." To fix it, increase resources on the Resources step or shrink the model here.
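To see why fewer target modules or a lower rank reduces the footprint, here is a rough sketch of the standard LoRA parameter count: an adapter on a linear layer with input size d_in and output size d_out adds rank × (d_in + d_out) trainable parameters. The function name, the module names, and the 4096 hidden size are illustrative assumptions, not values taken from Arena or any particular model.

```python
def lora_trainable_params(module_shapes, rank):
    """Count adapter parameters added by LoRA.

    module_shapes maps a module name to (d_in, d_out) of the wrapped linear
    layer. Each adapter adds two matrices: A (rank x d_in) and B (d_out x rank).
    """
    return sum(rank * (d_in + d_out) for d_in, d_out in module_shapes.values())


# Hypothetical shapes for one transformer layer with hidden size 4096.
shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 4096),
    "v_proj": (4096, 4096),
    "o_proj": (4096, 4096),
}

print(lora_trainable_params(shapes, rank=16))  # 524288

# Keeping two modules and halving the rank cuts the count by a factor of four.
fewer = {name: shapes[name] for name in ("q_proj", "v_proj")}
print(lora_trainable_params(fewer, rank=8))    # 131072
```

This counts adapter parameters only. Actual GPU memory also depends on the base model weights, optimizer state, activations, and batch size, so treat the numbers as a relative guide.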

LatentPPO is offered only for non-tabular object detection, and only after a Supervised saved model exists. It adds three controls: Training mode (Joint / Frozen Decoder), Trainable latent adapter, and Saved decoder checkpoint, which shows either Select saved model or No saved models available. Tabular Supervised uses vector or encoder network fields without those LatentPPO-only controls.

Saving as you leave

On Next, the wizard saves draft changes for this step together with algorithm and environment choices.

Next step

Continue to Training for timesteps, batching, and LLM training fields where applicable. See Training step.

Algorithm reference: Algorithms.