Classic RL algorithms

Classic RL projects train against the gym environment chosen in the Environment step: either an Arena built-in or a custom environment you uploaded. The Agent step lists the algorithms below; the Training step exposes the network, replay buffer (when the algorithm uses one), and training-loop hyperparameters.

Single-agent

Use these when the environment has one decision-maker. The subsections group them by how they learn, not by agent count.

Value-based (off-policy)

Name	Notes
DQN	Uses a replay buffer and a target network; epsilon decay settings appear in the training forms
Rainbow DQN	DQN with extensions such as prioritized replay and distributional value learning; the related options appear in the algorithm section
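
The epsilon decay mentioned for DQN controls how quickly the agent shifts from random exploration to greedy action selection. Below is a minimal sketch of one common schedule, multiplicative decay with a floor. The names `eps_start`, `eps_end`, and `eps_decay` are illustrative assumptions; they may not match Arena's form fields or the exact schedule Arena applies.

```python
import random


def decayed_epsilon(step: int, eps_start: float = 1.0, eps_end: float = 0.05,
                    eps_decay: float = 0.995) -> float:
    """Multiplicative epsilon decay with a lower bound.

    Parameter names are hypothetical, chosen for illustration only.
    """
    return max(eps_end, eps_start * eps_decay ** step)


def epsilon_greedy(q_values: list[float], epsilon: float, rng: random.Random) -> int:
    """Return a random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice: index of the largest estimated Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 100, 1000):
        eps = decayed_epsilon(step)
        action = epsilon_greedy([0.1, 0.9, 0.3], eps, rng)
        print(f"step={step} epsilon={eps:.3f} action={action}")
```

A slower decay (a value of `eps_decay` closer to 1) keeps the agent exploring for longer, which tends to help on sparse-reward tasks at the cost of slower early progress.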

Policy gradient and actor-critic

Name	Notes
PPO	Common default for many discrete and continuous gym tasks
Recurrent PPO	PPO with a recurrent policy when observations need memory
DDPG	Off-policy actor-critic for continuous control; deterministic actor with a Q-value critic
TD3	DDPG extension with twin critics, delayed policy updates, and target policy smoothing
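
The three TD3 ingredients listed in the table can be made concrete with a small sketch of the critic target for a single transition. This is the textbook form of the computation for a scalar action, not Arena's implementation; the function and parameter names (`td3_target`, `policy_noise`, `noise_clip`, `policy_delay`) are assumptions made for illustration.

```python
import random
from typing import Callable


def clip(x: float, low: float, high: float) -> float:
    """Clamp x to the closed interval [low, high]."""
    return max(low, min(high, x))


def td3_target(reward: float, done: bool, next_state: object,
               target_actor: Callable[[object], float],
               target_q1: Callable[[object, float], float],
               target_q2: Callable[[object, float], float],
               rng: random.Random, gamma: float = 0.99,
               policy_noise: float = 0.2, noise_clip: float = 0.5,
               action_low: float = -1.0, action_high: float = 1.0) -> float:
    """Bootstrapped critic target for one transition (scalar action)."""
    # Target smoothing: perturb the target action with clipped Gaussian noise.
    noise = clip(rng.gauss(0.0, policy_noise), -noise_clip, noise_clip)
    next_action = clip(target_actor(next_state) + noise, action_low, action_high)
    # Twin critics: take the smaller estimate to curb overestimation.
    q_min = min(target_q1(next_state, next_action),
                target_q2(next_state, next_action))
    # No bootstrapping past a terminal state.
    return reward + gamma * (1.0 - float(done)) * q_min


def should_update_actor(critic_updates: int, policy_delay: int = 2) -> bool:
    """Delayed policy updates: refresh the actor once per `policy_delay` critic updates."""
    return critic_updates % policy_delay == 0
```

Dropping the second critic and the smoothing noise and setting `policy_delay` to 1 reduces this target to the DDPG one, which is why the two algorithms share most of their hyperparameters.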

Multi-agent

Use these when the environment exposes multiple agents.

Name	Notes
MADDPG	Multi-agent DDPG
MATD3	Multi-agent TD3
IPPO	Independent PPO per agent

Choosing one

Among single-agent trainers, the off-policy methods (DQN, Rainbow DQN, DDPG, TD3) reuse past transitions sampled from a replay buffer, so they often need fewer fresh environment steps per update. PPO and Recurrent PPO are on-policy: they collect new rollouts each cycle and are usually easier to tune on standard benchmarks. Multi-agent algorithms assume your environment reports the correct agent count and observation spaces.
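
The guidance on this page can be summarized as a small lookup. The sketch below is a hypothetical helper, not part of Arena: `suggest_algorithms` and its arguments are invented for illustration, and it assumes the textbook pairing of DQN-family methods with discrete actions and DDPG and TD3 with continuous actions. It returns every multi-agent algorithm without ranking them, since this page gives no basis for preferring one.

```python
def suggest_algorithms(n_agents: int, continuous_actions: bool,
                       needs_memory: bool = False) -> list[str]:
    """Return candidate algorithms from this page for a given environment shape.

    Hypothetical helper: the ordering puts PPO first because the page
    describes it as a common default, not because of any Arena ranking.
    """
    if n_agents < 1:
        raise ValueError("n_agents must be at least 1")
    if n_agents > 1:
        # Multi-agent environments: all listed multi-agent trainers are candidates.
        return ["IPPO", "MADDPG", "MATD3"]
    if needs_memory:
        # Observations alone do not determine the state, so use a recurrent policy.
        return ["Recurrent PPO"]
    if continuous_actions:
        return ["PPO", "TD3", "DDPG"]
    return ["PPO", "DQN", "Rainbow DQN"]


if __name__ == "__main__":
    print(suggest_algorithms(n_agents=1, continuous_actions=True))
    print(suggest_algorithms(n_agents=1, continuous_actions=False))
    print(suggest_algorithms(n_agents=3, continuous_actions=True))
```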

When you change the algorithm in the wizard, Arena refreshes defaults on Agent and Training to match.