Classic RL algorithms

Classic RL projects train against the gym environment chosen in the Environment step: either an Arena built-in or a custom environment you uploaded. The Agent step lists the algorithms below; the Training step exposes network, replay buffer (when the algorithm uses one), and training-loop hyperparameters.

Single-agent

Use these when the environment has one decision-maker. The subsections group them by how they learn, not by agent count.

Value-based (off-policy)

| Name | Notes |
| --- | --- |
| DQN | Replay buffer and target network; epsilon decay appears in training forms |
| Rainbow DQN | DQN extensions such as prioritized replay and distributional options in the algorithm section |
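The DQN notes mention epsilon decay and a target network. As a rough illustration of what those two pieces do, here is a minimal sketch in plain Python. The function names, the multiplicative decay schedule, and the default values are assumptions made for this example; they are not Arena's form fields or its implementation.

```python
import random


def decayed_epsilon(step, eps_start=1.0, eps_end=0.05, decay=0.995):
    """Multiplicative epsilon decay, floored at eps_end (illustrative defaults)."""
    return max(eps_end, eps_start * decay ** step)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


def td_target(reward, next_q_target, done, gamma=0.99):
    """One-step DQN target: r + gamma * max_a' Q_target(s', a').

    next_q_target holds the target network's Q-values for the next state.
    Terminal transitions do not bootstrap.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_target)
```

Early in training `decayed_epsilon` stays near `eps_start`, so most actions are exploratory; as the step count grows it settles at `eps_end`. Computing the target from a separate, slowly updated network is what keeps the regression target from shifting on every gradient step.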

Policy gradient and actor-critic

| Name | Notes |
| --- | --- |
| PPO | Common default for many discrete and continuous gym tasks |
| Recurrent PPO | PPO with a recurrent policy when observations need memory |
| DDPG | Continuous control with actor and critic |
| TD3 | Twin critics, delayed policy updates, target smoothing |
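The three TD3 ingredients named above (twin critics, delayed policy updates, target smoothing) can be sketched as follows. This is a simplified illustration with scalar actions and plain callables standing in for networks; every name and default value here is hypothetical and not taken from Arena.

```python
import random


def smoothed_target_action(target_actor, next_state, noise_std=0.2,
                           noise_clip=0.5, low=-1.0, high=1.0, rng=random):
    """Target smoothing: add clipped Gaussian noise to the target action."""
    noise = rng.gauss(0.0, noise_std)
    noise = max(-noise_clip, min(noise_clip, noise))
    action = target_actor(next_state) + noise
    # Keep the perturbed action inside the (assumed) action bounds.
    return max(low, min(high, action))


def td3_target(reward, next_state, done, target_actor, target_q1, target_q2,
               gamma=0.99, **smoothing):
    """Twin critics: bootstrap from the smaller of the two target Q-values."""
    if done:
        return reward
    action = smoothed_target_action(target_actor, next_state, **smoothing)
    q_min = min(target_q1(next_state, action), target_q2(next_state, action))
    return reward + gamma * q_min


def should_update_actor(critic_updates, policy_delay=2):
    """Delayed policy updates: refresh the actor every policy_delay critic updates."""
    return critic_updates % policy_delay == 0
```

Taking the minimum of two critics counteracts the overestimation that a single critic tends to accumulate, and updating the actor less often gives the critics time to settle before the policy chases them.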

Multi-agent

Use these when the environment exposes multiple agents.

| Name | Notes |
| --- | --- |
| MADDPG | Multi-agent DDPG |
| MATD3 | Multi-agent TD3 |
| IPPO | Independent PPO per agent |
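To make the difference between these entries concrete: IPPO trains each agent independently on its own observations, while MADDPG (and MATD3, which builds on it) is commonly described as pairing decentralized actors with critics that see every agent's observations and actions during training. The sketch below shows only that data flow. The dict-keyed-by-agent layout and the function names are assumptions for illustration, not Arena's interface.

```python
def independent_actions(policies, observations):
    """IPPO-style acting: each agent's policy sees only its own observation."""
    return {agent: policies[agent](obs) for agent, obs in observations.items()}


def centralized_critic_input(observations, actions, agent_order):
    """MADDPG/MATD3-style critic input.

    Concatenates all agents' observations, then all agents' actions,
    in a fixed agent order so the layout is stable across calls.
    """
    joint = []
    for agent in agent_order:
        joint.extend(observations[agent])
    for agent in agent_order:
        joint.extend(actions[agent])
    return joint


# Hypothetical two-agent example with list-valued observations and actions.
observations = {"agent_0": [1.0, 2.0], "agent_1": [3.0]}
policies = {
    "agent_0": lambda obs: [sum(obs)],
    "agent_1": lambda obs: [-obs[0]],
}
actions = independent_actions(policies, observations)
joint_input = centralized_critic_input(observations, actions, ["agent_0", "agent_1"])
```

Because the centralized critic's input grows with the number of agents and their observation sizes, an environment that misreports either one produces a shape mismatch rather than a subtle training bug.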

Choosing one

Among single-agent trainers, off-policy methods (DQN, Rainbow DQN, DDPG, TD3) reuse past transitions from a replay buffer, so they often need fewer fresh environment steps per update. PPO and Recurrent PPO are on-policy: they collect new rollouts each cycle and are usually easier to tune on standard benchmarks. Multi-agent algorithms assume your environment reports the correct number of agents and their observation spaces.
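The guidance above can be condensed into a small rule-of-thumb helper. This is a hypothetical sketch, not logic that Arena runs: the function name and arguments are made up for illustration, and the action-space filtering relies on the usual formulations (DQN variants for discrete actions, DDPG and TD3 for continuous ones). The source does not state action-space limits for the multi-agent algorithms, so all three are returned unfiltered.

```python
def candidate_algorithms(action_space, num_agents=1, needs_memory=False):
    """Return algorithm names from this page worth trying first.

    action_space: "discrete" or "continuous" (assumed labels).
    num_agents:   number of agents the environment exposes.
    needs_memory: True when single observations do not capture the state.
    """
    if action_space not in ("discrete", "continuous"):
        raise ValueError("action_space must be 'discrete' or 'continuous'")
    if num_agents < 1:
        raise ValueError("num_agents must be at least 1")

    if num_agents > 1:
        # No action-space filtering here; the page does not specify any.
        return ["IPPO", "MADDPG", "MATD3"]
    if needs_memory:
        return ["Recurrent PPO"]
    if action_space == "discrete":
        return ["PPO", "DQN", "Rainbow DQN"]
    return ["PPO", "DDPG", "TD3"]
```

PPO is listed first for single-agent tasks only because the page calls it a common default; sample-efficiency concerns may still favor one of the off-policy options.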

When you change the algorithm in the wizard, Arena refreshes defaults on Agent and Training to match.