Classic RL algorithms
Classic RL projects train against the gym environment selected in the Environment step, either an Arena built-in or a custom environment you uploaded. The Agent step lists the algorithms below; the Training step exposes the network, replay buffer (when the algorithm uses one), and training-loop hyperparameters.
Single-agent
Use these when the environment has one decision-maker. The subsections below group the algorithms by how they learn.
Value-based (off-policy)
| Name | Notes |
|---|---|
| DQN | Uses a replay buffer and a target network; epsilon-decay settings appear in the training forms |
| Rainbow DQN | DQN with extensions such as prioritized replay and distributional value estimates; the options appear in the algorithm section |
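
Two of the DQN ingredients named in the table, the replay buffer and epsilon decay, can be sketched in a few lines. This is an illustrative sketch only, not Arena's implementation: `ReplayBuffer`, `linear_epsilon`, and all parameter names and defaults are hypothetical, and the linear schedule is an assumption (the training forms may use a different decay rule).

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size FIFO store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        # deque with maxlen silently drops the oldest transition when full.
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement breaks up temporal correlation.
        return rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


def linear_epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Anneal the exploration rate linearly from eps_start to eps_end (assumed schedule)."""
    fraction = min(max(step, 0) / decay_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)
```

With these defaults the agent explores on every step at the start of training and settles at 5% random actions after 10,000 steps; shortening `decay_steps` makes it exploit sooner.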
Policy gradient and actor-critic
| Name | Notes |
|---|---|
| PPO | On-policy; a common default for many discrete and continuous gym tasks |
| Recurrent PPO | PPO with a recurrent policy, for tasks where a single observation is not enough and the agent needs memory |
| DDPG | Off-policy continuous control with a deterministic actor and a critic |
| TD3 | DDPG variant with twin critics, delayed policy updates, and target policy smoothing |
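
The three TD3 ingredients in the table can be shown for a single transition with a scalar action. This is a minimal sketch under stated assumptions, not Arena's code: `td3_target`, `should_update_actor`, and every parameter name and default are hypothetical, `next_action` stands for the target actor's output at `next_state`, and `q1_target` / `q2_target` are any callables that map `(state, action)` to a value estimate.

```python
import random


def td3_target(reward, done, next_state, next_action, q1_target, q2_target,
               gamma=0.99, noise_std=0.2, noise_clip=0.5,
               action_low=-1.0, action_high=1.0, rng=random):
    """Bootstrapped critic target for one transition (scalar action)."""
    # Target policy smoothing: perturb the target action with clipped noise.
    noise = rng.gauss(0.0, noise_std)
    noise = max(-noise_clip, min(noise_clip, noise))
    smoothed_action = max(action_low, min(action_high, next_action + noise))
    # Twin critics: use the smaller estimate to curb overestimation.
    min_q = min(q1_target(next_state, smoothed_action),
                q2_target(next_state, smoothed_action))
    # Do not bootstrap past a terminal transition.
    return reward + gamma * (0.0 if done else 1.0) * min_q


def should_update_actor(critic_updates, policy_delay=2):
    """Delayed policy updates: refresh the actor every `policy_delay` critic updates."""
    return critic_updates % policy_delay == 0
```

Dropping the noise and using a single critic in `td3_target` gives the plain DDPG target, which is one way to see TD3 as DDPG plus three stabilizing changes.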
Multi-agent
Use these when the environment exposes multiple agents.
| Name | Notes |
|---|---|
| MADDPG | Multi-agent DDPG |
| MATD3 | Multi-agent TD3 |
| IPPO | Independent PPO per agent |
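
One way to picture the difference between IPPO and the DDPG-style multi-agent methods is what each critic sees. In the original MADDPG formulation, each critic is centralized and trained on the joint observations and actions of all agents, whereas an independent learner such as IPPO uses only its own observation. The sketch below assumes that setup; whether Arena builds critic inputs exactly this way is not stated on this page, and the function names and dict-of-lists layout are hypothetical.

```python
def independent_critic_input(observations, agent_id):
    """Independent learner (IPPO-style): the critic sees only its own observation."""
    return list(observations[agent_id])


def centralized_critic_input(observations, actions, agent_ids):
    """Centralized critic (MADDPG-style): joint observations, then joint actions."""
    joint = []
    # A fixed agent order keeps the input layout stable across updates.
    for agent_id in agent_ids:
        joint.extend(observations[agent_id])
    for agent_id in agent_ids:
        joint.extend(actions[agent_id])
    return joint


# Hypothetical two-agent step with per-agent observation and action vectors.
observations = {"agent_0": [0.1, 0.2], "agent_1": [0.3]}
actions = {"agent_0": [1.0], "agent_1": [-1.0]}
own_view = independent_critic_input(observations, "agent_0")
joint_view = centralized_critic_input(observations, actions, ["agent_0", "agent_1"])
```

The centralized input grows with the number of agents, which is one reason the agent count and observation spaces reported by the environment have to be correct.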
Choosing one
Among single-agent trainers, the off-policy methods (DQN, Rainbow DQN, DDPG, TD3) learn from a replay buffer, so they reuse past transitions and often need fewer fresh environment steps per update. PPO and Recurrent PPO are on-policy: they collect new rollouts each cycle and are usually easier to tune on standard benchmarks. Multi-agent algorithms assume that your environment reports the correct agent count and observation spaces.
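
The guidance above can be condensed into a small lookup. This is a rough sketch, not the wizard's logic: `suggest_algorithms` and its arguments are hypothetical, and the discrete/continuous split follows the usual convention (DQN-family methods on discrete actions, DDPG and TD3 on continuous actions) rather than anything Arena is documented to enforce.

```python
SINGLE_AGENT = {
    # Assumed convention: value-based methods for discrete action spaces.
    "discrete": ["DQN", "Rainbow DQN", "PPO", "Recurrent PPO"],
    # Assumed convention: deterministic actor-critic methods for continuous control.
    "continuous": ["PPO", "Recurrent PPO", "DDPG", "TD3"],
}
MULTI_AGENT = ["MADDPG", "MATD3", "IPPO"]


def suggest_algorithms(num_agents, action_space):
    """Return candidate algorithm names for an environment description."""
    if num_agents < 1:
        raise ValueError("num_agents must be at least 1")
    if num_agents > 1:
        # Multi-agent environments use the multi-agent trainers.
        return list(MULTI_AGENT)
    if action_space not in SINGLE_AGENT:
        raise ValueError("action_space must be 'discrete' or 'continuous'")
    return list(SINGLE_AGENT[action_space])


candidates = suggest_algorithms(num_agents=1, action_space="continuous")
```

PPO appears in both single-agent lists because it handles discrete and continuous tasks; Recurrent PPO is worth trying when observations alone do not carry enough history.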
When you change the algorithm in the wizard, Arena refreshes defaults on Agent and Training to match.