Algorithms reference
The experiment wizard and Agents page only show algorithms that fit your project. Classic RL projects get gym-style trainers; in advanced projects, the available algorithms also depend on the dataset type. The tables below mirror those pickers rather than cataloguing every published algorithm.
Classic RL (gym environments)
Single-agent
| Name | Notes |
|---|---|
| DQN | Value-based, replay buffer |
| Rainbow DQN | DQN extensions |
| PPO | On-policy policy gradient |
| Recurrent PPO | Recurrent policy |
| DDPG | Continuous control actor-critic |
| TD3 | Twin critics |
Multi-agent
| Name | Notes |
|---|---|
| MADDPG | Multi-agent DDPG |
| MATD3 | Multi-agent TD3 |
| IPPO | Independent PPO per agent |
Deploy these from the Classic RL Agents tab. Manual HTTP snippets use get_action. See Inference contract.
LLM and reasoning (dataset / advanced)
| Name | Typical dataset / environment |
|---|---|
| GRPO | Reasoning |
| GSPO | Reasoning |
| CISPO | Reasoning |
| DPO | Preference (with SFT as an alternative on the same dataset type) |
| SFT | SFT, or preference datasets when offered alongside DPO |
| TurnPPO | Language gym (simulation) |
| TokenPPO | Language gym |
| TurnREINFORCE | Language gym |
| TokenREINFORCE | Language gym |
GSPO and CISPO share the same training form family as GRPO. TokenPPO and TokenREINFORCE pair with TurnPPO and TurnREINFORCE run configurations.
Deploy from Advanced Training. Snippets and the chat playground use generate.
Supervised and LatentPPO
| Name | Use |
|---|---|
| Supervised | Tabular / non-tabular supervised tasks |
| LatentPPO | Latent module between pretrained blocks; pipeline decoder chaining |
| SFT | Supervised fine-tuning on prompt–target pairs (listed here for training context; deploys as LLM with generate) |
Deploy Supervised and LatentPPO from Advanced Training. Connect enables live inference; Manual HTTP snippets use predict for both. SFT trains like other LLM algorithms and deploys with generate. See Supervised training.
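As a quick recap of which inference call goes with which algorithm, the pairing described on this page can be sketched as a small lookup. This is an illustrative helper only: the set names and the `inference_method` function are hypothetical and not part of the product's API. Only the algorithm names and the `get_action`, `generate`, and `predict` method names come from the text above.

```python
# Hypothetical helper: groups the algorithm names from the tables on this page
# by the inference method the page says their snippets use.
CLASSIC_RL = {
    "DQN", "Rainbow DQN", "PPO", "Recurrent PPO", "DDPG", "TD3",
    "MADDPG", "MATD3", "IPPO",
}
LLM = {
    "GRPO", "GSPO", "CISPO", "DPO", "SFT",
    "TurnPPO", "TokenPPO", "TurnREINFORCE", "TokenREINFORCE",
}
SUPERVISED_AND_LATENT = {"Supervised", "LatentPPO"}


def inference_method(algorithm: str) -> str:
    """Return the inference method name this page associates with an algorithm."""
    if algorithm in CLASSIC_RL:
        return "get_action"
    if algorithm in LLM:
        # SFT is listed in the Supervised table for training context,
        # but the page states it deploys as an LLM with generate.
        return "generate"
    if algorithm in SUPERVISED_AND_LATENT:
        return "predict"
    raise ValueError(f"Unknown algorithm: {algorithm!r}")


if __name__ == "__main__":
    for name in ("PPO", "GRPO", "SFT", "LatentPPO"):
        print(f"{name}: {inference_method(name)}")
```

Note that SFT resolves to `generate` even though it also appears in the Supervised table, which matches the deployment note above.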