Algorithms reference

The experiment wizard and the Agents page list only the algorithms that fit your project. Classic RL projects get gym-style trainers; in advanced projects, the available algorithms also depend on the dataset type. The tables below mirror those pickers and are not a catalogue of every published algorithm.

Classic RL (gym environments)

Single-agent

Name	Notes
DQN	Value-based, replay buffer
Rainbow DQN	Combines several DQN extensions
PPO	On-policy policy gradient
Recurrent PPO	Recurrent policy
DDPG	Continuous control actor-critic
TD3	DDPG variant with twin critics and delayed policy updates

Multi-agent

Name	Notes
MADDPG	Multi-agent DDPG
MATD3	Multi-agent TD3
IPPO	Independent PPO per agent

Deploy these from the Classic RL Agents tab. Manual HTTP snippets use get_action. See Inference contract.

LLM and reasoning (dataset / advanced)

Name	Typical dataset / environment
GRPO	Reasoning
GSPO	Reasoning
CISPO	Reasoning
DPO	Preference (with SFT as an alternative on the same dataset type)
SFT	SFT, or preference datasets when offered alongside DPO
TurnPPO	Language gym (simulation)
TokenPPO	Language gym
TurnREINFORCE	Language gym
TokenREINFORCE	Language gym

GSPO and CISPO use the same family of training forms as GRPO. TokenPPO and TokenREINFORCE are paired with the TurnPPO and TurnREINFORCE run configurations.

Deploy these from Advanced Training. Snippets and the chat playground use generate.
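
The table above can be read as a lookup from dataset or environment type to the algorithms offered for it. The Python sketch below restates it in that form. The names `ALGORITHMS_BY_DATASET` and `algorithms_for`, and the lowercase keys, are hypothetical and chosen only for illustration; they are not part of the product's API, and the real pickers may apply further conditions.

```python
# Hypothetical lookup that mirrors the table above.
# The dictionary, its keys, and the function name are illustrative only.
ALGORITHMS_BY_DATASET = {
    "reasoning": ["GRPO", "GSPO", "CISPO"],
    "preference": ["DPO", "SFT"],  # SFT is offered alongside DPO
    "sft": ["SFT"],
    "language gym": ["TurnPPO", "TokenPPO", "TurnREINFORCE", "TokenREINFORCE"],
}


def algorithms_for(dataset_type: str) -> list[str]:
    """Return the algorithms listed for a dataset type, or [] if unknown."""
    key = dataset_type.strip().lower()
    # Copy the list so callers cannot mutate the shared table.
    return list(ALGORITHMS_BY_DATASET.get(key, []))


if __name__ == "__main__":
    print(algorithms_for("Preference"))
```

Returning an empty list for an unknown type is a choice made for this sketch; raising an error would be equally reasonable.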

Supervised and LatentPPO

Name	Use
Supervised	Tabular / non-tabular supervised tasks
LatentPPO	Latent module between pretrained blocks; pipeline decoder chaining
SFT	Supervised fine-tuning on prompt–target pairs (listed here for training context; deploys as LLM with generate)

Deploy Supervised and LatentPPO from Advanced Training. Connect enables live inference; Manual HTTP snippets use predict for both. SFT trains like other LLM algorithms and deploys with generate. See Supervised training.
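
As a summary of the deployment notes on this page, the sketch below maps each listed algorithm to the inference method its snippets use (get_action, generate, or predict). The dictionary `INFERENCE_METHOD` and the helper `inference_method` are hypothetical names introduced for illustration. The sketch only restates the pairing described above and does not call any real endpoint.

```python
# Hypothetical summary of which inference method each algorithm's
# snippets use, as stated on this page. Names are illustrative only.
CLASSIC_RL = ["DQN", "Rainbow DQN", "PPO", "Recurrent PPO", "DDPG", "TD3",
              "MADDPG", "MATD3", "IPPO"]
LLM = ["GRPO", "GSPO", "CISPO", "DPO", "SFT",
       "TurnPPO", "TokenPPO", "TurnREINFORCE", "TokenREINFORCE"]
SUPERVISED_AND_LATENT = ["Supervised", "LatentPPO"]

INFERENCE_METHOD = {
    **dict.fromkeys(CLASSIC_RL, "get_action"),
    **dict.fromkeys(LLM, "generate"),
    **dict.fromkeys(SUPERVISED_AND_LATENT, "predict"),
}


def inference_method(algorithm: str) -> str:
    """Return the inference method name for an algorithm listed on this page."""
    try:
        return INFERENCE_METHOD[algorithm]
    except KeyError:
        raise ValueError(f"Unknown algorithm: {algorithm!r}") from None


if __name__ == "__main__":
    for name in ("PPO", "SFT", "LatentPPO"):
        print(name, "->", inference_method(name))
```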