Algorithms reference

The experiment wizard and the Agents page show only the algorithms that fit your project. Classic RL projects get gym-style trainers; in advanced projects, the available algorithms also depend on the dataset type. The tables below mirror those pickers; they are not a catalog of every published algorithm.

Classic RL (gym environments)

Single-agent

| Name | Notes |
| --- | --- |
| DQN | Value-based, replay buffer |
| Rainbow DQN | DQN extensions |
| PPO | On-policy policy gradient |
| Recurrent PPO | Recurrent policy |
| DDPG | Continuous control actor-critic |
| TD3 | Twin critics |

Multi-agent

| Name | Notes |
| --- | --- |
| MADDPG | Multi-agent DDPG |
| MATD3 | Multi-agent TD3 |
| IPPO | Independent PPO per agent |

Deploy these from the Classic RL Agents tab. Manual HTTP snippets use get_action. See Inference contract.

LLM and reasoning (dataset / advanced)

| Name | Typical dataset / environment |
| --- | --- |
| GRPO | Reasoning |
| GSPO | Reasoning |
| CISPO | Reasoning |
| DPO | Preference (with SFT as an alternative on the same dataset type) |
| SFT | SFT, or preference datasets when offered alongside DPO |
| TurnPPO | Language gym (simulation) |
| TokenPPO | Language gym |
| TurnREINFORCE | Language gym |
| TokenREINFORCE | Language gym |

GSPO and CISPO use the same family of training forms as GRPO. TokenPPO and TokenREINFORCE pair with the TurnPPO and TurnREINFORCE run configurations.

Deploy from Advanced Training. Snippets and the chat playground use generate.
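The dataset-to-algorithm pairing in this section can be sketched as a simple lookup. This is only an illustration of how the picker narrows the list, not the product's code: the dictionary keys (`reasoning`, `preference`, `sft`, `language_gym`) and the `algorithms_for` helper are hypothetical names chosen for this example.

```python
# Hypothetical lookup mirroring the LLM and reasoning table.
# The keys and the helper name are assumptions, not part of any product API.
ALGORITHMS_BY_DATASET = {
    "reasoning": ["GRPO", "GSPO", "CISPO"],
    "preference": ["DPO", "SFT"],  # SFT is offered as an alternative to DPO
    "sft": ["SFT"],
    "language_gym": ["TurnPPO", "TokenPPO", "TurnREINFORCE", "TokenREINFORCE"],
}


def algorithms_for(dataset_type: str) -> list[str]:
    """Return the algorithms listed for a dataset or environment type.

    Unknown types yield an empty list rather than raising.
    """
    key = dataset_type.strip().lower().replace(" ", "_")
    # Copy the list so callers cannot mutate the shared table.
    return list(ALGORITHMS_BY_DATASET.get(key, []))


if __name__ == "__main__":
    for name in ("Reasoning", "Preference", "Language gym"):
        print(name, "->", ", ".join(algorithms_for(name)))
```

Returning an empty list for an unknown type matches the idea that the picker simply shows nothing that does not fit the project.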

Supervised and LatentPPO

| Name | Use |
| --- | --- |
| Supervised | Tabular / non-tabular supervised tasks |
| LatentPPO | Latent module between pretrained blocks; pipeline decoder chaining |
| SFT | Supervised fine-tuning on prompt–target pairs (listed here for training context; deploys as LLM with generate) |

Deploy Supervised and LatentPPO from Advanced Training. Connect enables live inference; Manual HTTP snippets use predict for both. SFT trains like other LLM algorithms and deploys with generate. See Supervised training.
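As a quick summary of which inference method each algorithm family uses, the three sections can be condensed into a small lookup. This is a sketch for orientation only: the method names `get_action`, `generate`, and `predict` come from the text above, while the `INFERENCE_METHOD` table and the `inference_method` helper are hypothetical and not part of the product.

```python
# Hypothetical helper summarizing the deploy notes on this page.
# Only the method names (get_action, generate, predict) come from the docs.
CLASSIC_RL = ["DQN", "Rainbow DQN", "PPO", "Recurrent PPO", "DDPG", "TD3",
              "MADDPG", "MATD3", "IPPO"]
LLM = ["GRPO", "GSPO", "CISPO", "DPO", "SFT",
       "TurnPPO", "TokenPPO", "TurnREINFORCE", "TokenREINFORCE"]
PREDICTIVE = ["Supervised", "LatentPPO"]

INFERENCE_METHOD = {}
INFERENCE_METHOD.update({name: "get_action" for name in CLASSIC_RL})
INFERENCE_METHOD.update({name: "generate" for name in LLM})
INFERENCE_METHOD.update({name: "predict" for name in PREDICTIVE})


def inference_method(algorithm: str) -> str:
    """Return the inference method named on this page for an algorithm.

    Raises KeyError if the algorithm is not listed.
    """
    return INFERENCE_METHOD[algorithm]


if __name__ == "__main__":
    for name in ("PPO", "SFT", "LatentPPO"):
        print(f"{name}: {inference_method(name)}")
```

Note that SFT maps to `generate` even though it appears in the Supervised and LatentPPO table, because it deploys as an LLM.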