LLM algorithms
Advanced Training experiments that use a reasoning, preference, or LLM simulation environment offer the algorithms listed on this page. The Agent step filters this list based on the dataset or environment you chose.
Reasoning datasets (GRPO family)
Typical choices when a Reasoning dataset is attached:
| Name | Role in Arena |
|---|---|
| GRPO | Group-based training without a separate value network |
| GSPO | Variant aimed at steadier group updates |
| CISPO | More conservative policy updates within the same family |
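To make "group-based training without a separate value network" concrete, here is a minimal sketch of the group-relative advantage used in the common GRPO formulation: each completion's reward is compared against the other completions sampled for the same prompt, so the group mean stands in for a learned value baseline. The function name, the `eps` constant, and the use of the population standard deviation are assumptions for illustration; this is not Arena's implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each completion relative to its own sampling group.

    Hypothetical helper: the group mean acts as the baseline, so no
    separate value network is needed to estimate expected reward.
    """
    baseline = mean(rewards)
    spread = pstdev(rewards)  # assumption: population std; some variants differ
    return [(r - baseline) / (spread + eps) for r in rewards]


# Four completions sampled for one prompt, scored 1.0 if correct, else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct completions end up with positive advantages and incorrect ones with negative advantages, and the values sum to roughly zero within the group.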
Preference data
| Name | Role |
|---|---|
| DPO | Direct preference optimization from chosen vs rejected pairs |
SFT can also appear when the flow is wired for preference-style data; for a dedicated SFT dataset, use Supervised training.
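For intuition about what "chosen vs rejected pairs" means for DPO, the sketch below computes the standard DPO loss for a single pair from sequence log-probabilities under the policy and a frozen reference model. The function name, argument names, and the `beta=0.1` default are assumptions for illustration and do not reflect Arena's field names or defaults.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_gain - rejected_gain)))."""
    # How much more the policy likes each response than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: the loss sits at log(2).
loss_equal = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy shifted toward the chosen response: the loss drops.
loss_better = dpo_loss(-10.0, -17.0, -12.0, -15.0)
```

The loss falls as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference model.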
LLM simulation environments
When Environment uses a language-based gym simulation instead of a static dataset export, Agent emphasizes:
| Name | Role |
|---|---|
| TurnPPO | PPO with turn-level actions |
| TokenPPO | PPO with per-token actions |
| TurnREINFORCE | REINFORCE at turn granularity |
| TokenREINFORCE | REINFORCE at token granularity |
GRPO, GSPO, and CISPO can still appear on simulation paths.
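The turn-level and token-level variants in the table above differ in what counts as one action. The sketch below illustrates only that distinction, assuming an episode is represented as a list of turns, each holding the log-probabilities of its generated tokens. The helper names and the data layout are hypothetical; how Arena represents episodes internally is not specified here.

```python
def turn_level_actions(turns):
    """One action per turn: its log-prob is the sum of its token log-probs."""
    return [sum(token_logps) for token_logps in turns]


def token_level_actions(turns):
    """One action per token: every token log-prob is kept separately."""
    return [logp for token_logps in turns for logp in token_logps]


# Hypothetical episode: two model turns with 2 and 3 generated tokens.
episode = [[-0.5, -0.25], [-1.0, -0.5, -0.5]]

turn_actions = turn_level_actions(episode)    # 2 actions, one per turn
token_actions = token_level_actions(episode)  # 5 actions, one per token
```

Both views cover the same total log-probability. The turn-level view assigns credit once per reply, while the token-level view lets each token receive its own learning signal.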
Network and training forms
For LLM runs, you pick a pretrained model on the Agent step, then fill in the algorithm-specific fields and a Training subsection whose fields change with the algorithm (standard LLM training vs simulation-specific fields). Switching algorithms reloads the defaults for the selected combination of algorithm and environment.