LLM algorithms

Advanced Training experiments that use a reasoning, preference, or LLM simulation environment offer the algorithms on this page. The Agent step filters the list based on the dataset or environment you choose.

Reasoning datasets (GRPO family)

Typical choices when a Reasoning dataset is attached:

Name	Role in Arena
GRPO	Group-based training without a separate value network
GSPO	Variant aimed at steadier group updates
CISPO	More conservative policy updates within the same family
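The "group-based training without a separate value network" idea can be illustrated with a minimal sketch. It assumes the commonly described GRPO recipe: sample several completions for one prompt, score each, and use the group's own mean and spread as the baseline. The function name, the `eps` term, and the choice of population standard deviation are illustrative assumptions, not Arena's implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each completion relative to its own sampling group.

    The group mean stands in for the baseline that a learned value
    network would otherwise provide. `eps` (an assumed constant)
    avoids division by zero when every reward in the group is equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one prompt, scored 1.0 if correct, else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that beat the group average get a positive advantage and the rest get a negative one; a group where every completion scores the same contributes no learning signal.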

Preference data

Name	Role
DPO	Direct preference optimization from chosen vs rejected pairs
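As a rough illustration of what "chosen vs rejected pairs" means for the objective, the sketch below computes the standard DPO loss for a single pair from summed sequence log-probabilities. The function name, argument names, and the default `beta` are assumptions made for this example; they are not Arena field names.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) written in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: the loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward the chosen response: lower loss.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference and lowers the rejected one's; `beta` controls how strongly that gap is rewarded.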

SFT (supervised fine-tuning) can also appear when the flow is configured for preference-style data; for a dedicated SFT dataset, use Supervised training instead.

LLM simulation environments

When Environment uses a language-based gym simulation instead of a static dataset export, the Agent step emphasizes:

Name	Role
TurnPPO	PPO with turn-level actions
TokenPPO	PPO with per-token actions
TurnREINFORCE	REINFORCE at turn granularity
TokenREINFORCE	REINFORCE at token granularity
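The turn versus token distinction in the table comes down to what counts as one action. The sketch below is one plausible reading, not a description of Arena internals: at token granularity every generated token is an action with its own log-probability, while at turn granularity a whole reply is one action whose log-probability is the sum over its tokens. All function names are hypothetical.

```python
def token_level_actions(turns):
    """Each token is its own action: one log-prob per token."""
    return [logp for turn in turns for logp in turn]


def turn_level_actions(turns):
    """Each turn is one action: its log-prob is the sum over its tokens."""
    return [sum(turn) for turn in turns]


def reinforce_loss(action_logps, returns):
    """Plain REINFORCE objective: minimize -sum(log pi(a) * return)."""
    if len(action_logps) != len(returns):
        raise ValueError("need exactly one return per action")
    return -sum(lp * g for lp, g in zip(action_logps, returns))


# Two turns: the first reply has two tokens, the second has one.
turns = [[-0.5, -0.25], [-1.0]]

# Turn granularity: two actions, so two returns.
turn_loss = reinforce_loss(turn_level_actions(turns), [1.0, 0.5])

# Token granularity: three actions, so credit can differ within a turn.
token_loss = reinforce_loss(token_level_actions(turns), [1.0, 0.75, 0.5])
```

Turn-level variants need one return (or advantage) per reply, while token-level variants need one per token, which allows finer credit assignment at the cost of more values to estimate.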

GRPO, GSPO, and CISPO can still appear on simulation paths.
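The filtering described on this page can be summarized as a lookup from the kind of data source to the algorithms listed for it. The sketch below only restates the lists above in code form; the keys `"reasoning"`, `"preference"`, and `"simulation"` and the function name are hypothetical labels chosen for this example, not Arena identifiers or an Arena API.

```python
# Hypothetical keys; the values restate the algorithm lists from this page.
ALGORITHMS_BY_SOURCE = {
    "reasoning": ["GRPO", "GSPO", "CISPO"],
    # SFT is listed because it "can also appear" on preference-style flows.
    "preference": ["DPO", "SFT"],
    # The GRPO family can still appear on simulation paths.
    "simulation": [
        "TurnPPO", "TokenPPO", "TurnREINFORCE", "TokenREINFORCE",
        "GRPO", "GSPO", "CISPO",
    ],
}


def available_algorithms(source_kind):
    """Return the algorithms this page lists for a given source kind."""
    try:
        return list(ALGORITHMS_BY_SOURCE[source_kind])
    except KeyError:
        raise ValueError(f"unknown source kind: {source_kind!r}") from None


simulation_choices = available_algorithms("simulation")
```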

Network and training forms

For LLM runs, you pick a pretrained model on the Agent step, then fill in the algorithm-specific fields and a Training subsection whose contents change with the algorithm (standard LLM training fields vs simulation-specific fields). Switching algorithms reloads the defaults for that algorithm and environment pair.
