LLM algorithms

Advanced Training experiments that use a reasoning dataset, a preference dataset, or an LLM simulation environment offer the algorithms listed on this page. The Agent step filters the list based on the dataset or environment you chose.

Reasoning datasets (GRPO family)

Typical choices when a Reasoning dataset is attached:

| Name | Role in Arena |
| --- | --- |
| GRPO | Group-based training without a separate value network |
| GSPO | Variant aimed at steadier group updates |
| CISPO | More conservative policy updates within the same family |
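To make "group-based training without a separate value network" concrete, here is a minimal sketch of the group-relative advantage commonly described for GRPO: several completions are sampled for the same prompt, and each reward is normalized against the group's own mean and spread, so the group acts as the baseline. The function name `group_relative_advantages`, the `eps` term, and the use of the population standard deviation are assumptions for illustration; this is not Arena's implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other completions of the same prompt.

    Hypothetical helper: the group mean replaces a learned value baseline.
    """
    baseline = mean(rewards)
    spread = pstdev(rewards)  # population std; implementations may differ
    return [(r - baseline) / (spread + eps) for r in rewards]


# Four completions for one prompt, scored 1.0 (correct) or 0.0 (incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Completions that score above the group mean receive a positive advantage and the rest a negative one, which is why no separate value network is needed to supply a baseline.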

Preference data

| Name | Role |
| --- | --- |
| DPO | Direct preference optimization from chosen vs rejected pairs |

SFT (supervised fine-tuning) can also appear when the flow is configured for preference-style data. For a dedicated SFT dataset, use Supervised training instead.
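As an illustration of how DPO learns from chosen vs rejected pairs, the sketch below computes the standard DPO loss for a single pair from sequence log-probabilities under the policy and a frozen reference model. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions for illustration, not values taken from Arena.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) written in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: the loss is log(2).
neutral = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy favors the chosen response more than the reference does: lower loss.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

The loss falls as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference model rather than in absolute terms.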

LLM simulation environments

When Environment uses a language-based gym simulation instead of a static dataset export, the Agent step emphasizes these algorithms:

| Name | Role |
| --- | --- |
| TurnPPO | PPO with turn-level actions |
| TokenPPO | PPO with per-token actions |
| TurnREINFORCE | REINFORCE at turn granularity |
| TokenREINFORCE | REINFORCE at token granularity |

GRPO, GSPO, and CISPO can still appear on simulation paths.
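The turn vs token distinction above comes down to what counts as one action. The sketch below, using a plain REINFORCE surrogate, treats either each whole turn or each token as an action; a turn's log-probability is the sum of its token log-probabilities. The helper names, the toy numbers, the mean over actions, and the choice to give every token its turn's return are all assumptions for illustration, and Arena may assign credit differently.

```python
def turn_log_probs(token_log_probs, turn_lengths):
    """Sum token log-probs within each turn to get one log-prob per turn."""
    out, start = [], 0
    for n in turn_lengths:
        out.append(sum(token_log_probs[start:start + n]))
        start += n
    return out


def reinforce_loss(log_probs, returns):
    """REINFORCE surrogate: negative mean of log-prob times return per action."""
    return -sum(lp * g for lp, g in zip(log_probs, returns)) / len(log_probs)


# Two turns of two tokens each (hypothetical values).
token_lp = [-0.5, -1.0, -0.25, -0.25]
turn_lengths = [2, 2]
turn_returns = [1.0, 2.0]

# Turn granularity: one action per turn.
turn_lp = turn_log_probs(token_lp, turn_lengths)
turn_loss = reinforce_loss(turn_lp, turn_returns)

# Token granularity: one action per token; here each token inherits its turn's return.
token_returns = [g for g, n in zip(turn_returns, turn_lengths) for _ in range(n)]
token_loss = reinforce_loss(token_lp, token_returns)
```

With these inputs the two losses differ only in how they are averaged over actions. The granularities diverge further once per-token returns or per-token clipping (as in PPO) are introduced.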

Network and training forms

For an LLM run, you pick a pretrained model on the Agent step, then fill in the algorithm-specific fields and a Training subsection. The Training subsection changes with the algorithm, showing either standard LLM training fields or simulation-specific fields. Switching algorithms reloads the defaults for the new combination of algorithm and environment.