Troubleshooting¶

Pick the section that matches what you see in the app. Each block lists likely causes and links to the guide that walks through the fix. Status labels, plan limits, and shared terms are on the reference pages if you need a definition.

Reference	Use when
Glossary	Terms (org, project, training configuration, resource class)
Experiment statuses	Status values and transitions
Algorithms	Algorithm names and project-type fit
Plan permissions	Seats, storage, deployments, who can change billing
Finding your way	Sidebar and profile menu screens

Training credits¶

Training jobs debit credits while they run. The rate depends on resource class and runtime. Managers can see each debit on Usage (profile menu).
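As a rough illustration of how resource class and runtime combine into a debit, the sketch below multiplies an hourly rate by the hours a job ran. The class names, the rates, and the assumption that billing is linear in runtime are all hypothetical placeholders, not the platform's actual pricing; Usage remains the source of truth for real debits.

```python
from datetime import timedelta

# Hypothetical credits-per-hour rates keyed by resource class name.
# Illustrative values only; real rates come from the platform.
HYPOTHETICAL_RATES = {"small-cpu": 1.0, "single-gpu": 4.0}


def estimate_debit(resource_class: str, runtime: timedelta) -> float:
    """Estimate credits debited for one job, assuming hourly rate x hours run."""
    rate = HYPOTHETICAL_RATES[resource_class]
    hours = runtime.total_seconds() / 3600
    return rate * hours


def can_schedule(balance: float) -> bool:
    """At zero balance scheduling may fail, so only a positive balance counts as safe."""
    return balance > 0
```

With these placeholder rates, `estimate_debit("single-gpu", timedelta(minutes=90))` gives 6.0 credits.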

Train disabled or job fails right after submit. Check the profile dropdown or Usage for the remaining balance. At zero, scheduling may fail or stop until a Manager tops up. Free plans start with a fixed pool (see credits and plans); paid plans refresh monthly. Managers buy packs on the Billing and members page in the profile menu.

You expected more credits. Top-ups and plan grants appear as Top Up on the daily ledger. The progress bar on Usage tracks the plan allowance only; one-off purchases are not counted in it.

Email about low balance. Turn Low credit alerts on or off under profile and CLI keys. Alerts fire from account rules; they do not block training by themselves.

Member cannot see usage. Only Managers can open Usage and billing. Members can still run experiments if the org has credits; see plan permissions.

Validation failures¶

Custom gym environments¶

Validation runs when you commit a version. A failed commit sets status Failed; fix the entrypoint or code and commit again. The side panel checklist shows which checks failed; warnings alone do not block a green pass if there are no errors.

After you change code or dependencies, artifacts may go stale. Re-run validation before you trust charts or launch training. Launch experiment from the versions table requires three things: status Validated, CPU and RAM per env available from profiling, and profiling no longer in its active window (about fifteen minutes while status is Profiling). Details: validation and profiling.
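The launch conditions above can be read as a simple checklist. The sketch below is one way to express it; the `EnvVersion` type and its field names are hypothetical and do not mirror any real Arena API or CLI output.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EnvVersion:
    """Hypothetical view of one row in the versions table."""
    status: str                      # e.g. "Validated", "Profiling", "Failed"
    cpu_per_env: Optional[float]     # filled in by profiling
    ram_per_env_gb: Optional[float]  # filled in by profiling


def launch_blockers(version: EnvVersion) -> List[str]:
    """Return reasons Launch experiment would be unavailable (empty if none)."""
    blockers = []
    if version.status == "Profiling":
        blockers.append("profiling still in its active window")
    elif version.status != "Validated":
        blockers.append(f"status is {version.status}, not Validated")
    if version.cpu_per_env is None or version.ram_per_env_gb is None:
        blockers.append("CPU and RAM per env not yet available from profiling")
    return blockers
```

An empty list means all three conditions hold; otherwise each entry names one thing to fix or wait for.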

Experiment wizard¶

On Draft experiments, Next and Save stay blocked until the current step passes validation: resource chosen, environment set, algorithm form valid, advanced dataset rules satisfied (for example, the reasoning reward validated), and so on. The toast names the blocker. See wizard overview.

Train on Summary only appears when the experiment is runnable (resource validated and configuration complete). Custom gym experiments without a usable implementation can keep Train off even when the rest of the wizard looks done.

Datasets¶

Upload and format issues are type-specific. Start from datasets and the page for your format (tabular, preference, SFT, reasoning, non-tabular).

Stop training¶

In the UI the action is Stop training. It is available while status is Running, Stopping, or Pending.

After you stop, status becomes Stopping, then Stopped (or Failed, depending on the run). Draft experiments and terminal states such as Succeeded do not offer stop. Full flow: train, halt, and resume. CLI users: CLI commands.
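The stop rules above amount to a small state check. The sketch below encodes them under the assumption that the listed labels are the relevant ones; the `Status` enum is illustrative and may not cover every value on the experiment statuses page.

```python
from enum import Enum


class Status(Enum):
    """Partial, illustrative set of experiment statuses named on this page."""
    DRAFT = "Draft"
    PENDING = "Pending"
    RUNNING = "Running"
    STOPPING = "Stopping"
    STOPPED = "Stopped"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"


# Statuses in which Stop training is offered.
STOPPABLE = {Status.RUNNING, Status.STOPPING, Status.PENDING}

# Statuses a run may settle in after passing through Stopping.
STOP_OUTCOMES = {Status.STOPPED, Status.FAILED}


def can_stop(status: Status) -> bool:
    """True when the UI would offer Stop training for this status."""
    return status in STOPPABLE


def after_stop_request(status: Status) -> Status:
    """Status immediately after a stop request; raises if stop is not offered."""
    if not can_stop(status):
        raise ValueError(f"Stop training is not available in {status.value}")
    return Status.STOPPING
```

For example, `after_stop_request(Status.RUNNING)` returns `Status.STOPPING`, while a Draft or Succeeded experiment raises `ValueError`.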

Deployment health¶

Agent rows use Undeployed, Pending, Deployed, and Failed. Experiment run statuses are separate; see experiment statuses.

Stuck on Pending after Connect. The platform copies the checkpoint and starts inference. Wait until status becomes Deployed or Failed. Large checkpoints take longer. Check deployment limits on Usage and plan permissions.

Was Deployed, now Pending again. Health checks can mark an endpoint unreachable and move status back to Pending while retrying, or to Failed if recovery fails. Disconnect, wait, and Connect again if Failed persists.

Failed. Read any error text the UI shows, fix the source checkpoint or config, and redeploy. Invoke and playground errors (server errors or an empty completion) often mean the endpoint is not ready; see invoke and inference contract.

Create flow: create deployment. Agents table: agents.
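If you script around deployments, the "wait until Deployed or Failed" advice can be sketched as a polling loop. The `get_status` callable is a hypothetical, caller-supplied function (for instance, one that reads the agent row by whatever means you have); this sketch does not call any real Arena API, and the timeout and interval defaults are arbitrary.

```python
import time
from typing import Callable

# Agent statuses at which waiting can stop.
SETTLED = {"Deployed", "Failed"}


def wait_until_settled(
    get_status: Callable[[], str],
    timeout_s: float = 1800.0,
    interval_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status until it returns Deployed or Failed, or the timeout passes."""
    waited = 0.0
    while True:
        status = get_status()
        if status in SETTLED:
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"still {status} after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s
```

Passing `sleep` as a parameter keeps the loop testable without real delays. Because health checks can move a Deployed agent back to Pending, a caller may want to run the wait again after a later failure rather than treat the first Deployed as permanent.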

Experiment stuck in Pending¶

Right after Train. Scheduling success sets status to Pending. The train toast says work usually starts within about ten minutes when compute is ready. That window is normal.

Still Pending after ~10 minutes. Open logs once a run exists; metrics charts for scores stay empty until Running or a terminal status.
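For anyone tracking submissions in their own tooling, the ten-minute guideline can be expressed as a simple age check. This is a minimal sketch: the function name is hypothetical, and the threshold is the approximate figure from the train toast, not a hard platform limit.

```python
from datetime import datetime, timedelta

# About ten minutes, per the train toast; treated here as a soft threshold.
EXPECTED_START_WINDOW = timedelta(minutes=10)


def pending_needs_attention(submitted_at: datetime, now: datetime) -> bool:
    """True when an experiment has been Pending longer than the expected window."""
    return now - submitted_at > EXPECTED_START_WINDOW
```

Both timestamps should be either timezone-aware or naive; mixing the two raises a `TypeError` in Python.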

What you can do

Confirm the project still has credits and Train was available before you submitted (resources).
Open View logs on the row if enabled (not available for Draft).
Use Stop training while status is Pending if you want to cancel the wait.
For on-prem clusters, see the table below.

Symptom	Check
No on-prem classes on Resources	Provider Enabled? Class Enabled? Org on Enterprise? Advanced Training needs a class with at least one GPU per worker
Pending more than ~10 minutes on on-prem	Workers running? Setup bundle Current? Re-run install after Update recommended
Update recommended on class row	Re-download `.tar` or run `arena on-prem install` again (On-prem CLI)

Resume greyed out. Resume is not offered for Draft, Running, or Pending. Wait for a terminal status or stop the run first.

Status diagram: experiment statuses. Train flow: train, halt, and resume.

Other common blockers¶

Symptom	Likely cause	Read
No metrics on Results while Pending	Metrics load only for running or finished statuses	logs and metrics
Wrong algorithm choices	Classic vs advanced project	classic vs advanced, algorithms
Cannot invite members	Free plan single seat	credits and plans
CLI 401	Missing or rotated API key	CLI authentication
No on-prem classes on Resources	Provider or class disabled; wrong plan; GPU required for Advanced Training	resource classes
On-prem Update recommended	Platform training image changed	install a cluster

If something here does not match what you see in the app, note the org, experiment id, and status label and contact AgileRL support with that context.