Troubleshooting

Pick the section that matches what you see in the app. Each block lists likely causes and links to the guide that walks through the fix. Status labels, plan limits, and shared terms are on the reference pages if you need a definition.

More documentation

| Reference | Use when |
| --- | --- |
| Glossary | Terms (org, project, training configuration, resource class) |
| Experiment statuses | Status values and transitions |
| Algorithms | Algorithm names and project-type fit |
| Plan permissions | Seats, storage, deployments, who can change billing |
| Finding your way | Sidebar and profile menu screens |

Getting started: overview, quickstart, Classic vs Advanced, credits and plans.

Account and billing: account, profile and CLI keys, account deletion, organizations, billing, usage and statements.

Projects and experiments: projects, experiments, wizard, resources, environment step, agent step, training step, HPO step, train / halt / resume, logs and metrics.

Environments and data: environments, custom environments, validation and profiling, datasets (reasoning, preference, SFT, tabular, non-tabular).

Training and results: training, classic RL algorithms, LLM algorithms, supervised, training settings, results, checkpoints, pipelines.

Agents, compute, CLI: agents, create deployment, invoke, inference contract, on-prem, resource classes, training cluster, CLI, CLI auth, CLI commands.


Training credits

Training jobs debit credits while they run. The rate depends on resource class and runtime; each debit shows on Usage (profile menu) for Managers.

Train disabled or job fails right after submit. Check the profile dropdown or Usage for the remaining balance. At zero, scheduling may fail or stop until a Manager tops up. Free plans start with a fixed pool (see credits and plans); paid plans refresh monthly. Managers buy credit packs on Billing; members can check the balance in the profile menu.
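As a rough illustration of how a balance relates to runtime, the sketch below estimates a debit and how long a job could run before credits reach zero. It assumes a flat hourly rate per resource class; the class names and rates are made-up placeholders, not published pricing. The real debits are the ones listed on Usage.

```python
# Hypothetical credit rates per hour, keyed by a made-up resource class name.
# These numbers are placeholders for illustration, not real pricing.
EXAMPLE_RATES_PER_HOUR = {"small-cpu": 2.0, "single-gpu": 10.0}


def estimated_debit(resource_class: str, runtime_hours: float) -> float:
    """Estimate credits debited, assuming a flat hourly rate per class."""
    if runtime_hours < 0:
        raise ValueError("runtime_hours must be non-negative")
    return EXAMPLE_RATES_PER_HOUR[resource_class] * runtime_hours


def hours_until_empty(balance: float, resource_class: str) -> float:
    """Estimate how many hours a job could run before the balance hits zero."""
    # A negative balance is treated as zero so the result is never negative.
    return max(balance, 0.0) / EXAMPLE_RATES_PER_HOUR[resource_class]
```

Comparing `hours_until_empty` with the expected length of a run gives an early hint that a Manager may need to top up before training starts.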

You expected more credits. Top-ups and plan grants appear as Top Up on the daily ledger. The progress bar on Usage tracks the plan allowance only; one-off purchases are not included in it.

Email about low balance. Turn Low credit alerts on or off under profile and CLI keys. Alerts fire from account rules; they do not block training by themselves.

Member cannot see usage. Only Managers can open Usage and Billing. Members can still run experiments if the org has credits; see plan permissions.


Validation failures

Custom gym environments

Validation runs when you commit a version. A failed commit sets status Failed; fix the entrypoint or code and commit again. The side panel checklist shows which checks failed; warnings alone do not block a green pass if there are no errors.

After you change code or dependencies, artifacts may go stale. Re-run validation before you trust charts or launch training. Launch experiment from the versions table requires three things: status Validated, CPU and RAM per env from profiling, and profiling no longer in its active window (about fifteen minutes while status is Profiling). Details: validation and profiling.
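The launch conditions above can be read as a simple checklist. The sketch below is one way to express them. `EnvVersion` and its field names are hypothetical stand-ins for a row of the versions table, not part of any documented API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EnvVersion:
    """Hypothetical stand-in for one row of the versions table."""
    status: str                      # e.g. "Validated", "Failed", "Profiling"
    cpu_per_env: Optional[float]     # filled in by profiling, else None
    ram_per_env_gb: Optional[float]  # filled in by profiling, else None


def launch_blockers(version: EnvVersion) -> List[str]:
    """Return reasons Launch experiment would be unavailable (empty if none)."""
    blockers = []
    if version.status == "Profiling":
        blockers.append("profiling is still in its active window")
    elif version.status != "Validated":
        blockers.append(f"status is {version.status}, not Validated")
    if version.cpu_per_env is None or version.ram_per_env_gb is None:
        blockers.append("CPU and RAM per env are missing from profiling")
    return blockers
```

An empty list means all three conditions hold; otherwise each entry names one thing to fix before launching.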

Experiment wizard

On Draft experiments, Next and Save stay blocked until the current step passes validation: resource chosen, environment set, algorithm form valid, advanced dataset rules (for example reasoning reward validated), and similar. The toast names the blocker. See wizard overview.

Train on Summary only appears when the experiment is runnable (resource validated and configuration complete). Custom gym experiments without a usable implementation can keep Train off even when the rest of the wizard looks done.

Datasets

Upload and format issues are type-specific. Start from datasets and the page for your format (tabular, preference, SFT, reasoning, non-tabular).


Stop training

In the UI the action is Stop training. It is available while status is Running, Stopping, or Pending.

After you stop, status becomes Stopping, then Stopped (or Failed depending on the run). Draft, Succeeded, and other terminal states do not offer stop. Full flow: train, halt, and resume. CLI users: CLI commands.
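If you script around status labels (for example, when processing exported run data), the rules above reduce to two set lookups. This is a minimal sketch; the function names are illustrative and the status strings are taken from the text above.

```python
# Statuses in which the text above says Stop training is available.
STOPPABLE_STATUSES = frozenset({"Running", "Stopping", "Pending"})

# Statuses the text above names as the end of a stop request.
STOP_OUTCOMES = frozenset({"Stopped", "Failed"})


def can_stop(status: str) -> bool:
    """Return True if Stop training is offered for this status label."""
    return status in STOPPABLE_STATUSES


def stop_finished(status: str) -> bool:
    """Return True once a stopped run has settled (Stopped or Failed)."""
    return status in STOP_OUTCOMES
```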


Deployment health

Agent rows use Undeployed, Pending, Deployed, and Failed. Experiment run statuses are separate; see experiment statuses.

Stuck on Pending after Connect. The platform copies the checkpoint and starts inference. Wait until status becomes Deployed or Failed. Large checkpoints take longer. Check deployment limits on Usage and plan permissions.

Was Deployed, now Pending again. Health checks can mark an endpoint unreachable and move status back to Pending while retrying, or to Failed if recovery fails. Disconnect, wait, and Connect again if Failed persists.

Failed. Read any error text the UI shows, fix the source checkpoint or config, and redeploy. Invoke and playground errors (server errors or empty completion) often mean the endpoint is not ready: invoke, inference contract.
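The "wait until Deployed or Failed" advice can be sketched as a polling loop with a timeout. The `get_status` callable is an assumption: it stands for whatever way you have of reading the agent's current status label, and this page does not document such a call. The timeout and poll interval are arbitrary example values.

```python
import time
from typing import Callable

# Status labels after which there is nothing more to wait for.
SETTLED_STATUSES = frozenset({"Deployed", "Failed"})


def wait_for_deployment(
    get_status: Callable[[], str],   # hypothetical status reader you supply
    timeout_s: float = 600.0,        # example value, not a platform limit
    poll_s: float = 10.0,            # example value
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the status is Deployed or Failed, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in SETTLED_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status} after {timeout_s} seconds")
        sleep(poll_s)
```

A return value of Failed is the cue to read the error text in the UI; a timeout while still Pending suggests checking deployment limits, as described above.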

Create flow: create deployment. Agents table: agents.


Experiment stuck in Pending

Right after Train. Scheduling success sets status to Pending. The train toast says work usually starts within about ten minutes when compute is ready. That window is normal.

Still Pending after ~10 minutes. Open logs once a run exists; metrics charts for scores stay empty until Running or a terminal status.

What you can do

  1. Confirm the project still has credits and Train was available before you submitted (resources).

  2. Open View logs on the row if enabled (not available for Draft).

  3. Use Stop training while status is Pending if you want to cancel the wait.

  4. On-prem clusters: see the table below.

| Symptom | Check |
| --- | --- |
| No on-prem classes on Resources | Provider Enabled? Class Enabled? Org on Enterprise? Advanced Training needs a class with at least one GPU per worker. |
| Pending more than ~10 minutes on on-prem | Workers running? Setup bundle Current? Re-run install after Update recommended. |
| Update recommended on class row | Re-download .tar or run arena on-prem install again (On-prem CLI). |
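The Pending checks above can be summarized as a small decision helper. This is only a sketch of the guidance on this page: the function name is hypothetical, and the ten-minute threshold mirrors the approximate window quoted in the train toast rather than a hard limit.

```python
from typing import List

# Approximate start window quoted above; treated here as a soft threshold.
USUAL_START_WINDOW_MIN = 10


def pending_checks(minutes_pending: float, on_prem: bool) -> List[str]:
    """Return the checks this page suggests for a run that is still Pending."""
    if minutes_pending <= USUAL_START_WINDOW_MIN:
        return ["Within the usual start window; wait, or use Stop training to cancel."]
    checks = [
        "Confirm the project still has credits.",
        "Open View logs on the row if it is enabled.",
    ]
    if on_prem:
        # Extra checks that apply only to on-prem clusters.
        checks.append("Confirm workers are running.")
        checks.append("Confirm the setup bundle is Current; re-run install after Update recommended.")
    return checks
```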

Resume greyed out. Resume is hidden for Draft, Running, and Pending. Wait for a terminal status or stop the run first.

Status diagram: experiment statuses. Train flow: train, halt, and resume.


Other common blockers

| Symptom | Likely cause | Read |
| --- | --- | --- |
| No metrics on Results while Pending | Metrics load only for running or finished statuses | logs and metrics |
| Wrong algorithm choices | Classic vs advanced project | classic vs advanced, algorithms |
| Cannot invite members | Free plan single seat | credits and plans |
| CLI 401 | Missing or rotated API key | CLI authentication |
| No on-prem classes on Resources | Provider or class disabled; wrong plan; GPU required for Advanced Training | resource classes |
| On-prem Update recommended | Platform training image changed | install a cluster |

If something here does not match what you see in the app, note the org, experiment id, and status label and contact AgileRL support with that context.