Troubleshooting¶
Pick the section that matches what you see in the app. Each block lists likely causes and links to the guide that walks through the fix. Status labels, plan limits, and shared terms are on the reference pages if you need a definition.
More documentation¶
| Reference | Use when |
|---|---|
| Terms (org, project, training configuration, resource class) | |
| Status values and transitions | |
| Algorithm names and project-type fit | |
| Seats, storage, deployments, who can change billing | |
| Sidebar and profile menu screens | |
Getting started: overview, quickstart, Classic vs Advanced, credits and plans.
Account and billing: account, profile and CLI keys, account deletion, organizations, billing, usage and statements.
Projects and experiments: projects, experiments, wizard, resources, environment step, agent step, training step, HPO step, train / halt / resume, logs and metrics.
Environments and data: environments, custom environments, validation and profiling, datasets (reasoning, preference, SFT, tabular, non-tabular).
Training and results: training, classic RL algorithms, LLM algorithms, supervised, training settings, results, checkpoints, pipelines.
Agents, compute, CLI: agents, create deployment, invoke, inference contract, on-prem, resource classes, training cluster, CLI, CLI auth, CLI commands.
Training credits¶
Training jobs debit credits while they run. The rate depends on resource class and runtime; each debit shows on Usage (profile menu) for Managers.
Train disabled or job fails right after submit. Check the profile dropdown or Usage for the remaining balance. At zero, scheduling may fail or stop until a Manager tops up. Free plans start with a fixed pool (see credits and plans); paid plans refresh monthly. Managers buy packs on Billing; members can check the balance in the profile menu.
You expected more credits. Top-ups and plan grants appear as Top Up on the daily ledger. The progress bar on Usage reflects the plan allowance only; one-off purchases are not counted in it.
Email about low balance. Turn Low credit alerts on or off under profile and CLI keys. Alerts fire from account rules; they do not block training by themselves.
Member cannot see usage. Only Managers open Usage and billing. Members still run experiments if the org has credits; see plan permissions.
Validation failures¶
Custom gym environments¶
Validation runs when you commit a version. A failed commit sets status Failed; fix the entrypoint or code and commit again. The side panel checklist shows which checks failed; warnings alone do not block a green pass if there are no errors.
After you change code or dependencies, artifacts may go stale. Re-run validation before you trust charts or launch training. Launch experiment from the versions table needs three things: status Validated, CPU and RAM per env from profiling, and profiling no longer in its active window (about fifteen minutes while status is Profiling). Details: validation and profiling.
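The launch preconditions above can be read as a single check. The sketch below is illustrative only: the `EnvVersion` record, its field names, and `launch_blockers` are hypothetical and do not correspond to any platform API. Only the status labels and the conditions themselves come from this page.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EnvVersion:
    """Hypothetical snapshot of one row in the versions table."""
    status: str                      # e.g. "Validated", "Profiling", "Failed"
    cpu_per_env: Optional[float]     # filled in by profiling, else None
    ram_per_env_gb: Optional[float]  # filled in by profiling, else None


def launch_blockers(version: EnvVersion) -> List[str]:
    """Return the reasons Launch experiment would be unavailable (empty if none)."""
    blockers = []
    if version.status == "Profiling":
        blockers.append("profiling is still in its active window")
    elif version.status != "Validated":
        blockers.append(f"status is {version.status}, not Validated")
    if version.cpu_per_env is None or version.ram_per_env_gb is None:
        blockers.append("CPU and RAM per env are missing")
    return blockers
```

An empty list means all conditions listed on this page are met; anything else names what to fix before you retry.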
Experiment wizard¶
On Draft experiments, Next and Save stay blocked until the current step passes validation: resource chosen, environment set, algorithm form valid, advanced dataset rules (for example reasoning reward validated), and similar. The toast names the blocker. See wizard overview.
Train on Summary only appears when the experiment is runnable (resource validated and configuration complete). Custom gym experiments without a usable implementation can keep Train off even when the rest of the wizard looks done.
Datasets¶
Upload and format issues are type-specific. Start from datasets and the page for your format (tabular, preference, SFT, reasoning, non-tabular).
Stop training¶
In the UI the action is Stop training. It is available while status is Running, Stopping, or Pending.
After you stop, status becomes Stopping, then Stopped (or Failed, depending on the run). Draft experiments and terminal states such as Succeeded do not offer stop. Full flow: train, halt, and resume. CLI users: CLI commands.
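If you script around experiment statuses, the stop rules above reduce to two small lookups. This is a minimal sketch, assuming you already have the status label as a string; the function names are hypothetical, and only the status labels are taken from this page.

```python
# Statuses on which the UI offers Stop training, per this page.
STOPPABLE = {"Running", "Stopping", "Pending"}

# Statuses a run settles in after a stop request, per this page.
AFTER_STOP = {"Stopped", "Failed"}


def can_stop(status: str) -> bool:
    """True if Stop training should be offered for this status label."""
    return status in STOPPABLE


def stop_finished(status: str) -> bool:
    """True once a stop request has settled (no longer Stopping)."""
    return status in AFTER_STOP
```

For example, `can_stop("Draft")` is false, and `stop_finished("Stopping")` stays false until the run reaches Stopped or Failed.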
Deployment health¶
Agent rows use Undeployed, Pending, Deployed, and Failed. Experiment run statuses are separate; see experiment statuses.
Stuck on Pending after Connect. The platform copies the checkpoint and starts inference. Wait until status becomes Deployed or Failed. Large checkpoints take longer. Check deployment limits on Usage and plan permissions.
Was Deployed, now Pending again. Health checks can mark an endpoint unreachable and move status back to Pending while retrying, or to Failed if recovery fails. Disconnect, wait, and Connect again if Failed persists.
Failed. Read any error text the UI shows, fix the source checkpoint or config, and redeploy. Invoke and playground errors (server errors or empty completion) often mean the endpoint is not ready: invoke, inference contract.
Create flow: create deployment. Agents table: agents.
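The advice to wait until status becomes Deployed or Failed can be sketched as a polling loop. The example assumes you supply your own `get_status` callable that returns the agent row's status label; this page does not document such a call, so treat it as a placeholder. The timeout and polling interval are arbitrary illustrative defaults, not platform values.

```python
import time
from typing import Callable

# Statuses at which waiting should end, per this page.
SETTLED = {"Deployed", "Failed"}


def wait_for_deployment(
    get_status: Callable[[], str],   # hypothetical: returns the agent row status
    timeout_s: float = 1800.0,       # assumed default, not a platform limit
    poll_s: float = 15.0,            # assumed default
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until status is Deployed or Failed, or until the timeout passes.

    Returns the last status seen. Any value outside SETTLED means the
    wait timed out while the deployment was still in progress.
    """
    deadline = time.monotonic() + timeout_s
    status = get_status()
    while status not in SETTLED and time.monotonic() < deadline:
        sleep(poll_s)
        status = get_status()
    return status
```

Because a Deployed endpoint can move back to Pending during health-check retries, a one-off wait like this is a snapshot; re-run it if the status changes later.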
Experiment stuck in Pending¶
Right after Train. Scheduling success sets status to Pending. The train toast says work usually starts within about ten minutes when compute is ready. That window is normal.
Still Pending after ~10 minutes. Open logs once a run exists; metrics charts for scores stay empty until Running or a terminal status.
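The two cases above differ only in how long the experiment has been Pending. A minimal sketch of that triage follows, treating the "about ten minutes" from the train toast as a soft threshold; the function name and the returned wording are hypothetical.

```python
from datetime import datetime, timedelta

# "About ten minutes" from the train toast; a soft threshold, not a guarantee.
NORMAL_PENDING = timedelta(minutes=10)


def pending_advice(submitted_at: datetime, now: datetime) -> str:
    """Classify a Pending experiment by how long it has waited."""
    waited = now - submitted_at
    if waited <= NORMAL_PENDING:
        return "normal: work usually starts within about ten minutes"
    return "investigate: check credits, logs, and on-prem workers"
```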
What you can do:
- Confirm the project still has credits and Train was available before you submitted (resources).
- Open View logs on the row if enabled (not available for Draft).
- Use Stop training while status is Pending if you want to cancel the wait.
- On-prem clusters: see the table below.
| Symptom | Check |
|---|---|
| No on-prem classes on Resources | Provider Enabled? Class Enabled? Org on Enterprise? Advanced Training needs a class with at least one GPU per worker |
| Pending more than ~10 minutes on on-prem | Workers running? Setup bundle Current? Re-run install after Update recommended |
| Update recommended on class row | Re-download |
Resume greyed out. Resume is unavailable for Draft, Running, and Pending. Wait for a terminal status or stop the run first.
Status diagram: experiment statuses. Train flow: train, halt, and resume.
Other common blockers¶
| Symptom | Likely cause | Read |
|---|---|---|
| No metrics on Results while Pending | Metrics load only for running or finished statuses | |
| Wrong algorithm choices | Classic vs advanced project | |
| Cannot invite members | Free plan single seat | |
| CLI 401 | Missing or rotated API key | |
| No on-prem classes on Resources | Provider or class disabled; wrong plan; GPU required for Advanced Training | |
| On-prem Update recommended | Platform training image changed | |
If something here does not match what you see in the app, note the org, experiment id, and status label and contact AgileRL support with that context.