Install a cluster¶

Enterprise only

On-prem compute is available on organizations with an Enterprise plan. Without it, the On-prem training cluster menu item is hidden. See Plan permissions.

Connect your hardware to Arena after a Manager has enabled on-prem and defined at least one resource class (unless you use the CLI path that creates the class for you).

Prerequisites¶

Enterprise organization
Arena CLI installed (pip install "agilerl[arena]") and arena login or ARENA_API_KEY if you use the CLI (Authentication)
Worker sizing in the class matches the machines you will train on

Install with Docker Swarm or Helm on Kubernetes — one path per cluster, not both:

Path	You need
Docker Swarm	SSH access to the manager and worker hosts from the machine running install
Helm	`kubectl` pointed at your target cluster

Shared storage (NFS)¶

A Ray cluster with more than one worker often needs a shared filesystem — usually NFS or an equivalent your platform team already runs — mounted at the same path on the head node and every worker. Training jobs may use it for dataset caches, checkpoints, or other files that must be visible across nodes.

Arena’s generated bundles connect WireGuard and Ray head/workers. They do not run an NFS server for you. Plan shared storage alongside SSH or kubectl access.

Install type	Typical approach
Docker Swarm	Generated `arena-stack.yaml` mounts `/opt/data` on `ray-head` and `ray-worker` (local volume per node by default). For multi-worker pools, run NFS (or use an existing export), mount it on the manager and GPU workers at the same host path, then replace those mounts with a bind to that path before deploy (see the README tab in the Swarm bundle).
Helm	`/opt/data` is always mounted on ray-head and ray-worker pods (`emptyDir` when `nfs.enabled: false`). For shared storage, set `nfs.enabled: true` in `chart/values.yaml` and either `nfs.server` + `nfs.path` for a direct NFS mount, or `nfs.existingClaim` for a PVC your platform team created. Default `nfs.mountPath` is `/opt/data` (same as cloud Arena Ray clusters). See the README tab in the Helm bundle for examples.

Install type

Typical approach

Docker Swarm

Generated arena-stack.yaml mounts /opt/data on ray-head and ray-worker (local volume per node by default). For multi-worker pools, run NFS (or use an existing export), mount it on the manager and GPU workers at the same host path, then replace those mounts with a bind to that path before deploy (see the README tab in the Swarm bundle).

Helm

/opt/data is always mounted on ray-head and ray-worker pods (emptyDir when nfs.enabled: false). For shared storage, set nfs.enabled: true in chart/values.yaml and either nfs.server + nfs.path for a direct NFS mount, or nfs.existingClaim for a PVC your platform team created. Default nfs.mountPath is /opt/data (same as cloud Arena Ray clusters). See the README tab in the Helm bundle for examples.

Single-node smoke tests may run without shared storage; production multi-GPU pools should assume you need it unless your team has confirmed otherwise for your experiment types.

Step 1: Enable on-prem and add a class¶

UI path (most teams):

Profile menu → On-prem training cluster.
Enable on-prem.
Add resource class — set Name, Number of nodes, and Compute resource (per worker node) fields.

CLI path: with the Arena CLI installed, arena on-prem install can enable the provider and create the class named in the command when they do not exist yet. You still need an Enterprise org and a logged-in CLI session.

For a second hardware pool (different GPU type or site), add another class with a distinct Name, size CPUs / GPUs / Memory for those workers, and run install again on the matching hosts.

Step 2: Choose Docker Swarm or Helm¶

Pick one path per cluster — Swarm or Kubernetes (Helm), not both. On the training cluster page, expand Config on the class row and use the Docker Swarm / Helm toggle. The install command and downloaded bundle match the selected type.

Type	Where install runs	Host access
Docker Swarm	SSH to manager and workers	CLI installs Docker on fresh nodes when needed
Helm	Your laptop against `kubectl`	No SSH; uses the current kube context

Step 3: Run install¶

Arena CLI (recommended)¶

Install the CLI first (pip install "agilerl[arena]" — see Arena CLI). Copy the command from the Arena CLI (recommended) block at the top of the Config panel (it uses the class Name from that row), or run:

arena login
arena on-prem install MY-CLASS --setup-type dockerSwarm \
  --manager MANAGER_HOST --workers gpu1.example.com,gpu2.example.com

Helm on your cluster:

arena on-prem install MY-CLASS --setup-type helm

See On-prem CLI for teardown and flags.

When copying from Config, the class already exists — the command installs workers for that saved class. A standalone arena on-prem install NEW-NAME from your terminal can also create the provider and class first, then install.

Download setup (.tar)¶

In Config, choose Swarm or Helm, then Download setup (.tar).
Copy the archive to a bastion, extract it, cd arena-train.
Run ./setup.sh or follow the README tab in the same panel.

Step 4: Verify¶

On On-prem training cluster, the class Setup bundle column should show Current.
Create a test experiment, open Resources, and confirm your on-prem class appears (sorted above cloud classes, zero credits).
Train and confirm the run leaves Pending within the usual window (see troubleshooting).

Upgrade when Update recommended appears¶

Arena shipped a new training image. Re-run arena on-prem install for that class or download a fresh .tar and roll out on your cluster until Setup bundle returns to Current.

Teardown¶

Remove the stack on your side before deleting the class in Arena:

arena on-prem teardown MY-CLASS --setup-type dockerSwarm --manager MANAGER_HOST
arena on-prem teardown MY-CLASS --setup-type helm

Then Delete the class on the training cluster page if you no longer need it.

Install a cluster¶

Prerequisites¶

Shared storage (NFS)¶

Step 1: Enable on-prem and add a class¶

Step 2: Choose Docker Swarm or Helm¶

Step 3: Run install¶

Arena CLI (recommended)¶

Download setup (.tar)¶

Step 4: Verify¶

Upgrade when Update recommended appears¶

Teardown¶

Related¶