Install a cluster

Enterprise only

On-prem compute is available on organizations with an Enterprise plan. Without it, the On-prem training cluster menu item is hidden. See Plan permissions.

Connect your hardware to Arena after a Manager has enabled on-prem and defined at least one resource class (unless you use the CLI path that creates the class for you).

Prerequisites

  • Enterprise organization

  • If you use the CLI: Arena CLI installed (pip install agilerl) and authenticated with arena login or ARENA_API_KEY (see Authentication)

  • Worker sizing in the class matches the machines you will train on

Install with Docker Swarm or Helm on Kubernetes — one path per cluster, not both:

Path            You need
Docker Swarm    SSH access to the manager and worker hosts from the machine running install
Helm            kubectl pointed at your target cluster
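Before starting, it can help to confirm that the tools each path relies on are on your PATH. The following is a minimal sketch, not part of the Arena bundle; check_tools is a hypothetical helper name, and the tool lists in the comments are taken from the requirements above.

```shell
#!/bin/sh
# check_tools TOOL...: print one line per tool and return non-zero
# if any of them cannot be found on PATH. (Hypothetical helper.)
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example calls, one per install path:
#   check_tools arena ssh        # Docker Swarm
#   check_tools arena kubectl    # Helm
```

The check only proves the binaries exist; it does not confirm that SSH keys or the kube context actually reach your hosts.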

Shared storage (NFS)

A Ray cluster with more than one worker often needs a shared filesystem — usually NFS or an equivalent your platform team already runs — mounted at the same path on the head node and every worker. Training jobs may use it for dataset caches, checkpoints, or other files that must be visible across nodes.

Arena’s generated bundles connect WireGuard and Ray head/workers. They do not run an NFS server for you. Plan shared storage alongside SSH or kubectl access.

Install type    Typical approach
Docker Swarm    Generated arena-stack.yaml mounts /opt/data on ray-head and ray-worker (local volume per node by default). For multi-worker pools, run NFS (or use an existing export), mount it on the manager and GPU workers at the same host path, then replace those mounts with a bind to that path before deploy (see the README tab in the Swarm bundle).
Helm            /opt/data is always mounted on ray-head and ray-worker pods (emptyDir when nfs.enabled: false). For shared storage, set nfs.enabled: true in chart/values.yaml and either nfs.server + nfs.path for a direct NFS mount, or nfs.existingClaim for a PVC your platform team created. Default nfs.mountPath is /opt/data (same as cloud Arena Ray clusters). See the README tab in the Helm bundle for examples.
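For the Helm path, the direct NFS mount settings can be sketched as a small values fragment. The helper below only prints the fragment; nfs_values is a hypothetical name, the server address and export path are placeholders, and the nested layout is inferred from the dotted key names above, so compare it against the chart/values.yaml in your bundle before using it.

```shell
#!/bin/sh
# nfs_values SERVER EXPORT_PATH: print a values fragment for a direct
# NFS mount. Key layout is an assumption inferred from the dotted names
# nfs.enabled, nfs.server, nfs.path, and nfs.mountPath.
nfs_values() {
    cat <<EOF
nfs:
  enabled: true
  server: $1
  path: $2
  mountPath: /opt/data
EOF
}

# Placeholder values; substitute your own NFS server and export.
nfs_values 10.0.0.5 /exports/arena
```

If your platform team provides a PVC instead, the text above indicates that nfs.existingClaim replaces the server and path keys.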

Single-node smoke tests may run without shared storage; production multi-GPU pools should assume you need it unless your team has confirmed otherwise for your experiment types.
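On a Swarm install, one way to confirm that a host path really is shared is to create a marker file on one host and look for it on another. This is a rough sketch under stated assumptions: check_shared is a hypothetical helper, both hosts are reachable over SSH, and the path defaults to /opt/data. It is not an Arena command.

```shell
#!/bin/sh
# check_shared HEAD_HOST WORKER_HOST [DIR]: create a marker in DIR on
# the first host, test for it on the second, then clean up.
# Returns 0 if the marker is visible, 1 if not, 2 if it could not be created.
check_shared() {
    head_host=$1
    worker_host=$2
    dir=${3:-/opt/data}
    marker="$dir/.share-check-$$"

    ssh "$head_host" "touch '$marker'" || return 2
    if ssh "$worker_host" "test -e '$marker'"; then
        echo "shared: $dir"
        status=0
    else
        echo "not shared: $dir" >&2
        status=1
    fi
    ssh "$head_host" "rm -f '$marker'"
    return "$status"
}

# Example (host names are placeholders):
#   check_shared manager-1 gpu-worker-1
```

A "not shared" result on a freshly mounted export may be worth retrying once, since NFS clients can cache directory contents briefly.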

Step 1: Enable on-prem and add a class

UI path (most teams):

  1. Profile menu → On-prem training cluster.

  2. Enable on-prem.

  3. Add resource class — set Name, Number of nodes, and Compute resource (per worker node) fields.

CLI path: with the Arena CLI installed, arena on-prem install can enable the provider and create the class named in the command when they do not exist yet. You still need an Enterprise org and a logged-in CLI session.

For a second hardware pool (different GPU type or site), add another class with a distinct Name, set CPUs, GPUs, and Memory to match those workers, and run install again on the matching hosts.

Step 2: Choose Docker Swarm or Helm

Pick one path per cluster — Swarm or Kubernetes (Helm), not both. On the training cluster page, expand Config on the class row and use the Docker Swarm / Helm toggle. The install command and downloaded bundle match the selected type.

Type            Where install runs            Host access
Docker Swarm    SSH to manager and workers    CLI installs Docker on fresh nodes when needed
Helm            Your laptop against kubectl   No SSH; uses the current kube context
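Because the Helm path acts on whatever kube context is current, a quick guard before installing can prevent deploying to the wrong cluster. The sketch below wraps kubectl config current-context; expect_context is a hypothetical helper and the context name is a placeholder.

```shell
#!/bin/sh
# expect_context NAME: succeed only when kubectl's current context is NAME.
# Returns 1 on a mismatch and 2 if kubectl itself fails.
expect_context() {
    current=$(kubectl config current-context) || return 2
    if [ "$current" = "$1" ]; then
        echo "context ok: $current"
    else
        echo "context mismatch: want $1, have $current" >&2
        return 1
    fi
}

# Example (context name is a placeholder):
#   expect_context onprem-gpu && kubectl get nodes
```

Listing nodes afterwards is a cheap way to see whether the GPU workers you sized the class for are actually present.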

Step 3: Run install

Download setup (.tar)

  1. In Config, choose Swarm or Helm, then Download setup (.tar).

  2. Copy the archive to a bastion host, extract it, and cd into arena-train.

  3. Run ./setup.sh or follow the README tab in the same panel.
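On the bastion, the extract-and-run steps above can be sketched as a short helper. run_bundle is a hypothetical name and the archive file name is a placeholder, since the actual download name is not given here; the arena-train directory and setup.sh come from the steps above.

```shell
#!/bin/sh
# run_bundle ARCHIVE [DEST]: extract the downloaded setup archive into
# DEST (default: current directory) and run setup.sh from arena-train.
run_bundle() {
    archive=$1
    dest=${2:-.}
    tar -xf "$archive" -C "$dest" || return 1
    # Run in a subshell so the caller's working directory is unchanged.
    ( cd "$dest/arena-train" && ./setup.sh )
}

# Example (archive name is a placeholder):
#   run_bundle ./setup-bundle.tar
```

Running tar -tf on the archive first lets you inspect its contents before anything executes.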

Step 4: Verify

  1. On the On-prem training cluster page, the Setup bundle column for the class should show Current.

  2. Create a test experiment, open Resources, and confirm your on-prem class appears (listed above cloud classes, at zero credits).

  3. Train and confirm the run leaves Pending within the usual window (see troubleshooting).

Teardown

Remove the stack on your side before deleting the class in Arena:

arena on-prem teardown MY-CLASS --setup-type dockerSwarm --manager MANAGER_HOST
arena on-prem teardown MY-CLASS --setup-type helm

Then Delete the class on the training cluster page if you no longer need it.
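If you script teardown across several classes, a small wrapper can select the right form of the command and catch a missing manager host early. This sketch only prints the command rather than running it; teardown_cmd is a hypothetical helper, and the flags are exactly those shown in the two commands above.

```shell
#!/bin/sh
# teardown_cmd CLASS SETUP_TYPE [MANAGER_HOST]: print the teardown
# command for the given setup type (dockerSwarm or helm).
teardown_cmd() {
    class=$1
    setup_type=$2
    manager=${3:-}
    case "$setup_type" in
        dockerSwarm)
            if [ -z "$manager" ]; then
                echo "dockerSwarm teardown needs a manager host" >&2
                return 1
            fi
            echo "arena on-prem teardown $class --setup-type dockerSwarm --manager $manager"
            ;;
        helm)
            echo "arena on-prem teardown $class --setup-type helm"
            ;;
        *)
            echo "unknown setup type: $setup_type" >&2
            return 1
            ;;
    esac
}

# Example: review the printed command, then run it yourself.
teardown_cmd MY-CLASS helm
```

Printing instead of executing keeps the destructive step under your control.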