SlurmRayClient is a drop-in replacement for RayClient that bootstraps a multi-node Ray cluster under SLURM. The head node starts ray start --head, writes the assigned GCS port to a shared file, and waits for workers to join. Worker nodes block on the port file and call ray start --block. Only the head returns from start(), so your pipeline code runs on the head while workers stay attached to the cluster.
When to Use It
Use SlurmRayClient when:
- You’re submitting Ray-backed pipelines through
sbatchand want to span more than one node. - You want a single-binary entry point: the same script works on a 1-node debug job and a 2-node production job without code changes.
For single-node SLURM jobs, the regular RayClient is sufficient.
Prerequisites
- SLURM cluster: a working SLURM cluster with at least one GPU node.
- Shared filesystem: a filesystem visible from every allocated node (NFS, Lustre, GPFS) — required for the Ray port broadcast file.
- Pyxis + enroot (recommended): for the container-based submit script. If unavailable, use the bare-metal
uvscript. - uv (alternative): for bare-metal execution. Install via astral.sh/uv.
- Source checkout: NeMo Curator source code on a shared filesystem path readable from every node.
Drop-in Replacement
Replace RayClient with SlurmRayClient. No other changes are required:
from nemo_curator.core.client import SlurmRayClient
client = SlurmRayClient()
client.start()
# ... pipeline code ...
client.stop()
The head/worker split is determined automatically from SLURM_NODEID. Workers stay inside SlurmRayClient.start() for the duration of the job; the head runs your pipeline.
Required Environment
| Variable | Required | Description |
|---|---|---|
SLURM_NODEID | Yes (set by SLURM) | Identifies the head node (SLURM_NODEID=0) vs workers. |
RAY_PORT_BROADCAST_DIR | Recommended | Directory used to broadcast the head GCS port to workers. Must be on a shared filesystem when /tmp is node-local. Defaults to ${CURATOR_DIR}/logs in the bundled submit scripts. |
RAY_TMPDIR | Optional | Per-job Ray temp directory. The bundled scripts set this to /tmp/ray_${SLURM_JOB_ID} to isolate concurrent jobs on the same node. |
Submit Scripts
The repository ships two reference submit scripts at tutorials/slurm/:
1. NGC Container (Pyxis)
submit_container.sh — launches the NGC NeMo Curator container via Pyxis on each allocated node. Recommended for production because the container ships a known-good environment.
#!/bin/bash
#SBATCH --job-name=curator-slurm-demo
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --gpus-per-node=2
#SBATCH --time=00:10:00
#SBATCH --output=logs/slurm_demo_%j.log
set -euo pipefail
# Path on shared filesystem, visible from every node
CURATOR_DIR="${CURATOR_DIR:-$(cd "$(dirname "$0")/../.." && pwd)}"
CONTAINER_IMAGE="nvcr.io/nvidia/nemo-curator:`container_version`"
# Shared filesystem path for Ray's port broadcast
export RAY_PORT_BROADCAST_DIR="${CURATOR_DIR}/logs"
export RAY_TMPDIR="/tmp/ray_${SLURM_JOB_ID}"
mkdir -p logs
# Run the pipeline inside the container on every node
srun \
--ntasks-per-node=1 \
--container-image="${CONTAINER_IMAGE}" \
--container-mounts="${CURATOR_DIR}:${CURATOR_DIR}" \
bash -c "
cd '${CURATOR_DIR}'
export RAY_TMPDIR=/tmp/ray_\${SLURM_JOB_ID}
export RAY_PORT_BROADCAST_DIR='${CURATOR_DIR}/logs'
python '${CURATOR_DIR}/tutorials/slurm/pipeline.py' --slurm --num-tasks 80
"Submit:
sbatch --nodes=2 --gpus-per-node=8 tutorials/slurm/submit_container.sh2. Bare-metal (uv)
submit.sh — activates a uv-managed virtualenv and runs the pipeline directly on the node. Useful for development and quick iteration without rebuilding containers.
#!/bin/bash
#SBATCH --job-name=curator-slurm-demo
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --gpus-per-node=2
#SBATCH --time=00:10:00
#SBATCH --output=logs/slurm_demo_%j.log
set -euo pipefail
CURATOR_DIR="${CURATOR_DIR:-$(cd "$(dirname "$0")/../.." && pwd)}"
export RAY_PORT_BROADCAST_DIR="${CURATOR_DIR}/logs"
export RAY_TMPDIR="/tmp/ray_${SLURM_JOB_ID}"
export UV_CACHE_DIR="${UV_CACHE_DIR:-${HOME}/.cache/uv}"
mkdir -p logs
srun \
--ntasks-per-node=1 \
bash -c "
cd '${CURATOR_DIR}'
export RAY_TMPDIR=/tmp/ray_\${SLURM_JOB_ID}
export RAY_PORT_BROADCAST_DIR='${CURATOR_DIR}/logs'
uv run python '${CURATOR_DIR}/tutorials/slurm/pipeline.py' --slurm --num-tasks 80
"Submit:
sbatch --nodes=2 --gpus-per-node=8 tutorials/slurm/submit.shOverride resources without editing the script:
sbatch --nodes=1 --gpus-per-node=2 tutorials/slurm/submit.sh
sbatch --nodes=1 --gpus-per-node=8 tutorials/slurm/submit.sh
sbatch --nodes=2 --gpus-per-node=2 tutorials/slurm/submit.sh
sbatch --nodes=2 --gpus-per-node=8 tutorials/slurm/submit.shVerified Configurations
End-to-end tested on H100 nodes using the nvcr.io/nvidia/nemo-curator:26.02 container at the time SlurmRayClient was introduced. The reference workload is a word-count + GPU-tagging pipeline shipped in the tutorials/slurm/ directory; it works unchanged on subsequent container versions, including container_version.
| Nodes | GPUs / node | Status | Notes |
|---|---|---|---|
| 1 | 2 | PASS | 1 node processed 80 tasks, 2× H100 80GB |
| 1 | 8 | PASS | 1 node processed 80 tasks, 8× H100 80GB |
| 2 | 2 | PASS | 2 distinct nodes each showed 2× H100 80GB |
| 2 | 8 | PASS | 2 distinct nodes each showed 8× H100 80GB |
Sample output from the 2-node, 8-GPU run:
Tasks processed by 2 distinct node(s):
pool0-00786: 8 GPU(s): NVIDIA H100 80GB HBM3, 81559 MiB; ...
pool0-00795: 8 GPU(s): NVIDIA H100 80GB HBM3, 81559 MiB; ...
Monitoring and Logs
Check job status:
squeue -u "$USER"
View logs (the bundled scripts write to logs/slurm_demo_<jobid>.log):
tail -f logs/slurm_demo_<jobid>.log
The first line of each node’s log identifies its role:
[head-node-name] SLURM_NODEID=0 ray=2.54.0 ...
[worker-node-name] SLURM_NODEID=1 joining head at <ip:port>
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
Workers hang at ray start --block indefinitely | Head node’s port broadcast file isn’t visible from worker nodes | Confirm RAY_PORT_BROADCAST_DIR points to a shared filesystem visible from every worker. |
Head node crashes with Permission denied writing the port file | Job UID can’t write to the configured directory | Pick a RAY_PORT_BROADCAST_DIR your job’s UID can write to (the default ${CURATOR_DIR}/logs works once the directory exists). |
| Pipeline runs on only one node | --nodes=1 or the script wasn’t srun-launched on every node | Confirm your job has --nodes>1 and that srun is launched on every node. |
ray start --head crashes with port conflict | Concurrent Ray jobs on the same node share /tmp/ray | Set RAY_TMPDIR=/tmp/ray_${SLURM_JOB_ID} per job (the bundled scripts already do this). |
| Container image pull fails on Pyxis | Pyxis can’t reach NGC | Configure ~/.config/enroot/.credentials with NGC API key, or pre-pull the image with enroot import. |
Performance Considerations
- GPU Memory: ensure all nodes have homogeneous GPU memory; mixed-memory clusters cause Ray to schedule conservatively against the smallest GPU.
- Network: Ray’s object store and shuffle traffic flow over the cluster’s primary network. Use InfiniBand or 100 GbE+ for large-scale jobs.
- Shared FS bandwidth: stage worker reads/writes through the shared filesystem. For S3-backed pipelines, prefer S3 over the shared FS to reduce contention.
- Time limits: SLURM kills jobs at the wall-clock limit. The bundled scripts use
--time=00:10:00for the demo; raise it to match your real workload.
Related Topics
- Deploy Image Curation on Slurm — full image curation pipeline using SLURM and Pyxis.
- Container Environments — image versions and SLURM-specific environment variables.
- Execution Backends — Ray Data, Xenna, and Ray actor pool executor selection.