Install (All Modalities)
This guide covers installing NeMo Curator with support for all modalities and verifying your installation is working correctly. For a single-modality install or a 30-minute walkthrough, start with one of the modality quickstarts instead.
Before You Start
System Requirements
For comprehensive system requirements and production deployment specifications, refer to Production Deployment Requirements.
Quick Start Requirements:
- OS: Ubuntu 24.04/22.04/20.04 (recommended)
- Python: 3.10, 3.11, or 3.12
- Memory: 16GB+ RAM for basic text processing
- GPU (optional): NVIDIA GPU with 16GB+ VRAM for acceleration
- CUDA 12 (required for
audio_cuda12,video_cuda12,image_cuda12, andtext_cuda12extras)
Development vs Production
| Use Case | Requirements | See |
|---|---|---|
| Local Development | Minimum specs listed above | Continue below |
| Production Clusters | Detailed hardware, network, storage specs | Deployment Requirements |
| Multi-node Setup | Advanced infrastructure planning | Deployment Options |
Installation Methods
Choose one of the following installation methods based on your needs:
PyPI Installation
Install NeMo Curator from the Python Package Index using uv for proper dependency resolution.
-
Install uv:
curl -LsSf https://astral.sh/uv/0.8.22/install.sh | sh source $HOME/.local/bin/env -
Create and activate a virtual environment:
uv venv source .venv/bin/activate -
Install NeMo Curator:
uv pip install torch wheel_stub psutil setuptools setuptools_scm echo "transformers==4.55.2" > override.txt uv pip install --no-build-isolation "nemo-curator[all]" --override override.txt
Source Installation
Install the latest development version directly from GitHub:
# Clone the repository
git clone https://github.com/NVIDIA-NeMo/Curator.git
cd Curator
# Install uv if not already available
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install with all extras using uv
uv sync --all-extras --all-groupsContainer Installation (Recommended for Video/Audio)
NeMo Curator is available as a standalone container on NGC: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo-curator. The container includes NeMo Curator with all dependencies pre-installed, including FFmpeg with NVENC support.
# Pull the container from NGC
docker pull nvcr.io/nvidia/nemo-curator:`container_version`
# Run the container with GPU support
docker run --gpus all -it --rm nvcr.io/nvidia/nemo-curator:`container_version`Alternatively, you can build the NeMo Curator container locally using the provided Dockerfile:
# Build the container locally
git clone https://github.com/NVIDIA-NeMo/Curator.git
cd Curator
docker build -t nemo-curator:latest -f docker/Dockerfile .
# Run the container with GPU support
docker run --gpus all -it --rm nemo-curator:latestBenefits:
- Pre-configured environment with all dependencies (FFmpeg, CUDA libraries)
- Consistent runtime across different systems
- Ideal for production deployments
Install FFmpeg and Encoders (Required for Video)
Curator’s video pipelines rely on FFmpeg for decoding and encoding. If you plan to encode clips (using --transcode-encoder h264_nvenc or --transcode-encoder libvpx-vp9), install FFmpeg with NVENC and libvpx-vp9 support. The maintained install script bundles both.
Debian/Ubuntu (Script)
Use the maintained script in the repository to build and install FFmpeg with NVIDIA NVENC and libvpx-vp9 support. The script enables --enable-cuda-nvcc, --enable-libnpp, and --enable-libvpx.
- Script source: docker/common/install_ffmpeg.sh
curl -fsSL https://raw.githubusercontent.com/NVIDIA-NeMo/Curator/main/docker/common/install_ffmpeg.sh -o install_ffmpeg.sh
chmod +x install_ffmpeg.sh
sudo bash install_ffmpeg.shVerify Installation
Confirm that FFmpeg is on your PATH and that at least one supported encoder is available:
ffmpeg -hide_banner -version | head -n 5
ffmpeg -encoders | grep -E "h264_nvenc|libvpx-vp9" | catIf encoders are missing, reinstall FFmpeg with the required options or use the Debian/Ubuntu script above.
Software H.264/HEVC/AV1 Codec Support (Advanced)
Curator’s default FFmpeg build deliberately ships NVDEC-only decoders for h264, hevc, and av1, and excludes software H.264 encoders (libopenh264, libx264, libx265). This keeps the codec footprint tight and routes every H.264/HEVC/AV1 decode through the GPU.
You may need to add software codec support in two cases:
- H.264 inputs in CPU-only pipeline stages.
VideoReaderandClipWriterinvokeffprobefrom CPU-only Ray actors that can’t see the GPU; they need a softwareh264/hevc/av1decoder to extract metadata. Without it you’ll get aSoftwareCodecMissingErrorpointing back here. - H.264 software encoding (for example, on GPUs without an NVENC encoder block such as A100 or H100, when VP9 isn’t acceptable).
Option 1: Run the bundled installer inside the container (Recommended)
The repository ships a runtime opt-in script that recompiles FFmpeg with software h264/hevc/av1 decoders enabled, optionally including the libopenh264 encoder. It runs inside an existing container — no image rebuild required.
# Inside the container — adds h264/hevc/av1 software decoders only (LGPLv3):
bash /opt/Curator/docker/common/install_h264_support.sh
# Same plus the libopenh264 software h264 ENCODER, so --transcode-encoder=libopenh264 works:
bash /opt/Curator/docker/common/install_h264_support.sh --with-libopenh264
The build takes ~5–10 minutes, replaces /usr/local/bin/{ffmpeg,ffprobe} in place, and pins to the same FFmpeg tag as the image build. Script source: docker/common/install_h264_support.sh.
License notice: the default mode adds only FFmpeg-internal decoders (LGPL). With --with-libopenh264 the binary additionally links Cisco’s OpenH264 (BSD-2-Clause + Cisco-distributed binary license — see https://www.openh264.org/BINARY_LICENSE.txt). You are responsible for any license obligations the resulting binaries impose on your distribution.
Option 2: Use the System FFmpeg
If you’re not using the Curator container, most Linux distributions ship FFmpeg with libx264 (and sometimes libopenh264) preinstalled:
sudo apt-get install -y ffmpeg
ffmpeg -hide_banner -encoders | grep -E "libx264|libopenh264"
Make sure the ffmpeg on your PATH is the one you want — it must shadow Curator’s bundled build.
Option 3: Edit install_ffmpeg.sh and Rebuild the Image
For users distributing customized images, edit docker/common/install_ffmpeg.sh before building the container:
- For software h264/hevc/av1 decoders: append
h264,hevc,av1to the--enable-decoder=...list. - For
libopenh264encoder: addlibopenh264-devto the apt list,libopenh264to--enable-encoder=..., and--enable-libopenh264to the configure flags. - For
libx264encoder: addlibx264-devto the apt list and--enable-libx264 --enable-gplto the configure flags. Note that--enable-gplmakes the resulting FFmpeg binary GPL-licensed.
Then rebuild your image.
Use the Encoder in ClipTranscodingStage
libopenh264 is accepted by ClipTranscodingStage out of the box. At setup time, the stage probes the local FFmpeg build and raises a clear error pointing back to this section if the encoder is not actually compiled in. Once your FFmpeg build includes it, just pass:
python video_split_clip_example.py ... --transcode-encoder libopenh264
For other custom encoders not in SUPPORTED_ENCODERS (for example, libx264), edit nemo_curator/stages/video/clipping/clip_extraction_stages.py to extend the tuple, and add the encoder name to the --transcode-encoder argparse choices list in tutorials/video/getting-started/video_split_clip_example.py:
SUPPORTED_ENCODERS = ("h264_nvenc", "libvpx-vp9", "libopenh264", "libx264") # add yours
Caveats
- Default options for these encoders are not tuned.
ClipTranscodingStageonly sets quality presets forh264_nvencandlibvpx-vp9. Other encoders run with FFmpeg defaults, which may produce different quality/size trade-offs than you expect — see Configure encoders for how to pass an explicit bitrate. - The NeMo Curator team does not test custom encoder configurations. Issues filed against custom encoder builds may be closed.
Package Extras
NeMo Curator provides several installation extras to install only the components you need:
| Extra | Installation Command | Description |
|---|---|---|
| text_cpu | uv pip install nemo-curator[text_cpu] | CPU-only text processing and filtering |
| text_cuda12 | uv pip install nemo-curator[text_cuda12] | GPU-accelerated text processing with RAPIDS |
| audio_cpu | uv pip install nemo-curator[audio_cpu] | CPU-only audio curation with NeMo Toolkit ASR |
| audio_cuda12 | uv pip install nemo-curator[audio_cuda12] | GPU-accelerated audio curation. When using uv, requires transformers==4.55.2 override. |
| image_cpu | uv pip install nemo-curator[image_cpu] | CPU-only image processing |
| image_cuda12 | uv pip install nemo-curator[image_cuda12] | GPU-accelerated image processing with NVIDIA DALI |
| video_cpu | uv pip install nemo-curator[video_cpu] | CPU-only video processing |
| video_cuda12 | uv pip install --no-build-isolation nemo-curator[video_cuda12] | GPU-accelerated video processing with CUDA libraries. Requires FFmpeg and additional build dependencies when using uv. |
| inference_server | uv pip install nemo-curator[inference_server] | Ray Serve + vLLM for serving LLMs alongside curation pipelines |
| sdg_cpu | uv pip install nemo-curator[sdg_cpu] | CPU-only synthetic data generation with Data Designer |
| sdg_cuda12 | uv pip install nemo-curator[sdg_cuda12] | GPU-accelerated synthetic data generation with local inference server support |
Installation Verification
After installation, verify that NeMo Curator is working correctly:
1. Basic Import Test
# Test basic imports
import nemo_curator
print(f"NeMo Curator version: {nemo_curator.__version__}")
# Test core modules
from nemo_curator.pipeline import Pipeline
from nemo_curator.tasks import DocumentBatch
print("✓ Core modules imported successfully")
2. GPU Availability Check
If you installed GPU support, verify GPU access:
# Check GPU availability
try:
import torch
if torch.cuda.is_available():
print(f"✓ GPU available: {torch.cuda.get_device_name(0)}")
print(f"✓ GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
print("⚠ No GPU detected")
# Check cuDF for GPU deduplication
import cudf
print("✓ cuDF available for GPU-accelerated deduplication")
except ImportError as e:
print(f"⚠ Some GPU modules not available: {e}")
3. Run a Quickstart Tutorial
Try a modality-specific quickstart to see NeMo Curator in action:
- Text Curation Quickstart - Set up and run your first text curation pipeline
- Audio Curation Quickstart - Get started with audio dataset curation
- Image Curation Quickstart - Curate image-text datasets for generative models
- Video Curation Quickstart - Split, encode, and curate video clips at scale