Get Started with Image Curation
This guide provides step-by-step instructions for setting up NeMo Curator’s image curation capabilities. Follow these instructions to prepare your environment and execute your first image curation pipeline.
Prerequisites
Ensure your environment meets the following prerequisites for NeMo Curator image curation modules:
- Python 3.10, 3.11, or 3.12
- packaging >= 22.0
- Ubuntu 22.04/20.04
- NVIDIA GPU (required for all image modules)
- Volta™ or higher (compute capability 7.0+)
- CUDA 12 (or above)
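A quick preflight check can catch an unsupported interpreter before installation. The sketch below covers only the Python version range listed above. As an assumption, it treats the presence of an `nvidia-smi` executable on PATH as a rough hint that an NVIDIA driver is installed; it does not verify compute capability or the CUDA version. The function names are illustrative and not part of NeMo Curator.

```python
import shutil
import sys

SUPPORTED_MINORS = (10, 11, 12)  # Python 3.10, 3.11, 3.12 per the list above


def python_version_supported(version_info=sys.version_info):
    """Return True if the interpreter is Python 3.10, 3.11, or 3.12."""
    major, minor = version_info[0], version_info[1]
    return major == 3 and minor in SUPPORTED_MINORS


def nvidia_smi_found():
    """Rough hint only: True if an `nvidia-smi` executable is on PATH."""
    return shutil.which("nvidia-smi") is not None


if __name__ == "__main__":
    print("Python version supported:", python_version_supported())
    print("nvidia-smi on PATH:", nvidia_smi_found())
```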
Installation Options
You can install NeMo Curator using one of the following methods:
PyPI Installation
Install the image modules from PyPI:
uv pip install "nemo-curator[image_cuda12]"
Source Installation
Install the latest version directly from GitHub using uv:
git clone https://github.com/NVIDIA-NeMo/Curator.git
cd Curator
uv sync --extra image_cuda12
Activate the environment and run your code:
source .venv/bin/activate
python your_script.py
Docker Container
NeMo Curator is available as a standalone container:
# Pull the container (replace `container_version` with the tag you want to use)
docker pull nvcr.io/nvidia/nemo-curator:`container_version`
# Run the container
docker run --gpus all -it --rm nvcr.io/nvidia/nemo-curator:`container_version`
Download Sample Configuration
NeMo Curator provides a working image curation example in the Image Curation Tutorial. You can adapt this pipeline for your own datasets.
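Before adapting the tutorial pipeline, it can help to confirm that the package is importable in whichever environment you chose (PyPI, source, or container). The sketch below uses only the standard library. `nemo_curator` is the top-level package name used by the imports in this guide, and `is_importable` is a hypothetical helper name.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # `nemo_curator` is the top-level package used by the imports in this guide.
    print("nemo_curator importable:", is_importable("nemo_curator"))
```

A `False` result usually means the interpreter running the check is not the one the package was installed into, for example because the virtual environment is not activated.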
Set Up Data Directory
Create directories to store your image datasets and models:
mkdir -p ~/nemo_curator/data/tar_archives
mkdir -p ~/nemo_curator/data/curated
mkdir -p ~/nemo_curator/models
For this example, you’ll need:
- Tar Archives: JPEG images in .tar files (text and JSON files are ignored during loading)
- Model Directory: CLIP and classifier model weights (downloaded automatically on first run)
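To check what an archive contains before running the pipeline, you can list its members with the standard library. The sketch below counts regular files per file extension, which shows how many JPEG images sit alongside text or JSON files. It only inspects the archive; it is not how the pipeline's reader loads images, and `count_members_by_suffix` is a hypothetical helper name.

```python
import tarfile
from collections import Counter
from pathlib import Path


def count_members_by_suffix(tar_path):
    """Count regular files in a tar archive, grouped by lowercase suffix."""
    counts = Counter()
    with tarfile.open(tar_path, "r") as archive:
        for member in archive:
            if member.isfile():
                counts[Path(member.name).suffix.lower()] += 1
    return counts


if __name__ == "__main__":
    data_dir = Path("~/nemo_curator/data/tar_archives").expanduser()
    for tar_path in sorted(data_dir.glob("*.tar")):
        print(tar_path.name, dict(count_members_by_suffix(tar_path)))
```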
Basic Image Curation Example
Here’s a simple example to get started with NeMo Curator’s image curation pipeline:
from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.file_partitioning import FilePartitioningStage
from nemo_curator.stages.image.io.image_reader import ImageReaderStage
from nemo_curator.stages.image.embedders.clip_embedder import ImageEmbeddingStage
from nemo_curator.stages.image.filters.aesthetic_filter import ImageAestheticFilterStage
from nemo_curator.stages.image.filters.nsfw_filter import ImageNSFWFilterStage
from nemo_curator.stages.image.io.image_writer import ImageWriterStage
# Create image curation pipeline
pipeline = Pipeline(name="image_curation", description="Basic image curation with quality filtering")
# Stage 1: Partition tar files for parallel processing
pipeline.add_stage(FilePartitioningStage(
    file_paths="~/nemo_curator/data/tar_archives",  # Path to your tar archive directory
    files_per_partition=1,
    file_extensions=[".tar"],
))
# Stage 2: Read images from tar files using DALI
pipeline.add_stage(ImageReaderStage(
    dali_batch_size=50,
    verbose=True,
    num_threads=4,
    num_gpus_per_worker=0.25,
))
# Stage 3: Generate CLIP embeddings for images
pipeline.add_stage(ImageEmbeddingStage(
    model_dir="~/nemo_curator/models",  # Directory containing model weights
    model_inference_batch_size=32,
    num_gpus_per_worker=0.25,
    remove_image_data=False,
    verbose=True,
))
# Stage 4: Filter by aesthetic quality (keep images with score >= 0.5)
pipeline.add_stage(ImageAestheticFilterStage(
    model_dir="~/nemo_curator/models",
    score_threshold=0.5,
    model_inference_batch_size=32,
    num_gpus_per_worker=0.25,
    verbose=True,
))
# Stage 5: Filter NSFW content (remove images with score >= 0.5)
pipeline.add_stage(ImageNSFWFilterStage(
    model_dir="~/nemo_curator/models",
    score_threshold=0.5,
    model_inference_batch_size=32,
    num_gpus_per_worker=0.25,
    verbose=True,
))
# Stage 6: Save curated images to new tar archives
pipeline.add_stage(ImageWriterStage(
    output_dir="~/nemo_curator/data/curated",
    images_per_tar=1000,
    remove_image_data=True,
    verbose=True,
))
# Execute the pipeline
executor = XennaExecutor()
pipeline.run(executor)
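The example passes paths that begin with `~`. Python does not expand `~` in plain strings, and this guide does not say whether each stage expands it, so one cautious option is to resolve the paths before constructing the stages. The sketch below uses only the standard library; `resolve_dir` is a hypothetical helper name, not part of NeMo Curator.

```python
from pathlib import Path


def resolve_dir(path_str):
    """Expand a leading `~` and return an absolute path as a string."""
    return str(Path(path_str).expanduser().resolve())


if __name__ == "__main__":
    for raw in (
        "~/nemo_curator/data/tar_archives",
        "~/nemo_curator/data/curated",
        "~/nemo_curator/models",
    ):
        print(raw, "->", resolve_dir(raw))
```

The returned strings could then be passed as `file_paths`, `model_dir`, and `output_dir` in the stages above.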
Expected Output
After running the pipeline, you’ll have:
~/nemo_curator/data/curated/
├── images-{hash}-000000.tar # Curated images (first shard)
├── images-{hash}-000000.parquet # Metadata for corresponding tar
├── images-{hash}-000001.tar # Curated images (second shard)
├── images-{hash}-000001.parquet # Metadata for corresponding tar
├── ... # Additional shards as needed
Output Format Details:
- Tar Files: Contain high-quality .jpg files that passed both aesthetic and NSFW filtering
- Parquet Files: Contain metadata for each corresponding tar file, including image paths, IDs, and processing scores
- Naming Convention: Files use hash-based prefixes (e.g., images-a1b2c3d4e5f6-000000.tar) for uniqueness across distributed processing
- Scores: Processing metadata includes aesthetic_score and nsfw_score stored in the Parquet files
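Because each tar shard has a Parquet file with the same base name, a short standard-library check can confirm that every shard in the output directory has its metadata file. This is a minimal sketch that assumes the naming layout shown above; `pair_shards` is a hypothetical helper name.

```python
from pathlib import Path


def pair_shards(output_dir):
    """Map each .tar shard name to its same-named .parquet file, or None if missing."""
    out = Path(output_dir)
    pairs = {}
    for tar_path in sorted(out.glob("*.tar")):
        parquet_path = tar_path.with_suffix(".parquet")
        pairs[tar_path.name] = parquet_path.name if parquet_path.exists() else None
    return pairs


if __name__ == "__main__":
    curated = Path("~/nemo_curator/data/curated").expanduser()
    for tar_name, parquet_name in pair_shards(curated).items():
        print(tar_name, "->", parquet_name or "MISSING metadata")
```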
Alternative: Using the Complete Tutorial
For a more comprehensive example with data download and more configuration options, see:
# Download the complete tutorial
wget -O ~/nemo_curator/image_curation_example.py https://raw.githubusercontent.com/NVIDIA/NeMo-Curator/main/tutorials/image/getting-started/image_curation_example.py
# Run with your data
python ~/nemo_curator/image_curation_example.py \
--input-wds-dataset-dir ~/nemo_curator/data/tar_archives \
--output-dataset-dir ~/nemo_curator/data/curated \
--model-dir ~/nemo_curator/models \
--aesthetic-threshold 0.5 \
--nsfw-threshold 0.5
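The two threshold flags correspond to the rules stated in the stage comments of the basic example: an image is kept when its aesthetic score is at or above the aesthetic threshold, and removed when its NSFW score is at or above the NSFW threshold. The sketch below restates that rule in plain Python. It illustrates the decision logic only and is not the library's implementation; `keep_image` is a hypothetical name.

```python
def keep_image(aesthetic_score, nsfw_score, aesthetic_threshold=0.5, nsfw_threshold=0.5):
    """Return True if an image would pass both filters under the stated rules."""
    passes_aesthetic = aesthetic_score >= aesthetic_threshold  # keep if score >= threshold
    passes_nsfw = nsfw_score < nsfw_threshold  # remove if score >= threshold
    return passes_aesthetic and passes_nsfw
```

Under these rules, raising the aesthetic threshold or lowering the NSFW threshold makes the filter stricter, so fewer images reach the curated output.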
Next Steps
Explore the Image Curation documentation for more advanced processing techniques:
- Tar Archive Loading - Learn about loading JPEG images from tar files
- CLIP Embeddings - Understand embedding generation
- Quality Filtering - Advanced aesthetic and NSFW filtering
- Complete Tutorial - Full working example with data download