Key Features
NeMo Curator is an enterprise-grade platform for scalable, privacy-aware data curation across text, image, video, and audio. It empowers teams to prepare high-quality, compliant datasets for LLM and AI training, with robust support for distributed, cloud-native, and on-premises workflows. Leading organizations trust NeMo Curator for its modular pipelines, advanced filtering, and seamless integration with modern MLOps environments.
Why NeMo Curator?
- Trusted by leading organizations for LLM and generative AI data curation
- Open source, NVIDIA-supported, and actively maintained
- Seamless integration with enterprise MLOps and data platforms
- Proven at scale: from laptops to multi-node GPU clusters
Benchmarks & Results
- Deduplicated 1.96 trillion tokens in 0.5 hours using 32 NVIDIA H100 GPUs (RedPajama V2 scale)
- Up to 80% data reduction and significant improvements in downstream model performance (see ablation studies)
- Efficient curation of Common Crawl: from 2.8TB raw to 0.52TB high-quality data in under 38 hours on 30 CPU nodes
Text Data Curation
NeMo Curator offers advanced tools for text data loading, cleaning, filtering, deduplication, and classification. Built-in modules support language identification, quality estimation, and domain and safety classification. Pipelines are fully modular and can be customized for diverse NLP and LLM training needs.
Efficiently load and manage massive text datasets, with support for common formats and scalable streaming.
Data Processing: Advanced filtering, deduplication, classification, and pipeline design for high-quality text curation.
Text Curation Quickstart: Set up your environment and run your first text curation pipeline with NeMo Curator.
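To make the filtering and deduplication steps described above concrete, here is a minimal standard-library sketch. It does not use the NeMo Curator API: the function names (`normalize`, `word_count_filter`, `exact_dedup`, `curate`) and the `min_words` threshold are hypothetical, chosen only to illustrate a heuristic quality filter followed by hash-based exact deduplication.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return re.sub(r"\s+", " ", text.strip().lower())


def word_count_filter(text: str, min_words: int = 5) -> bool:
    """Hypothetical heuristic quality filter: keep documents with enough words."""
    return len(text.split()) >= min_words


def exact_dedup(docs):
    """Keep the first occurrence of each document, compared by normalized hash."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def curate(docs, min_words: int = 5):
    """Filter first, then deduplicate the surviving documents."""
    return exact_dedup([d for d in docs if word_count_filter(d, min_words)])


docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",  # duplicate after normalization
    "Too short.",                                      # fails the word-count filter
    "A second, distinct document with enough words to keep.",
]
curated = curate(docs)
print(curated)
```

Filtering before deduplication keeps the hash set smaller; a production pipeline would typically add fuzzy or semantic deduplication on top of this exact-match pass.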
Image Data Curation
NeMo Curator supports scalable image dataset loading, embedding, classification (aesthetic, NSFW, and others), filtering, deduplication, and export. It uses state-of-the-art vision models (for example, CLIP and timm models) in a pipeline-based architecture for efficient GPU-accelerated processing. Modular pipelines enable rapid experimentation and integration with text and multimodal workflows.
Load and manage large-scale image datasets for curation workflows.
Data Processing: Embedding generation, classification (aesthetic, NSFW), filtering, and deduplication for images.
Data Export: Export, save, and reshard curated image datasets for downstream use.
Image Curation Quickstart: Set up your environment and install NeMo Curator’s image modules.
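The idea behind embedding-based image deduplication can be sketched in a few lines. This is an illustrative greedy, threshold-based approach and not necessarily the method NeMo Curator uses; the toy three-dimensional vectors stand in for real model embeddings (for example, from CLIP), and the file names, `semantic_dedup` function, and `threshold` value are assumptions.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def semantic_dedup(embeddings: dict, threshold: float = 0.95):
    """Greedily keep an image only if it is not too similar to any kept image."""
    kept = []
    for name, vec in embeddings.items():
        if all(cosine(vec, embeddings[k]) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy embeddings (hypothetical); real ones have hundreds of dimensions.
embeddings = {
    "cat_1.jpg": [1.0, 0.0, 0.0],
    "cat_1_copy.jpg": [0.99, 0.05, 0.0],  # near-duplicate of cat_1.jpg
    "dog_1.jpg": [0.0, 1.0, 0.0],
}
unique_images = semantic_dedup(embeddings)
print(unique_images)
```

The pairwise comparison is quadratic in the number of images, so at scale it is usually replaced by clustering or approximate nearest-neighbor search on the GPU.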
Audio Data Curation
NeMo Curator provides speech and audio curation capabilities designed for preparing high-quality speech datasets for ASR model training and multimodal applications. It leverages pretrained .nemo model checkpoints via the NeMo Framework for transcription, quality assessment through Word Error Rate (WER) calculation, and seamless integration with text curation workflows.
Load and manage audio datasets with manifests, file paths, and transcriptions for curation workflows.
ASR Processing: Automatic speech recognition inference, quality assessment, and transcription using NeMo Framework models.
Quality Assessment: Word Error Rate (WER) calculation, duration analysis, and quality-based filtering for speech data.
Audio Curation Quickstart: Set up your environment and run your first audio curation pipeline with NeMo Curator.
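Word Error Rate is the word-level edit distance between a reference transcript and the ASR hypothesis, divided by the number of reference words. The sketch below computes it with the standard library and applies a quality-based filter. It is not the NeMo Curator implementation; the field names `text` and `pred_text` and the `max_wer` threshold are assumptions for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def filter_by_wer(samples, max_wer: float = 0.5):
    """Keep samples whose hypothesis is close enough to the reference."""
    return [
        s for s in samples
        if word_error_rate(s["text"], s["pred_text"]) <= max_wer
    ]


samples = [
    {"text": "the cat sat on the mat", "pred_text": "the cat sit on mat"},
    {"text": "hello world", "pred_text": "goodbye moon now"},
]
wer_first = word_error_rate(samples[0]["text"], samples[0]["pred_text"])
kept = filter_by_wer(samples)
print(round(wer_first, 3), len(kept))
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, so a threshold above 1.0 does not mean "keep everything" in general.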
Video Data Curation
NeMo Curator provides distributed video curation pipelines, supporting scalable data flow, pipeline stages, and efficient processing for large video corpora. The architecture supports high-throughput, cloud-native, and on-prem deployments.
Distributed processing, Ray-based foundation, and autoscaling for video curation.
Key Abstractions: Stages, pipelines, and execution modes in video curation workflows.
Data Flow: How data moves through the system, from ingestion to output, for efficient large-scale video curation.
Video Curation Quickstart: Set up your environment and run your first video curation pipeline with NeMo Curator.
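The stage and pipeline abstractions can be illustrated with a small single-process sketch. The `Stage` and `Pipeline` classes below are hypothetical stand-ins written for this example, not the NeMo Curator classes, and the clip-splitting logic only manipulates metadata dictionaries rather than real video files. In a distributed setting each stage would run on its own pool of workers instead of in one generator chain.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class Stage:
    """One processing step: maps each input item to zero or more output items."""
    name: str
    fn: Callable[[dict], Iterable[dict]]

    def process(self, items: Iterable[dict]):
        for item in items:
            yield from self.fn(item)


@dataclass
class Pipeline:
    """An ordered list of stages; data streams lazily from one stage to the next."""
    stages: List[Stage] = field(default_factory=list)

    def add_stage(self, stage: Stage) -> "Pipeline":
        self.stages.append(stage)
        return self

    def run(self, items: Iterable[dict]) -> List[dict]:
        stream = iter(items)
        for stage in self.stages:
            stream = stage.process(stream)
        return list(stream)


def split_into_clips(video: dict, clip_len_s: float = 10.0):
    """Fan out: one video record becomes several fixed-length clip records."""
    start = 0.0
    while start < video["duration_s"]:
        end = min(start + clip_len_s, video["duration_s"])
        yield {"path": video["path"], "start_s": start, "end_s": end}
        start = end


def keep_full_length(clip: dict, min_len_s: float = 10.0):
    """Filter: drop clips shorter than the minimum length."""
    if clip["end_s"] - clip["start_s"] >= min_len_s:
        yield clip


pipeline = (
    Pipeline()
    .add_stage(Stage("split", split_into_clips))
    .add_stage(Stage("filter", keep_full_length))
)
videos = [
    {"path": "a.mp4", "duration_s": 25.0},
    {"path": "b.mp4", "duration_s": 8.0},
]
clips = pipeline.run(videos)
print(clips)
```

Modeling each stage as a function from one item to an iterable of items covers mapping, filtering, and fan-out with a single interface, which is what makes stages easy to reorder or swap.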
Deployment and Integration
NeMo Curator is designed for distributed, cloud-native, and on-premises deployments. It integrates easily with your existing MLOps pipelines. Modular APIs enable flexible orchestration and automation.
See the Admin Guide for deployment guidance and infrastructure recommendations.
Memory Management: Optimize memory usage and partitioning for large-scale curation workflows.
GPU Acceleration: Leverage NVIDIA GPUs for faster data processing and pipeline acceleration.
Resumable Processing: Continue interrupted operations and recover large dataset processing with checkpointing and batching.
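Resumable processing with checkpointing and batching generally follows one pattern: split the work into batches, record each finished batch durably, and skip recorded batches on restart. The sketch below shows that pattern with the standard library only. The JSON checkpoint format and the names `process_resumable`, `load_done`, and `save_done` are assumptions made for this example, not part of NeMo Curator.

```python
import json
import os
import tempfile


def batches(items, batch_size):
    """Yield (batch_index, batch) pairs over a list."""
    for start in range(0, len(items), batch_size):
        yield start // batch_size, items[start:start + batch_size]


def load_done(path):
    """Return the set of batch indices recorded as finished."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return set(json.load(f))


def save_done(path, done):
    """Write the checkpoint atomically so a crash cannot leave it half-written."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(sorted(done), f)
    os.replace(tmp_path, path)


def process_resumable(items, batch_size, checkpoint_path, process_batch):
    """Process only batches not yet in the checkpoint; return the indices run now."""
    done = load_done(checkpoint_path)
    processed_now = []
    for index, batch in batches(items, batch_size):
        if index in done:
            continue
        process_batch(batch)
        done.add(index)
        save_done(checkpoint_path, done)
        processed_now.append(index)
    return processed_now


with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint = os.path.join(tmp_dir, "checkpoint.json")
    items = list(range(10))
    calls = {"n": 0}

    def flaky(batch):
        # Simulate an interruption on the third batch of the first run.
        calls["n"] += 1
        if calls["n"] == 3:
            raise RuntimeError("simulated interruption")

    try:
        process_resumable(items, 4, checkpoint, flaky)
    except RuntimeError:
        pass
    # The second run skips the two batches already recorded as finished.
    resumed = process_resumable(items, 4, checkpoint, lambda batch: None)

print(resumed)
```

Because the checkpoint is written after each batch, an interruption costs at most one batch of repeated work, so `process_batch` should be idempotent.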