
Text Data Curation Pipeline

This guide provides a comprehensive overview of NeMo Curator’s text curation pipeline architecture, from data acquisition through final dataset preparation.

Architecture Overview

The following diagram provides a high-level outline of NeMo Curator’s text curation architecture:

flowchart LR
    A["Data Sources<br/>(Cloud, Local,<br/>Common Crawl, arXiv,<br/>Wikipedia)"] --> B["Data Acquisition<br/>& Loading"]
    B --> C["Content Processing<br/>& Cleaning"]
    C --> D["Quality Assessment<br/>& Filtering"]
    D --> E["Deduplication<br/>(Exact, Fuzzy,<br/>Semantic)"]
    E --> F["Curated Dataset<br/>(JSONL/Parquet)"]

    G["Ray + RAPIDS<br/>(GPU-accelerated)"] -.->|"Distributed Execution"| B
    G -.->|"Distributed Execution"| C
    G -.->|"GPU Acceleration"| D
    G -.->|"GPU Acceleration"| E

    classDef stage fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#000
    classDef infra fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000
    classDef output fill:#e8f5e9,stroke:#2e7d32,stroke-width:3px,color:#000

    class A,B,C,D,E stage
    class F output
    class G infra

Pipeline Stages

NeMo Curator’s text curation pipeline consists of five stages that transform raw data sources into high-quality datasets ready for LLM training:

1. Data Sources

Multiple input sources provide the foundation for text curation:

  • Cloud storage: Amazon S3, Azure
  • Local workstation: JSONL, Parquet

2. Data Acquisition & Processing

Raw data is downloaded, extracted, and converted into standardized formats:

  • Download & Extraction: Retrieve and process remote data sources
  • Cleaning & Pre-processing: Convert formats and normalize text
  • DocumentBatch Creation: Standardize data into NeMo Curator’s core data structure
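The normalization and batching steps above can be sketched in plain Python. This is a minimal illustration and does not use NeMo Curator's API: the `Batch` class, `normalize_text`, `to_batches`, and the `id`/`text` field names are hypothetical stand-ins, and the real `DocumentBatch` structure may differ.

```python
import unicodedata
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Batch:
    """Hypothetical stand-in for a batch of documents (not NeMo Curator's DocumentBatch)."""
    records: List[dict]


def normalize_text(text: str) -> str:
    # NFC normalization makes canonically equivalent strings identical;
    # runs of whitespace are then collapsed to single spaces.
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def to_batches(raw_texts: Iterable[str], batch_size: int = 2) -> Iterator[Batch]:
    # Group normalized documents into fixed-size batches; the last batch may be smaller.
    records: List[dict] = []
    for i, raw in enumerate(raw_texts):
        records.append({"id": i, "text": normalize_text(raw)})
        if len(records) == batch_size:
            yield Batch(records)
            records = []
    if records:
        yield Batch(records)
```

Grouping documents into batches lets downstream stages operate on many records at once, which is what makes distributed and GPU execution worthwhile.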

3. Quality Assessment & Filtering

Multiple filtering stages ensure data quality:

  • Heuristic Quality Filtering: Rule-based filters for basic quality checks
  • Model-based Quality Filtering: Classification models trained to distinguish high-quality from low-quality text
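As an illustration of the rule-based approach, the following sketch applies two simple heuristics: a word-count range and a minimum share of alphabetic characters. The function names and thresholds are assumptions chosen for the example, not NeMo Curator's built-in filters or defaults.

```python
def word_count_ok(text: str, min_words: int = 5, max_words: int = 100_000) -> bool:
    # Reject documents that are too short or too long to be useful.
    n = len(text.split())
    return min_words <= n <= max_words


def alpha_ratio_ok(text: str, min_ratio: float = 0.6) -> bool:
    # Fraction of non-whitespace characters that are alphabetic.
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    return sum(c.isalpha() for c in chars) / len(chars) >= min_ratio


def passes_heuristics(text: str) -> bool:
    # A document is kept only if every rule accepts it.
    return word_count_ok(text) and alpha_ratio_ok(text)
```

Heuristics like these are cheap to evaluate on CPU, so they are typically run before more expensive model-based filters.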

4. Deduplication

Remove duplicate and near-duplicate content:

  • Exact Deduplication: Remove identical documents using MD5 hashing
  • Fuzzy Deduplication: Remove near-duplicates using MinHash signatures and locality-sensitive hashing (LSH)
  • Semantic Deduplication: Remove semantically similar content using embeddings
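The first two techniques can be sketched with the Python standard library. This is a single-process illustration of the ideas only, not NeMo Curator's GPU-accelerated implementation; the function names, the word-level 3-shingles, the 64-hash signature, and the 16-band split are all assumptions made for the example.

```python
import hashlib
from typing import List, Set, Tuple


def exact_dedup(docs: List[str]) -> List[str]:
    # Keep the first document seen for each MD5 digest.
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.md5(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def shingles(text: str, k: int = 3) -> Set[str]:
    # Overlapping word k-grams; short texts become a single shingle.
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(text: str, num_hashes: int = 64) -> List[int]:
    # One salted 64-bit hash function per signature slot; keep the minimum value.
    items = [s.encode("utf-8") for s in shingles(text)]
    sig = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(hashlib.blake2b(s, digest_size=8, salt=salt).digest(), "big")
            for s in items
        ))
    return sig


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    # The share of matching slots estimates the Jaccard similarity of the shingle sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_buckets(sig: List[int], bands: int = 16) -> Set[Tuple[int, Tuple[int, ...]]]:
    # Split the signature into bands; documents sharing any band are candidate pairs.
    rows = len(sig) // bands
    return {(b, tuple(sig[b * rows:(b + 1) * rows])) for b in range(bands)}
```

Banding is what avoids comparing every pair of documents: only documents that land in the same bucket for at least one band need a full similarity check.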

5. Final Preparation

Prepare the curated dataset for training:

  • Format Standardization: Ensure consistent output format
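As one way to picture format standardization, the sketch below writes every record as a JSONL line with the same fixed set of fields. The `write_jsonl` helper and the `id`/`text` field names are hypothetical and are not part of NeMo Curator.

```python
import json
from typing import IO, Iterable, Sequence


def write_jsonl(records: Iterable[dict], fp: IO[str],
                fields: Sequence[str] = ("id", "text")) -> None:
    # Emit one JSON object per line, keeping only the agreed fields in a fixed order.
    # Fields missing from a record are written as null.
    for rec in records:
        row = {name: rec.get(name) for name in fields}
        fp.write(json.dumps(row, ensure_ascii=False) + "\n")
```

Restricting output to a fixed field list keeps the schema identical across files, which simplifies loading the dataset for training.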

Infrastructure Foundation

The pipeline runs on the following infrastructure:

  • Ray: Distributed computing framework for parallelization
  • RAPIDS: GPU-accelerated data processing (cuDF, cuGraph, cuML)
  • Flexible Deployment: Support for both CPU and GPU execution

Key Components

The pipeline leverages several core component types:

Processing Modes

The pipeline supports different processing approaches:

GPU Acceleration: Leverage NVIDIA GPUs for:

  • High-throughput data processing
  • ML model inference for classification
  • Embedding generation for semantic operations

CPU Processing: Scale across multiple CPU cores for:

  • Text parsing and cleaning
  • Rule-based filtering
  • Large-scale data transformations

Hybrid Workflows: Combine CPU and GPU processing for optimal performance based on the specific operation.

Scalability & Deployment

The architecture scales from single machines to large clusters:

  • Single Node: Process datasets on laptops or workstations
  • Multi-Node: Distribute processing across cluster resources
  • Cloud Native: Deploy on cloud platforms
  • HPC Integration: Run on HPC clusters

For hands-on experience, refer to the Text Curation Getting Started Guide.