Text Data Curation Pipeline

This guide provides a comprehensive overview of NeMo Curator’s text curation pipeline architecture, from data acquisition through final dataset preparation.

Architecture Overview

The following diagram provides a high-level outline of NeMo Curator’s text curation architecture:

flowchart LR
    A["Data Sources<br/>(Cloud, Local,<br/>Common Crawl, arXiv,<br/>Wikipedia)"] --> B["Data Acquisition<br/>& Loading"]
    B --> C["Content Processing<br/>& Cleaning"]
    C --> D["Quality Assessment<br/>& Filtering"]
    D --> E["Deduplication<br/>(Exact, Fuzzy,<br/>Semantic)"]
    E --> F["Curated Dataset<br/>(JSONL/Parquet)"]

    G["Ray + RAPIDS<br/>(GPU-accelerated)"] -.->|"Distributed Execution"| B
    G -.->|"Distributed Execution"| C
    G -.->|"GPU Acceleration"| D
    G -.->|"GPU Acceleration"| E

    classDef stage fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#000
    classDef infra fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000
    classDef output fill:#e8f5e9,stroke:#2e7d32,stroke-width:3px,color:#000

    class A,B,C,D,E stage
    class F output
    class G infra

Pipeline Stages

NeMo Curator’s text curation pipeline consists of five stages that transform raw data sources into high-quality datasets ready for LLM training:

1. Data Sources

Multiple input sources provide the foundation for text curation:

Cloud storage: Amazon S3, Azure
Local workstation: JSONL, Parquet

2. Data Acquisition & Processing

Raw data is downloaded, extracted, and converted into standardized formats:

Download & Extraction: Retrieve and process remote data sources
Cleaning & Pre-processing: Convert formats and normalize text
DocumentBatch Creation: Standardize data into NeMo Curator’s core data structure
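
The cleaning and batching steps above can be sketched in plain Python. This is a minimal illustration, not NeMo Curator's implementation: `normalize_text` and `make_batches` are hypothetical helper names, and a list of dictionaries stands in for the `DocumentBatch` structure, whose actual fields are not described here.

```python
import re
import unicodedata
from typing import Dict, Iterable, Iterator, List

_WHITESPACE = re.compile(r"\s+")


def normalize_text(text: str) -> str:
    """Apply Unicode NFKC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return _WHITESPACE.sub(" ", text).strip()


def make_batches(texts: Iterable[str], batch_size: int) -> Iterator[List[Dict[str, object]]]:
    """Group normalized documents into fixed-size batches.

    Each batch is a list of {"id", "text"} records; this layout is an
    assumption made for illustration only.
    """
    batch: List[Dict[str, object]] = []
    for doc_id, raw in enumerate(texts):
        batch.append({"id": doc_id, "text": normalize_text(raw)})
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


if __name__ == "__main__":
    for batch in make_batches(["  Hello\u00a0 world\n", "second   doc", "third"], batch_size=2):
        print(batch)
```

Working in batches rather than single documents keeps per-item overhead low when the work is later distributed across workers.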

3. Quality Assessment & Filtering

Multiple filtering stages ensure data quality:

Heuristic Quality Filtering: Rule-based filters for basic quality checks
Model-based Quality Filtering: Classification models trained to distinguish high-quality from low-quality text
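
As a rough illustration of a rule-based check, the sketch below keeps a document only if it has enough words and is not dominated by symbols. The function name `passes_heuristics` and both thresholds are assumptions chosen for the example; they are not NeMo Curator defaults.

```python
def passes_heuristics(text: str, min_words: int = 5, max_symbol_ratio: float = 0.3) -> bool:
    """Return True if the text passes two simple rule-based quality checks.

    The thresholds are illustrative assumptions, not tuned values.
    """
    words = text.split()
    if len(words) < min_words:
        return False  # too short to be useful
    # Count characters that are neither alphanumeric nor whitespace.
    symbols = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    return symbols / len(text) <= max_symbol_ratio


if __name__ == "__main__":
    docs = [
        "The quick brown fox jumps over the lazy dog.",
        "$$$ ### !!!",
        "a b c d e !!!!!!!!!!!!!!!!!!!!",
    ]
    kept = [d for d in docs if passes_heuristics(d)]
    print(kept)
```

Checks like these are cheap to run on CPU, so they are typically applied before any model-based classifier to reduce the volume the model has to score.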

4. Deduplication

Remove duplicate and near-duplicate content:

Exact Deduplication: Remove identical documents using MD5 hashing
Fuzzy Deduplication: Remove near-duplicates using MinHash signatures and locality-sensitive hashing (LSH)
Semantic Deduplication: Remove semantically similar content using embeddings
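
The first two techniques can be sketched with the standard library alone. This is a simplified, single-process illustration under stated assumptions, not NeMo Curator's GPU implementation: the function names are hypothetical, the shingle size and number of hash functions are arbitrary, and the LSH banding step that makes candidate search scalable is omitted.

```python
import hashlib
from typing import List, Set


def exact_dedup(docs: List[str]) -> List[str]:
    """Keep the first occurrence of each document, keyed by its MD5 digest."""
    seen: Set[str] = set()
    kept: List[str] = []
    for doc in docs:
        digest = hashlib.md5(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def shingles(text: str, k: int = 3) -> Set[str]:
    """Return the set of k-word shingles of a text (k is an assumed value)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(shingle_set: Set[str], num_hashes: int = 64) -> List[int]:
    """For each seeded hash function, record the minimum hash over the set."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode("utf-8")).digest()[:8], "big")
            for s in shingle_set
        ))
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Estimate Jaccard similarity as the fraction of matching positions."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


if __name__ == "__main__":
    docs = ["same text", "other text", "same text"]
    print(exact_dedup(docs))
    a = minhash_signature(shingles("the quick brown fox jumps over the lazy dog"))
    b = minhash_signature(shingles("the quick brown fox jumps over the lazy cat"))
    print(estimated_jaccard(a, b))
```

Exact deduplication only catches byte-identical text, which is why the fuzzy stage compares signatures instead: two documents that share most of their shingles agree in most signature positions.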

5. Final Preparation

Prepare the curated dataset for training:

Format Standardization: Ensure consistent output format
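
For the JSONL output format named in the architecture diagram, a consistent layout means one JSON object per line. The sketch below uses only the standard library; `write_jsonl` and `read_jsonl` are hypothetical helper names, and the `id` and `text` fields are assumed for the example.

```python
import io
import json
from typing import Dict, Iterable, List, TextIO


def write_jsonl(records: Iterable[Dict[str, object]], fp: TextIO) -> None:
    """Write one JSON object per line, keeping non-ASCII text readable."""
    for record in records:
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(fp: TextIO) -> List[Dict[str, object]]:
    """Read records back, skipping blank lines."""
    return [json.loads(line) for line in fp if line.strip()]


if __name__ == "__main__":
    buffer = io.StringIO()  # stands in for an output file
    write_jsonl([{"id": 0, "text": "first document"}, {"id": 1, "text": "café"}], buffer)
    buffer.seek(0)
    print(read_jsonl(buffer))
```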

Infrastructure Foundation

The entire pipeline runs on the following infrastructure components:

Ray: Distributed computing framework for parallelization
RAPIDS: GPU-accelerated data processing (cuDF, cuGraph, cuML)
Flexible Deployment: Support for both CPU-only and GPU-accelerated execution

Key Components

The pipeline leverages several core component types:

Data Loading

Core concepts for loading and managing text datasets from local files

Data Acquisition

Components for downloading and extracting data from remote sources

Data Processing

Concepts for filtering, deduplication, and classification

Processing Modes

The pipeline supports different processing approaches:

GPU Acceleration: Leverage NVIDIA GPUs for:

High-throughput data processing
ML model inference for classification
Embedding generation for semantic operations

CPU Processing: Scale across multiple CPU cores for:

Text parsing and cleaning
Rule-based filtering
Large-scale data transformations

Hybrid Workflows: Combine CPU and GPU processing for optimal performance based on the specific operation.

Scalability & Deployment

The architecture scales from single machines to large clusters:

Single Node: Process datasets on laptops or workstations
Multi-Node: Distribute processing across cluster resources
Cloud Native: Deploy on cloud platforms
HPC Integration: Run on high-performance computing (HPC) clusters

For hands-on experience, refer to the Text Curation Getting Started Guide.