
About Text Curation

NeMo Curator provides comprehensive text curation capabilities to prepare high-quality data for large language model (LLM) training. The toolkit includes a collection of processors for loading, filtering, formatting, and analyzing text data from various sources, all built on a pipeline-based architecture.

Use Cases

  • Clean and prepare web-scraped data from sources like Common Crawl, Wikipedia, and arXiv
  • Create custom text curation pipelines for specific domain needs
  • Scale text processing across CPU and GPU clusters efficiently

Architecture

The following diagram provides a high-level outline of NeMo Curator’s text curation architecture.

flowchart LR
    A["Data Sources<br/>(Cloud, Local,<br/>Common Crawl, arXiv,<br/>Wikipedia)"] --> B["Data Acquisition<br/>& Loading"]
    B --> C["Content Processing<br/>& Cleaning"]
    C --> D["Quality Assessment<br/>& Filtering"]
    D --> E["Deduplication<br/>(Exact, Fuzzy,<br/>Semantic)"]
    E --> F["Curated Dataset<br/>(JSONL/Parquet)"]

    G["Ray + RAPIDS<br/>(GPU-accelerated)"] -.->|"Distributed Execution"| B
    G -.->|"Distributed Execution"| C
    G -.->|"GPU Acceleration"| D
    G -.->|"GPU Acceleration"| E

    classDef stage fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#000
    classDef infra fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000
    classDef output fill:#e8f5e9,stroke:#2e7d32,stroke-width:3px,color:#000

    class A,B,C,D,E stage
    class F output
    class G infra
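
The stages in the diagram can be illustrated with a minimal plain-Python sketch. This is not NeMo Curator's API: the function names (`clean`, `filter_min_words`, `exact_dedup`, `run_pipeline`, `to_jsonl`), the `min_words` threshold, and the sample documents are all hypothetical and chosen only for illustration. It runs on a single process and covers only whitespace cleaning, a simple word-count quality filter, and exact deduplication by content hash. Fuzzy and semantic deduplication, and distributed or GPU execution, are out of scope here.

```python
import hashlib
import json
from typing import Callable, Iterable, Iterator, List

# Each document is a dict with at least a "text" field (an assumption for this sketch).
Doc = dict
Stage = Callable[[Iterable[Doc]], Iterator[Doc]]


def clean(docs: Iterable[Doc]) -> Iterator[Doc]:
    """Content processing: collapse runs of whitespace into single spaces."""
    for doc in docs:
        yield {**doc, "text": " ".join(doc["text"].split())}


def filter_min_words(docs: Iterable[Doc], min_words: int = 5) -> Iterator[Doc]:
    """Quality filtering: keep documents with at least `min_words` words."""
    for doc in docs:
        if len(doc["text"].split()) >= min_words:
            yield doc


def exact_dedup(docs: Iterable[Doc]) -> Iterator[Doc]:
    """Exact deduplication: keep the first document for each distinct text hash."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


def run_pipeline(docs: Iterable[Doc], stages: List[Stage]) -> List[Doc]:
    """Chain the stages in order; each stage consumes the previous stage's output."""
    for stage in stages:
        docs = stage(docs)
    return list(docs)


def to_jsonl(docs: Iterable[Doc]) -> str:
    """Serialize the curated documents as JSONL, one JSON object per line."""
    return "\n".join(json.dumps(doc, ensure_ascii=False) for doc in docs)


# Hypothetical sample input standing in for loaded data.
raw = [
    {"id": 1, "text": "NeMo   Curator prepares text data for LLM training."},
    {"id": 2, "text": "NeMo Curator prepares text data for LLM training."},
    {"id": 3, "text": "Too short."},
    {"id": 4, "text": "Deduplication removes repeated documents from a corpus."},
]

curated = run_pipeline(raw, [clean, filter_min_words, exact_dedup])
print(to_jsonl(curated))
```

Cleaning runs before deduplication in this sketch so that documents differing only in whitespace hash to the same value. Running the filter before deduplication also keeps the set of seen hashes smaller.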

Introduction

Master the fundamentals of NeMo Curator and set up your text processing environment.

Curation Tasks

Download Data

Download text data from remote sources and import existing datasets into NeMo Curator’s processing pipeline.

Process Data

Transform and enhance your text data through comprehensive processing and curation steps.