About Text Curation
NeMo Curator provides text curation capabilities for preparing high-quality data for large language model (LLM) training. The toolkit includes a collection of processors for loading, filtering, formatting, and analyzing text data from various sources using a pipeline-based architecture.
Use Cases
- Clean and prepare web-scraped data from sources like Common Crawl, Wikipedia, and arXiv
- Create custom text curation pipelines for specific domain needs
- Scale text processing across CPU and GPU clusters efficiently
Architecture
The following diagram provides a high-level outline of NeMo Curator’s text curation architecture.
flowchart LR
A["Data Sources<br/>(Cloud, Local,<br/>Common Crawl, arXiv,<br/>Wikipedia)"] --> B["Data Acquisition<br/>& Loading"]
B --> C["Content Processing<br/>& Cleaning"]
C --> D["Quality Assessment<br/>& Filtering"]
D --> E["Deduplication<br/>(Exact, Fuzzy,<br/>Semantic)"]
E --> F["Curated Dataset<br/>(JSONL/Parquet)"]
G["Ray + RAPIDS<br/>(GPU-accelerated)"] -.->|"Distributed Execution"| B
G -.->|"Distributed Execution"| C
G -.->|"GPU Acceleration"| D
G -.->|"GPU Acceleration"| E
classDef stage fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#000
classDef infra fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000
classDef output fill:#e8f5e9,stroke:#2e7d32,stroke-width:3px,color:#000
class A,B,C,D,E stage
class F output
class G infra
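The flow in the diagram can be sketched in plain Python to show how stages chain together. This is a minimal illustration only: it does not use NeMo Curator's API, and the stage names (`clean`, `quality_filter`, `exact_dedup`), the `run_pipeline` helper, and the record schema with a `text` field are all assumptions made for the example.

```python
from typing import Callable, Iterable, Iterator

# Hypothetical record shape: a dict with at least a "text" field.
Document = dict
Stage = Callable[[Iterable[Document]], Iterator[Document]]


def clean(docs: Iterable[Document]) -> Iterator[Document]:
    """Content processing: collapse runs of whitespace."""
    for doc in docs:
        yield {**doc, "text": " ".join(doc["text"].split())}


def quality_filter(docs: Iterable[Document], min_words: int = 3) -> Iterator[Document]:
    """Quality filtering: drop documents with fewer than min_words words."""
    for doc in docs:
        if len(doc["text"].split()) >= min_words:
            yield doc


def exact_dedup(docs: Iterable[Document]) -> Iterator[Document]:
    """Exact deduplication: keep the first document for each distinct text."""
    seen = set()
    for doc in docs:
        if doc["text"] not in seen:
            seen.add(doc["text"])
            yield doc


def run_pipeline(docs: Iterable[Document], stages: list[Stage]) -> list[Document]:
    """Feed the output of each stage into the next, in order."""
    for stage in stages:
        docs = stage(docs)
    return list(docs)


raw = [
    {"id": 1, "text": "NeMo   Curator prepares text data."},
    {"id": 2, "text": "too short"},
    {"id": 3, "text": "NeMo Curator prepares text data."},
]
curated = run_pipeline(raw, [clean, quality_filter, exact_dedup])
print(curated)
```

Each stage here is a generator, so documents stream through the chain one at a time. A distributed framework would instead partition the data across workers, which this single-process sketch does not attempt to model.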
Introduction
Master the fundamentals of NeMo Curator and set up your text processing environment.
- Learn about pipeline architecture and core processing stages for efficient text curation. Tags: data-structures, distributed, architecture.
- Get Started: Learn prerequisites, setup instructions, and initial configuration for text curation. Tags: setup, configuration, quickstart.
Curation Tasks
Download Data
Download text data from remote sources and import existing datasets into NeMo Curator’s processing pipeline.
- Read existing JSONL and Parquet datasets using Curator’s reader stages. Tags: jsonl, parquet.
- arXiv: Download and extract scientific papers from arXiv. Tags: academic, pdf, latex.
- Common Crawl: Download and extract web archive data from Common Crawl. Tags: web-data, warc, distributed.
- Wikipedia: Download and extract Wikipedia articles from Wikipedia dumps. Tags: articles, multilingual, dumps.
- Custom Data Sources: Implement a download and extract pipeline for a custom data source. Tags: jsonl, parquet, custom-formats.
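For readers unfamiliar with JSONL, the format stores one JSON object per line. The following standard-library sketch shows what reading such a file involves. It is not Curator's reader stage: the `read_jsonl` helper and the `id` and `text` field names are hypothetical and chosen only for illustration.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed record per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


# Write a small sample file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.jsonl"
    sample.write_text(
        '{"id": 1, "text": "first document"}\n'
        "\n"
        '{"id": 2, "text": "second document"}\n',
        encoding="utf-8",
    )
    records = list(read_jsonl(sample))

print(len(records))
```

Because each line is parsed independently, a large JSONL file can be split by line and handed to separate workers, which is one reason the format suits distributed processing.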
Process Data
Transform and enhance your text data through comprehensive processing and curation steps.
- Handle multilingual content and language-specific processing. Tags: language-detection, stopwords, multilingual.
- Content Processing & Cleaning: Clean, normalize, and transform text content. Tags: cleaning, normalization, formatting.
- Deduplication: Remove duplicate and near-duplicate documents efficiently. Tags: fuzzy-dedup, semantic-dedup, exact-dedup.
- Quality Assessment & Filtering: Score and remove low-quality content. Tags: heuristics, classifiers, quality-scoring.
- Specialized Processing: Domain-specific processing for code and advanced curation tasks. Tags: code-processing.
- Synthetic Data Generation: Generate and augment training data using LLMs. Tags: llm, augmentation, multilingual, nemotron-cc.
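To make the idea of near-duplicate removal concrete, the sketch below compares documents by the Jaccard similarity of their word 3-gram sets and keeps a document only if it is not too similar to one already kept. This is a simple pairwise technique chosen for clarity, not the method NeMo Curator implements. The helper names and the 0.5 threshold are assumptions for the example.

```python
def shingles(text, n=3):
    """Return the set of word n-grams in text (lowercased)."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(texts, threshold=0.5):
    """Keep a text only if its similarity to every kept text is below threshold."""
    kept = []
    kept_shingles = []
    for text in texts:
        current = shingles(text)
        if all(jaccard(current, previous) < threshold for previous in kept_shingles):
            kept.append(text)
            kept_shingles.append(current)
    return kept


docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy cat",
    "an entirely different sentence about data curation",
]
unique = drop_near_duplicates(docs)
print(unique)
```

The pairwise comparison costs quadratic time in the number of kept documents, so it only suits small collections. Large-scale systems typically approximate the same similarity with hashing-based techniques to avoid comparing every pair.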