Common Crawl
Download and extract text from Common Crawl snapshots using Curator.
Common Crawl provides petabytes of web data collected over years of web crawling. The data uses a compressed web archive format (.warc.gz), which requires processing to extract useful text for language model training.
How it Works
Curator’s Common Crawl processing pipeline consists of four sequential stages:
- URL Generation: Generates WARC file URLs from Common Crawl’s index for the specified snapshot range
- Download: Downloads the compressed WARC files from Common Crawl’s servers (optionally using S3 for faster downloads)
- Iteration: Extracts individual records from WARC files and decodes HTML content
- Extraction: Performs language detection and extracts clean text using configurable HTML extraction algorithms
The pipeline outputs structured data that you can write to JSONL or Parquet files for further processing.
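To make the URL Generation stage more concrete, here is a minimal sketch of how a CC-MAIN snapshot name can be mapped to WARC file URLs. It assumes the public layout in which each snapshot publishes a gzipped `warc.paths.gz` listing with one relative path per line. The function names `warc_paths_index_url` and `warc_urls_from_listing` are hypothetical and are not part of Curator's API; Curator's own implementation may differ.

```python
import gzip
import io
from typing import List, Optional

CC_BASE_URL = "https://data.commoncrawl.org"


def warc_paths_index_url(snapshot: str) -> str:
    """Return the URL of the gzipped WARC path listing for a CC-MAIN snapshot."""
    return f"{CC_BASE_URL}/crawl-data/CC-MAIN-{snapshot}/warc.paths.gz"


def warc_urls_from_listing(listing_gz: bytes, url_limit: Optional[int] = None) -> List[str]:
    """Turn a gzipped listing (one relative path per line) into full HTTPS URLs."""
    with gzip.open(io.BytesIO(listing_gz), "rt", encoding="utf-8") as fh:
        paths = [line.strip() for line in fh if line.strip()]
    if url_limit is not None:
        # Mirrors the idea behind url_limit: keep only the first N WARC files.
        paths = paths[:url_limit]
    return [f"{CC_BASE_URL}/{path}" for path in paths]


print(warc_paths_index_url("2020-50"))
```

The listing bytes would normally come from an HTTP GET on the index URL; the sketch takes them as an argument so it can be exercised without network access.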
WARC Record Reader
For pipelines that already have WARC metadata (such as warc_filename, warc_record_offset, and warc_record_length columns from a CC Index lookup), use CommonCrawlWARCReader to fetch individual WARC records directly via byte-range requests, without downloading full WARC files.
CommonCrawlWARCReader supports two transport modes:
- HTTPS (default): Fetches records from data.commoncrawl.org using the requests library. No AWS credentials required.
- S3: Fetches records from the commoncrawl S3 bucket using boto3 range requests. Activate with use_s3=True or by setting the CC_USE_S3=1 environment variable. Credentials are resolved through boto3’s standard chain (environment variables, ~/.aws/config, instance profiles).
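The byte-range idea behind the HTTPS transport can be sketched as follows. This is an illustration only: it uses the standard library's `urllib` instead of `requests` to stay dependency-free, the helper names `range_header` and `fetch_record` are hypothetical, and it assumes each WARC record is stored as its own gzip member so that the fetched slice decompresses on its own.

```python
import gzip
import urllib.request
from typing import Dict

CC_HTTPS_BASE = "https://data.commoncrawl.org/"


def range_header(offset: int, length: int) -> Dict[str, str]:
    """Build an HTTP Range header; the range is inclusive on both ends."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


def fetch_record(warc_filename: str, offset: int, length: int, timeout: int = 30) -> bytes:
    """Fetch one record's bytes from a WARC file and return them decompressed."""
    request = urllib.request.Request(
        CC_HTTPS_BASE + warc_filename,
        headers=range_header(offset, length),
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return gzip.decompress(response.read())


print(range_header(1000, 500))
```

Only `range_header` runs in this sketch; `fetch_record` needs network access and real values from the warc_filename, warc_record_offset, and warc_record_length columns.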
Before You Start
Choose your download method and ensure you have the prerequisites:
- HTTPS downloads (default): No AWS account required.
- S3 downloads (set use_aws_to_download=True):
  - An AWS account with credentials configured (profile, environment, or instance role).
  - Common Crawl’s S3 access uses Requester Pays; you incur charges for requests and data transfer.
  - s5cmd installed for fast S3 listing and copy operations:
# Install s5cmd for faster S3 downloads
pip install s5cmd
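Before enabling S3 downloads, it can help to confirm that the s5cmd executable is actually on your PATH. The following is a small preflight sketch; the helper name `s5cmd_available` is hypothetical, and the check covers only the binary, not AWS credentials.

```python
import shutil


def s5cmd_available() -> bool:
    """Return True when an s5cmd executable can be found on PATH."""
    return shutil.which("s5cmd") is not None


# Fall back to HTTPS downloads when s5cmd is missing.
use_aws_to_download = s5cmd_available()
print(f"use_aws_to_download={use_aws_to_download}")
```

The resulting flag could then be passed as use_aws_to_download when constructing the stage.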
Usage
Here’s how to create and run a Common Crawl processing pipeline:
from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.download import CommonCrawlDownloadExtractStage
from nemo_curator.stages.text.io.writer import JsonlWriter


def main():
    # Initialize Ray client
    ray_client = RayClient()
    ray_client.start()

    # Create pipeline
    pipeline = Pipeline(
        name="common_crawl_pipeline",
        description="Download and process Common Crawl data",
    )

    # Add Common Crawl processing stage
    cc_stage = CommonCrawlDownloadExtractStage(
        start_snapshot="2020-50",  # YYYY-WW format for CC-MAIN
        end_snapshot="2020-50",
        download_dir="./cc_downloads",
        crawl_type="main",  # or "news"
        use_aws_to_download=True,  # Faster S3 downloads (requires s5cmd)
        url_limit=10,  # Limit number of WARC files for testing
        record_limit=1000,  # Limit records per WARC file
    )
    pipeline.add_stage(cc_stage)

    # Add output writer stage
    writer = JsonlWriter("./cc_output")
    pipeline.add_stage(writer)

    # Run pipeline
    results = pipeline.run()

    # Stop Ray client
    ray_client.stop()


if __name__ == "__main__":
    main()
For executor options and configuration, refer to Execution Backends.
Writing to Parquet
To write to Parquet files instead of JSONL, use ParquetWriter:
from nemo_curator.stages.text.io.writer import ParquetWriter
# Replace the JSONL writer with ParquetWriter
writer = ParquetWriter("./cc_output_parquet")
pipeline.add_stage(writer)
Parameters
| Parameter | Type | Description | Default |
|---|---|---|---|
| start_snapshot | str | First snapshot to include (format: "YYYY-WW" for main, "YYYY-MM" for news). Not every year and week has a snapshot; refer to the official list at https://data.commoncrawl.org/. | Required |
| end_snapshot | str | Last snapshot to include (same format as start_snapshot). Ensure your range includes at least one valid snapshot. | Required |
| download_dir | str | Directory to store downloaded WARC files | Required |
| crawl_type | Literal["main", "news"] | Whether to use the CC-MAIN or CC-NEWS dataset | "main" |
| html_extraction | HTMLExtractorAlgorithm \| str \| None | Text extraction algorithm to use. Defaults to JusTextExtractor() if not specified. | JusTextExtractor() |
| html_extraction_kwargs | dict \| None | Additional arguments for the HTML extractor. Ignored when html_extraction is a concrete extractor object (for example, JusTextExtractor()); pass kwargs to the extractor constructor instead. When html_extraction is a string ("justext", "resiliparse", or "trafilatura"), kwargs are forwarded. | None |
| stop_lists | dict[str, frozenset[str]] \| None | Language-specific stop words for text quality assessment. If not provided, Curator uses jusText defaults with additional support for Thai, Chinese, and Japanese. | None |
| use_aws_to_download | bool | Use S3 downloads via s5cmd instead of HTTPS (requires s5cmd installation) | False |
| verbose | bool | Enable verbose logging for download operations | False |
| url_limit | int \| None | Maximum number of WARC files to download (useful for testing) | None |
| record_limit | int \| None | Maximum number of records to extract per WARC file | None |
| add_filename_column | bool \| str | Whether to add a source filename column to the output; if str, uses it as the column name (default name: "file_name") | True |
| extractor_max_calls_per_worker | int \| None | Restart Ray Data worker processes after this many tasks to mitigate memory fragmentation. Auto-set to 2 for JusTextExtractor. | Auto (2 for jusText, None otherwise) |
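Because start_snapshot and end_snapshot use different formats for the two crawl types, a small validation helper can catch mistakes before any download starts. The sketch below is an assumption-laden illustration: `parse_snapshot` is a hypothetical helper, not a Curator function, and it checks only the shape of the string (week 1 to 53 for main, month 1 to 12 for news), not whether a snapshot actually exists for that week or month.

```python
import re
from typing import Tuple


def parse_snapshot(snapshot: str, crawl_type: str = "main") -> Tuple[int, int]:
    """Parse "YYYY-WW" (main) or "YYYY-MM" (news) into (year, week_or_month)."""
    if crawl_type not in ("main", "news"):
        raise ValueError(f"unknown crawl_type: {crawl_type!r}")
    match = re.fullmatch(r"(\d{4})-(\d{2})", snapshot)
    if match is None:
        raise ValueError(f"expected YYYY-WW or YYYY-MM, got {snapshot!r}")
    year, part = int(match.group(1)), int(match.group(2))
    # ISO weeks run from 1 to 53; months run from 1 to 12.
    upper = 53 if crawl_type == "main" else 12
    if not 1 <= part <= upper:
        raise ValueError(f"{snapshot!r} is out of range for crawl_type={crawl_type!r}")
    return year, part


start = parse_snapshot("2020-50", "main")
end = parse_snapshot("2020-50", "main")
# Tuples compare year first, then week or month.
if start > end:
    raise ValueError("start_snapshot must not be later than end_snapshot")
print(start, end)
```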
Output Format
The pipeline produces one structured record per extracted document. Each record includes the following fields:
{
  "url": "http://example.com/page.html",
  "warc_id": "a515a7b6-b6ec-4bed-998b-8be2f86f8eac",
  "source_id": "CC-MAIN-20201123153826-20201123183826-00000.warc.gz",
  "language": "ENGLISH",
  "text": "Extracted web page content..."
}
| Field | Description |
|---|---|
| url | Original URL of the web page |
| warc_id | Unique identifier for the WARC record |
| source_id | Name of the source WARC file |
| language | Detected language of the content (e.g., “ENGLISH”, “SPANISH”) |
| text | Extracted and cleaned text content |
When add_filename_column is enabled (the default), the output includes an extra field file_name (or your custom column name).
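As a quick way to inspect the output, the sketch below reads JSONL records back with the standard library and counts documents per detected language. It assumes the writer produces files ending in `.jsonl` directly under the output directory; that naming, and the helper names `iter_records` and `language_counts`, are assumptions for illustration. The demo runs on a temporary directory with two hand-written records instead of real pipeline output.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def iter_records(output_dir):
    """Yield one dict per non-empty line from every .jsonl file in output_dir."""
    for path in sorted(Path(output_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    yield json.loads(line)


def language_counts(output_dir) -> Counter:
    """Count records per value of the "language" field."""
    return Counter(record["language"] for record in iter_records(output_dir))


# Demo on a throwaway directory with two hand-written sample records.
with tempfile.TemporaryDirectory() as tmp:
    sample = [
        {"url": "http://example.com/a", "language": "ENGLISH", "text": "Hello"},
        {"url": "http://example.com/b", "language": "SPANISH", "text": "Hola"},
    ]
    with (Path(tmp) / "sample.jsonl").open("w", encoding="utf-8") as fh:
        for record in sample:
            fh.write(json.dumps(record) + "\n")
    counts = language_counts(tmp)

print(counts)
```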
Customization Options
HTML Text Extraction Algorithms
Curator supports several HTML text extraction algorithms:
| Extractor | Library |
|---|---|
| JusTextExtractor | jusText |
| ResiliparseExtractor | Resiliparse |
| TrafilaturaExtractor | Trafilatura |
Configuring HTML Extractors
from nemo_curator.stages.text.download import CommonCrawlDownloadExtractStage
from nemo_curator.stages.text.download.html_extractors import ResiliparseExtractor
from nemo_curator.stages.text.download.html_extractors import TrafilaturaExtractor

# Use Resiliparse for extraction
cc_stage = CommonCrawlDownloadExtractStage(
    start_snapshot="2020-50",
    end_snapshot="2020-50",
    download_dir="./downloads",
    html_extraction=ResiliparseExtractor(
        required_stopword_density=0.25,
        main_content=True,
    ),
)

# Or use Trafilatura with custom parameters
cc_stage = CommonCrawlDownloadExtractStage(
    start_snapshot="2020-50",
    end_snapshot="2020-50",
    download_dir="./downloads",
    html_extraction=TrafilaturaExtractor(
        min_extracted_size=200,
        max_repetitions=3,
    ),
)
Language Processing
You can customize language detection and extraction by providing stop words for different languages:
# Define custom stop words for specific languages
stop_lists = {
    "ENGLISH": frozenset(["the", "and", "is", "in", "for", "where", "when", "to", "at"]),
    "SPANISH": frozenset(["el", "la", "de", "que", "y", "en", "un", "es", "se", "no"]),
}

cc_stage = CommonCrawlDownloadExtractStage(
    start_snapshot="2020-50",
    end_snapshot="2020-50",
    download_dir="./downloads",
    stop_lists=stop_lists,
)
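To give a rough sense of why stop words matter for extraction quality, the sketch below computes a stop word density: the share of tokens in a paragraph that appear in the stop list. This is a simplified illustration, not how jusText or Curator compute it. `stopword_density` is a hypothetical helper, and the whitespace tokenization it uses would not work for languages written without spaces, such as Thai, Chinese, or Japanese.

```python
from typing import FrozenSet


def stopword_density(text: str, stop_words: FrozenSet[str]) -> float:
    """Return the fraction of whitespace-separated tokens found in stop_words."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(token in stop_words for token in tokens)
    return hits / len(tokens)


english = frozenset(["the", "and", "is", "in", "for", "where", "when", "to", "at"])

# Running prose tends to score high; navigation-style boilerplate tends to score low.
prose_density = stopword_density("The cat is in the garden", english)
menu_density = stopword_density("Home Products Pricing Contact", english)
print(prose_density, menu_density)
```

A threshold such as required_stopword_density can then be read as the minimum density a block of text needs in order to be kept.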
WARC Record Reader Usage
Use CommonCrawlWARCReader when your dataset already contains WARC metadata columns from a CC Index lookup:
from nemo_curator.stages.text.download.common_crawl.download import CommonCrawlWARCReader
# HTTPS transport (default)
warc_reader = CommonCrawlWARCReader(
    warc_filename_col="warc_filename",
    warc_record_offset_col="warc_record_offset",
    warc_record_length_col="warc_record_length",
    max_workers=16,
)

# S3 transport
warc_reader = CommonCrawlWARCReader(
    warc_filename_col="warc_filename",
    warc_record_offset_col="warc_record_offset",
    warc_record_length_col="warc_record_length",
    use_s3=True,
    max_workers=16,
)
WARC Record Reader Parameters
| Parameter | Type | Description | Default |
|---|---|---|---|
| warc_filename_col | str | Column name containing the WARC filename | "warc_filename" |
| warc_record_offset_col | str | Column name containing the byte offset | "warc_record_offset" |
| warc_record_length_col | str | Column name containing the record length | "warc_record_length" |
| binary_content_col | str | Output column name for fetched content | "binary_content" |
| drop_failed | bool | Drop rows where the fetch failed | True |
| max_workers | int | Number of parallel threads for fetching | 16 |
| timeout | int | Request timeout in seconds | 30 |
| max_retries | int | Number of retries for failed requests | 3 |
| use_s3 | bool \| None | Use S3 transport instead of HTTPS. If None, reads the CC_USE_S3 environment variable (accepted values: 1, true, yes). | None |
| s3_bucket | str \| None | S3 bucket name. Falls back to the CC_S3_BUCKET environment variable, then "commoncrawl". | None |
| s3_key_prefix | str \| None | Prefix to strip from warc_filename when building the S3 object key. Falls back to the CC_S3_KEY_PREFIX environment variable. | None |
Environment Variables
You can configure CommonCrawlWARCReader S3 transport using environment variables instead of constructor parameters:
| Variable | Description | Example |
|---|---|---|
| CC_USE_S3 | Enable S3 transport | 1, true, yes |
| CC_S3_BUCKET | Override the S3 bucket name | commoncrawl |
| CC_S3_KEY_PREFIX | Prefix to strip from warc_filename for S3 key construction | crawl-data/ |
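The fallback order described above (constructor argument first, then environment variable, then built-in default) can be sketched as follows. The helper names are hypothetical and this is not Curator's implementation; in particular, treating the accepted values case-insensitively is an assumption. Each helper takes an optional `env` mapping so the logic can be tried without touching the real environment.

```python
import os
from typing import Mapping, Optional

_TRUTHY = {"1", "true", "yes"}


def resolve_use_s3(use_s3: Optional[bool] = None, env: Optional[Mapping[str, str]] = None) -> bool:
    """An explicit argument wins; otherwise read CC_USE_S3."""
    env = os.environ if env is None else env
    if use_s3 is not None:
        return use_s3
    # Assumption: values are compared case-insensitively.
    return env.get("CC_USE_S3", "").strip().lower() in _TRUTHY


def resolve_bucket(s3_bucket: Optional[str] = None, env: Optional[Mapping[str, str]] = None) -> str:
    """Argument, then CC_S3_BUCKET, then the default "commoncrawl"."""
    env = os.environ if env is None else env
    return s3_bucket or env.get("CC_S3_BUCKET") or "commoncrawl"


def s3_key(warc_filename: str, s3_key_prefix: Optional[str] = None,
           env: Optional[Mapping[str, str]] = None) -> str:
    """Strip the configured prefix from warc_filename to form the object key."""
    env = os.environ if env is None else env
    prefix = s3_key_prefix if s3_key_prefix is not None else env.get("CC_S3_KEY_PREFIX", "")
    if prefix and warc_filename.startswith(prefix):
        return warc_filename[len(prefix):]
    return warc_filename


print(resolve_use_s3(env={"CC_USE_S3": "yes"}), resolve_bucket(env={}))
```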
Advanced Usage
Processing CC-NEWS Data
For Common Crawl News data, use the news crawl type with month-based snapshots:
cc_stage = CommonCrawlDownloadExtractStage(
    start_snapshot="2020-08",  # YYYY-MM format for CC-NEWS
    end_snapshot="2020-10",
    download_dir="./news_downloads",
    crawl_type="news",  # Use CC-NEWS instead of CC-MAIN
)
See https://data.commoncrawl.org/crawl-data/CC-NEWS/index.html for more information.
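For intuition about what a month range covers, the sketch below expands an inclusive "YYYY-MM" range into individual months and builds a listing URL for each. The helper names are hypothetical, and the `crawl-data/CC-NEWS/YYYY/MM/warc.paths.gz` layout is an assumption about the public bucket structure, not something taken from Curator's code.

```python
from typing import List


def months_between(start: str, end: str) -> List[str]:
    """Expand an inclusive YYYY-MM range into a list of YYYY-MM strings."""
    start_year, start_month = map(int, start.split("-"))
    end_year, end_month = map(int, end.split("-"))
    if (start_year, start_month) > (end_year, end_month):
        raise ValueError("start must not be later than end")
    months = []
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        months.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            # Roll over into January of the next year.
            year, month = year + 1, 1
    return months


def news_listing_url(snapshot: str) -> str:
    """Build the assumed WARC path listing URL for one CC-NEWS month."""
    year, month = snapshot.split("-")
    return f"https://data.commoncrawl.org/crawl-data/CC-NEWS/{year}/{month}/warc.paths.gz"


for snapshot in months_between("2020-08", "2020-10"):
    print(news_listing_url(snapshot))
```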
Large-Scale Processing
For production workloads, consider these optimizations:
cc_stage = CommonCrawlDownloadExtractStage(
    start_snapshot="2020-50",
    end_snapshot="2020-50",
    download_dir="/fast_storage/cc_downloads",
    use_aws_to_download=True,  # Faster S3 downloads
    verbose=False,  # Reduce logging overhead
    # Remove limits for full processing
    # url_limit=None,
    # record_limit=None
)
Memory Management for Extraction
JusText extraction relies on lxml/libxml2, which can cause C-heap memory fragmentation during long-running jobs. Over many WARC files, this fragmentation causes resident memory to grow until workers run out of memory.
To mitigate this, Curator automatically sets extractor_max_calls_per_worker=2 when using JusTextExtractor. This restarts Ray Data worker processes every two tasks, reclaiming fragmented memory. You can override this value:
# Increase recycling frequency for very memory-constrained environments
cc_stage = CommonCrawlDownloadExtractStage(
    start_snapshot="2020-50",
    end_snapshot="2020-50",
    download_dir="./downloads",
    extractor_max_calls_per_worker=1,  # Recycle after every task
)

# Disable worker recycling (not recommended for large jobs with jusText)
cc_stage = CommonCrawlDownloadExtractStage(
    start_snapshot="2020-50",
    end_snapshot="2020-50",
    download_dir="./downloads",
    extractor_max_calls_per_worker=None,  # No recycling
)