Wikipedia
Download and extract text from Wikipedia Dumps using Curator.
Wikipedia releases compressed dumps of all its content in XML format twice per month. Curator provides a complete pipeline to automatically download, parse, and extract clean text from these dumps.
How it Works
The Wikipedia pipeline in Curator consists of four stages:
- URL Generation: Automatically discovers Wikipedia dump URLs for the specified language and date
- Download: Downloads compressed .bz2 dump files using wget
- Iteration: Parses XML content and extracts individual articles
- Extraction: Cleans Wikipedia markup and converts to plain text
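The Iteration stage can be pictured with a minimal standard-library sketch. This is not Curator's implementation; the function names `local_name` and `iter_articles` are hypothetical, and the sketch only assumes that a dump is a bz2-compressed XML file made of `<page>` elements that each hold a `<title>` and a revision `<text>`.

```python
import bz2
import xml.etree.ElementTree as ET
from typing import Iterator, Tuple


def local_name(tag: str) -> str:
    """Strip the XML namespace, for example '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(dump_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (title, raw_wikitext) for each <page> in a .bz2 XML dump.

    Hypothetical helper for illustration; not part of Curator's API.
    """
    with bz2.open(dump_path, "rb") as stream:
        # Stream the file so the whole dump is never parsed in one go.
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if local_name(elem.tag) != "page":
                continue
            title, text = "", ""
            for child in elem.iter():
                name = local_name(child.tag)
                if name == "title":
                    title = child.text or ""
                elif name == "text":
                    text = child.text or ""
            yield title, text
            # Drop the page's children to limit memory use.
            elem.clear()
```

The yielded text is still raw Wikipedia markup; cleaning it is the job of the Extraction stage.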
Before You Start
Wikipedia publishes new dumps around the first and twentieth of each month. Refer to the English Wikipedia dumps index at https://dumps.wikimedia.org/enwiki/ for available dates.
Curator uses wget to download Wikipedia dumps. You must have wget installed on your system:
- On macOS: brew install wget
- On Ubuntu/Debian: sudo apt-get install wget
- On CentOS/RHEL: sudo yum install wget
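To confirm that wget is reachable before launching a long download, a small pre-flight check along these lines may help. It uses only the standard library; `tool_available` is a hypothetical helper name, not a Curator function.

```python
import shutil


def tool_available(name: str) -> bool:
    """Return True if an executable called `name` is found on PATH."""
    return shutil.which(name) is not None


if not tool_available("wget"):
    print("wget was not found on PATH; install it before running the pipeline")
```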
Usage
Here’s how to download and extract Wikipedia data using Curator:
from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.download import WikipediaDownloadExtractStage
from nemo_curator.stages.text.io.writer import JsonlWriter
# Initialize Ray client
ray_client = RayClient()
ray_client.start()
# Create and configure pipeline
pipeline = Pipeline(
    name="wikipedia_pipeline",
    description="Download and process Wikipedia dumps"
)
# Create the Wikipedia processing stage
wikipedia_stage = WikipediaDownloadExtractStage(
    language="en",
    download_dir="./wikipedia_downloads",
    dump_date=None,  # None uses latest dump
    url_limit=5,  # Optional: limit number of dump files (useful for testing)
    record_limit=1000,  # Optional: limit articles per dump file
    verbose=True
)
pipeline.add_stage(wikipedia_stage)
# Create writer stage to save results
writer_stage = JsonlWriter(
    path="./wikipedia_output"
)
pipeline.add_stage(writer_stage)
# Execute the pipeline
results = pipeline.run()
# Stop Ray client
ray_client.stop()
For executor options and configuration, refer to Execution Backends.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| language | str | "en" | Language code for Wikipedia dump (for example, en, es, fr). Most follow ISO 639‑1, with project-specific exceptions such as simple. Refer to Meta‑Wiki List of Wikipedia language editions for supported edition codes and List of ISO 639 language codes for general codes. |
| download_dir | str | "./wikipedia_downloads" | Directory to store downloaded .bz2 files |
| dump_date | Optional[str] | None | Specific dump date in "YYYYMMDD" format (for example, "20240401"). Dumps are published around the first and twentieth of each month. If None, uses the latest available dump |
| wikidumps_index_prefix | str | "https://dumps.wikimedia.org" | Base URL for Wikipedia dumps index |
| verbose | bool | False | Enable verbose logging during download |
| url_limit | Optional[int] | None | Maximum number of dump URLs to process (useful for testing) |
| record_limit | Optional[int] | None | Maximum number of articles to extract per dump file |
| add_filename_column | bool \| str | True | Whether to add source filename column to output; if str, uses it as the column name (default name: "file_name") |
| log_frequency | int | 1000 | How often to log progress during article processing |
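
Because dump_date must be a "YYYYMMDD" string, it can be worth validating the value before passing it to the stage. The sketch below is a hedged illustration using the standard library only. Both helper names are hypothetical, and the URL layout (`<prefix>/<language>wiki/<date>/`) is an assumption based on the public Wikimedia dumps site, not a statement of how Curator builds its URLs internally.

```python
from datetime import datetime
from typing import Optional


def validate_dump_date(dump_date: Optional[str]) -> Optional[str]:
    """Return dump_date unchanged if it is None or a real YYYYMMDD date."""
    if dump_date is None:
        return None  # None means "use the latest available dump"
    if len(dump_date) != 8 or not dump_date.isdigit():
        raise ValueError(f"dump_date must look like YYYYMMDD, got {dump_date!r}")
    # strptime raises ValueError for impossible dates such as 20240230.
    datetime.strptime(dump_date, "%Y%m%d")
    return dump_date


def dump_index_url(
    language: str,
    dump_date: str,
    prefix: str = "https://dumps.wikimedia.org",
) -> str:
    """Build the assumed index URL for one language edition and dump date."""
    return f"{prefix}/{language}wiki/{dump_date}/"


print(dump_index_url("en", validate_dump_date("20240401")))
```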
Output Format
The pipeline emits processed Wikipedia articles as DocumentBatch objects. Each record contains the following fields:
- text: The cleaned main text content of the article
- title: The title of the Wikipedia article
- id: Wikipedia’s unique identifier for the article
- url: The constructed Wikipedia URL for the article
- language: The language code of the article
- source_id: Identifier of the source dump file
If you enable add_filename_column, the output includes an extra field file_name (or your custom column name).
Example Output Record
{
  "text": "Python is a high-level, general-purpose programming language...",
  "title": "Python (programming language)",
  "id": "23862",
  "url": "https://en.wikipedia.org/wiki/Python_(programming_language)",
  "language": "en",
  "source_id": "enwiki-20240401-pages-articles-multistream1.xml"
}
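Records like the one above can be read back with a few lines of standard-library Python. This sketch assumes the writer leaves one or more `*.jsonl` files in the output directory with one JSON object per line; the file naming and the helper name `iter_records` are assumptions for illustration.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator


def iter_records(output_dir: str) -> Iterator[Dict[str, Any]]:
    """Yield one dict per non-empty line across all *.jsonl files."""
    for path in sorted(Path(output_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():  # skip blank lines
                    yield json.loads(line)


if __name__ == "__main__":
    # Print the title and text length of each extracted article.
    for record in iter_records("./wikipedia_output"):
        print(record["title"], len(record["text"]))
```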