Wikipedia

Download and extract text from Wikipedia dumps using Curator.

Wikipedia releases compressed dumps of all its content in XML format twice per month. Curator provides a complete pipeline to automatically download, parse, and extract clean text from these dumps.

How It Works

The Wikipedia pipeline in Curator consists of four stages:

URL Generation: Automatically discovers Wikipedia dump URLs for the specified language and date
Download: Downloads compressed .bz2 dump files using wget
Iteration: Parses XML content and extracts individual articles
Extraction: Cleans Wikipedia markup and converts to plain text
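
To make the Iteration and Extraction stages concrete, here is a minimal standard-library sketch that walks a small bz2-compressed XML sample and yields one record per article. It is not Curator's implementation: the names `iter_articles`, `strip_markup`, and `SAMPLE_XML` are hypothetical, and the cleanup handles only bold/italic quotes and simple links, whereas real Wikipedia markup needs far more.

```python
import bz2
import io
import re
import xml.etree.ElementTree as ET

# Tiny stand-in for a dump file (hypothetical sample, not real dump content).
SAMPLE_XML = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <id>1</id>
    <revision>
      <id>10</id>
      <text>'''Example''' is a [[sample article|sample]] page.</text>
    </revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(stream):
    """Yield (title, page_id, raw_wikitext) for each <page> in the stream."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        fields = {}
        for child in elem.iter():
            name = local_name(child.tag)
            if name in ("title", "id", "text"):
                # setdefault keeps the first <id>: the page id, not the revision id.
                fields.setdefault(name, child.text or "")
        yield fields.get("title", ""), fields.get("id", ""), fields.get("text", "")
        elem.clear()  # release the parsed page; real dumps are very large


def strip_markup(wikitext):
    """Rough cleanup: unwrap [[target|label]] links and remove quote markup."""
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", wikitext)
    text = re.sub(r"'{2,}", "", text)
    return text.strip()


# Compress the sample in memory, then stream it back the way a dump would be read.
compressed = bz2.compress(SAMPLE_XML.encode("utf-8"))
with bz2.open(io.BytesIO(compressed), "rb") as stream:
    articles = [(t, i, strip_markup(x)) for t, i, x in iter_articles(stream)]

print(articles)
```

Streaming with `iterparse` and clearing each page after use keeps memory flat, which matters because a full dump does not fit comfortably in memory as a single parsed tree.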

Before You Start

Wikipedia publishes new dumps around the first and twentieth of each month. Refer to the English Wikipedia dumps index at https://dumps.wikimedia.org/enwiki/ for available dates.

Curator uses wget to download Wikipedia dumps. You must have wget installed on your system:

On macOS: brew install wget
On Ubuntu/Debian: sudo apt-get install wget
On CentOS/RHEL: sudo yum install wget
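
If you want to confirm that wget is available before starting a long run, a quick check along these lines can help. This is a sketch using only the Python standard library; `find_wget` is a hypothetical helper, not part of Curator.

```python
import shutil


def find_wget():
    """Return the full path to wget, or None when it is not on PATH."""
    return shutil.which("wget")


wget_path = find_wget()
if wget_path is None:
    print("wget not found; install it before running the pipeline")
else:
    print(f"wget found at {wget_path}")
```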

Usage

Here’s how to download and extract Wikipedia data using Curator:

from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.download import WikipediaDownloadExtractStage
from nemo_curator.stages.text.io.writer import JsonlWriter

# Initialize Ray client
ray_client = RayClient()
ray_client.start()

# Create and configure pipeline
pipeline = Pipeline(
    name="wikipedia_pipeline",
    description="Download and process Wikipedia dumps"
)

# Create the Wikipedia processing stage
wikipedia_stage = WikipediaDownloadExtractStage(
    language="en",
    download_dir="./wikipedia_downloads",
    dump_date=None,        # None uses latest dump
    url_limit=5,           # Optional: limit number of dump files (useful for testing)
    record_limit=1000,     # Optional: limit articles per dump file
    verbose=True
)
pipeline.add_stage(wikipedia_stage)

# Create writer stage to save results
writer_stage = JsonlWriter(
    path="./wikipedia_output"
)
pipeline.add_stage(writer_stage)

# Execute the pipeline, stopping the Ray client even if the run fails
try:
    results = pipeline.run()
finally:
    ray_client.stop()

For executor options and configuration, refer to Execution Backends.

Parameters

Parameter	Type	Default	Description
`language`	str	"en"	Language code for the Wikipedia dump (for example, `en`, `es`, `fr`). Most codes follow ISO 639-1, with project-specific exceptions such as `simple`. Refer to the Meta-Wiki list of Wikipedia language editions for supported edition codes and to the list of ISO 639 language codes for general codes.
`download_dir`	str	"./wikipedia_downloads"	Directory to store downloaded `.bz2` files
`dump_date`	Optional[str]	None	Specific dump date in "YYYYMMDD" format (for example, "20240401"). Dumps are published around the first and twentieth of each month. If None, uses the latest available dump
`wikidumps_index_prefix`	str	"https://dumps.wikimedia.org"	Base URL for the Wikipedia dumps index
`verbose`	bool	False	Enable verbose logging during download
`url_limit`	Optional[int]	None	Maximum number of dump URLs to process (useful for testing)
`record_limit`	Optional[int]	None	Maximum number of articles to extract per dump file
`add_filename_column`	bool | str	True	Whether to add a source filename column to the output; if a str, it is used as the column name (default name: "file_name")
`log_frequency`	int	1000	How often to log progress during article processing
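
The following sketch shows how `language`, `dump_date`, and `wikidumps_index_prefix` could combine into a dump index URL, and how a date string can be checked against the "YYYYMMDD" format before use. The helper `dump_index_url` is hypothetical, and the `<language>wiki` directory layout is an assumption based on the English index URL above; Curator's own URL discovery may differ, particularly for edition codes that contain hyphens.

```python
from datetime import datetime


def dump_index_url(language="en", dump_date=None, prefix="https://dumps.wikimedia.org"):
    """Build the index URL for one language edition, optionally for one dump date."""
    wiki = f"{language}wiki"  # assumed layout, e.g. "enwiki" for language="en"
    if dump_date is None:
        # No date given: point at the edition's top-level index of available dumps.
        return f"{prefix}/{wiki}/"
    if len(dump_date) != 8 or not dump_date.isdigit():
        raise ValueError(f"dump_date must be in YYYYMMDD format, got {dump_date!r}")
    # Raises ValueError for impossible dates such as "20241301".
    datetime.strptime(dump_date, "%Y%m%d")
    return f"{prefix}/{wiki}/{dump_date}/"


print(dump_index_url("en", "20240401"))
print(dump_index_url("es"))
```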

Output Format

The pipeline emits the processed Wikipedia articles as DocumentBatch objects. Each record contains the following fields:

text: The cleaned main text content of the article
title: The title of the Wikipedia article
id: Wikipedia’s unique identifier for the article
url: The constructed Wikipedia URL for the article
language: The language code of the article
source_id: Identifier of the source dump file

When add_filename_column is enabled (the default), the output includes an extra field, file_name (or your custom column name).
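
As an illustration of how a url value can be derived from language and title, the sketch below replaces spaces with underscores and percent-encodes the remaining characters. The function `article_url` is hypothetical, and the exact encoding rules Curator applies are not documented here, so treat the result as an approximation.

```python
from urllib.parse import quote


def article_url(language, title):
    """Approximate a Wikipedia article URL from a language code and a title."""
    slug = title.replace(" ", "_")
    # Keep parentheses, slashes, colons, and commas readable, as Wikipedia URLs do.
    return f"https://{language}.wikipedia.org/wiki/{quote(slug, safe='()/:,')}"


print(article_url("en", "Python (programming language)"))
```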

Example Output Record

{
  "text": "Python is a high-level, general-purpose programming language...",
  "title": "Python (programming language)",
  "id": "23862",
  "url": "https://en.wikipedia.org/wiki/Python_(programming_language)",
  "language": "en",
  "source_id": "enwiki-20240401-pages-articles-multistream1.xml"
}
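
Once the pipeline has run, records like the one above can be read back with the standard library. This sketch assumes the writer produces files ending in `.jsonl` directly inside the output directory, with one JSON object per line; the helper `read_records` is hypothetical, and the file naming may differ in your setup.

```python
import json
from pathlib import Path


def read_records(output_dir):
    """Yield one dict per non-empty line from every .jsonl file in output_dir."""
    for path in sorted(Path(output_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    yield json.loads(line)


# Print the title and text length of each extracted article, if any exist.
for record in read_records("./wikipedia_output"):
    print(record["title"], len(record["text"]))
```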