ArXiv

Download and extract text from ArXiv LaTeX source bundles using Curator.

ArXiv hosts millions of scholarly papers, typically distributed as LaTeX source inside .tar archives under the s3://arxiv/src/ requester-pays bucket.

How it Works

The ArXiv pipeline in Curator consists of four stages:

URL Generation: Lists available ArXiv source tar files from the S3 bucket
Download: Downloads tar archives using s5cmd (Requester Pays)
Iteration: Extracts LaTeX projects and yields per-paper records
Extraction: Cleans LaTeX and produces plain text

Before You Start

You must have:

An AWS account with credentials configured (profile, environment, or instance role). Access to s3://arxiv/src/ uses S3 Requester Pays; you incur charges for listing and data transfer. If you use aws s3, include the flag --request-payer requester and ensure your AWS credentials are active.
s5cmd installed

# Install s5cmd for requester-pays S3 downloads
pip install s5cmd

The examples on this page use s5cmd and include Requester Pays when running the pipeline.

Usage

Create and run an ArXiv processing pipeline and write outputs to JSONL:

from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.download import ArxivDownloadExtractStage
from nemo_curator.stages.text.io.writer import JsonlWriter

def main():
    # Initialize Ray client
    ray_client = RayClient()
    ray_client.start()

    pipeline = Pipeline(
        name="arxiv_pipeline",
        description="Download and process ArXiv LaTeX sources"
    )

    # Add ArXiv stage
    arxiv_stage = ArxivDownloadExtractStage(
        download_dir="./arxiv_downloads",
        url_limit=5,        # optional: number of tar files to process
        record_limit=1000,  # optional: max papers per tar
        add_filename_column=True,
        verbose=True,
    )
    pipeline.add_stage(arxiv_stage)

    # Add writer stage
    writer = JsonlWriter(path="./arxiv_output")
    pipeline.add_stage(writer)

    # Execute
    results = pipeline.run()
    print(f"Completed with {len(results) if results else 0} output files")

    # Stop Ray client
    ray_client.stop()

if __name__ == "__main__":
    main()

For executor options and configuration, refer to Execution Backends.

Parameters

Parameter	Type	Description	Default
`download_dir`	str	Directory to store downloaded `.tar` files	”./arxiv_downloads”
`url_limit`	int \| None	Maximum number of ArXiv tar files to download (useful for testing)	None
`record_limit`	int \| None	Maximum number of papers to extract per tar file	None
`add_filename_column`	bool \| str	Whether to add a source filename column to output; if str, use it as the column name	True (column name defaults to `file_name`)
`log_frequency`	int	How often to log progress while iterating papers	1000
`verbose`	bool	Enable verbose logging during download	False

Output Format

The extractor returns per-paper text; the filename column is optionally added by the pipeline:

{
  "text": "Main body text extracted from LaTeX after cleaning...",
  "file_name": "arXiv_src_2024_01.tar"
}

Field	Description
`text`	Extracted and cleaned paper text (LaTeX macros inlined where supported, comments and references removed)
`file_name`	Optional. Name of the source tar file (enabled by `add_filename_column`)

During iteration the pipeline yields id (ArXiv identifier), source_id (tar base name), and content (a list of LaTeX file contents as strings; one element per .tex file). The final extractor stage emits text plus the optional filename column.