Nemotron-Parse PDF Pipeline

Convert PDF datasets into interleaved Parquet output using NVIDIA’s Nemotron-Parse vision-language model. Unlike traditional text-only PDF parsers, Nemotron-Parse extracts text, images, and reading order in one pass — producing rows directly compatible with the interleaved dataset format.

How it Works

NemotronParsePDFReader is a composite stage that expands into four underlying sub-stages:

PDFPartitioningStage: reads a JSONL manifest of PDF entries and packs them into FileGroupTask objects.
PDFPreprocessStage: extracts PDF bytes from the configured source and renders pages to images, scaling oversized pages down to fit so they do not cause out-of-memory (OOM) errors.
NemotronParseInferenceStage: runs Nemotron-Parse via vLLM (recommended) or Hugging Face Transformers. It exposes the text_in_pic and enforce_eager flags and retries on a free port when a port collision occurs.
NemotronParsePostprocessStage: parses model output, aligns images and captions, crops images, and emits the final interleaved rows.

The output is interleaved Parquet ready to be filtered with Interleaved Filters and written to MINT-1T-style WebDataset shards.
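
The scale-to-fit safeguard in the preprocess stage can be pictured with a small sketch. This is not the library's implementation: the function name fit_scale and the max_side_px pixel budget are hypothetical, and the only fact relied on is that PDF page sizes are expressed in points (1/72 inch), so a scale of 1.0 corresponds to 72 DPI.

```python
def fit_scale(width_pt: float, height_pt: float, dpi: int = 300,
              max_side_px: int = 4096) -> float:
    """Return a render scale (1.0 == 72 DPI) whose longest side fits max_side_px.

    max_side_px is a hypothetical budget chosen for illustration only.
    """
    if width_pt <= 0 or height_pt <= 0:
        raise ValueError("page dimensions must be positive")
    scale = dpi / 72.0  # PDF page sizes are given in points (1/72 inch)
    longest_px = max(width_pt, height_pt) * scale
    if longest_px > max_side_px:
        # Shrink uniformly so the longest side lands on the pixel budget.
        scale *= max_side_px / longest_px
    return scale


# A US Letter page (612 x 792 pt) keeps the requested DPI, while a
# 100 x 100 inch poster (7200 x 7200 pt) is scaled down to fit.
letter_scale = fit_scale(612, 792)
poster_scale = fit_scale(7200, 7200)
```

Capping the longest side rather than the pixel count keeps the arithmetic simple; a real implementation might cap total pixels instead.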

Before You Start

Choose your PDF source and confirm the prerequisites:

GPU: Required. Nemotron-Parse runs on GPU via vLLM (recommended) or Hugging Face Transformers.
vLLM: Strongly recommended for throughput. Set backend="hf" to run on Hugging Face Transformers instead.
pypdfium2: Required Python dependency for PDF rendering. Installed automatically with the interleaved_cpu or interleaved_cuda12 extras (e.g., uv sync --extra interleaved_cuda12).
Manifest: A JSONL file listing the PDFs to process. Each line should specify the PDF location relative to the source directory you choose.
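
If you do not have a manifest yet, a short script can generate one from a directory of PDFs. The sketch below assumes each manifest line needs only the file_name field (the default value of file_name_field) holding a path relative to the source directory. The helper name build_manifest is hypothetical, and your dataset may need extra fields such as url.

```python
import json
from pathlib import Path


def build_manifest(pdf_dir: str, manifest_path: str, field: str = "file_name") -> int:
    """Write one JSON object per PDF found under pdf_dir; return the entry count.

    Assumption: a manifest entry only needs the PDF path relative to pdf_dir.
    """
    root = Path(pdf_dir)
    count = 0
    with open(manifest_path, "w", encoding="utf-8") as out:
        for pdf in sorted(root.rglob("*.pdf")):
            entry = {field: pdf.relative_to(root).as_posix()}
            out.write(json.dumps(entry) + "\n")
            count += 1
    return count
```

For example, build_manifest("/data/pdfs", "./pdfs.jsonl") would produce lines such as {"file_name": "report.pdf"}.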

Choosing a PDF Source

Pass exactly one of pdf_dir, zip_base_dir, or jsonl_base_dir so the preprocess stage knows where to find the PDF bytes:

Parameter	Source Layout	When to Use
`pdf_dir`	A directory of `.pdf` files	Local or mounted directories of standalone PDFs
`zip_base_dir`	A `CC-MAIN-2021-31-PDF-UNTRUNCATED` zip hierarchy	Common Crawl PDF dumps
`jsonl_base_dir`	JSONL-encoded PDF datasets where each line carries the PDF bytes	GitHub-hosted PDF datasets, custom JSONL collections
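
Because exactly one source may be set, it can help to validate your configuration before building the pipeline. The following is a minimal sketch of such a check; pick_pdf_source is a hypothetical helper, and whether the reader itself raises on a bad combination is not stated here.

```python
def pick_pdf_source(pdf_dir=None, zip_base_dir=None, jsonl_base_dir=None):
    """Return (parameter_name, path) for the single configured PDF source."""
    candidates = {
        "pdf_dir": pdf_dir,
        "zip_base_dir": zip_base_dir,
        "jsonl_base_dir": jsonl_base_dir,
    }
    chosen = {name: path for name, path in candidates.items() if path is not None}
    if len(chosen) != 1:
        given = ", ".join(chosen) or "none"
        raise ValueError(
            f"pass exactly one of pdf_dir, zip_base_dir, jsonl_base_dir (got: {given})"
        )
    return next(iter(chosen.items()))
```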

Backend Selection

Backend	When to Use
`vllm` (recommended)	High-throughput GPU inference with batching. Set `enforce_eager=True` if you hit compilation issues.
`hf`	Hugging Face Transformers fallback when vLLM is unavailable or for debugging.

The inference stage retries on port collisions when binding the vLLM server, so multi-replica deployments on the same node coexist cleanly.
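
The general free-port retry pattern looks roughly like the sketch below. It is an illustration using only the Python standard library, not the stage's actual code: find_free_port, start_with_port_retry, and the start_server callback are hypothetical names.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS choose a free port
        return sock.getsockname()[1]


def start_with_port_retry(start_server, max_attempts: int = 5):
    """Call start_server(port) on fresh ports until it succeeds.

    Another replica can grab the port between the probe and the real bind,
    so an OSError (for example "address already in use") triggers a retry.
    """
    last_error = None
    for _ in range(max_attempts):
        port = find_free_port()
        try:
            return start_server(port), port
        except OSError as exc:
            last_error = exc
    raise RuntimeError(f"no usable port after {max_attempts} attempts") from last_error
```

The retry is needed because probing for a free port and binding to it are two separate steps, which leaves a small window for another process to take the port.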

Usage

A minimal end-to-end pipeline that reads PDFs from a directory and writes interleaved Parquet:

from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.interleaved.pdf.nemotron_parse import NemotronParsePDFReader
from nemo_curator.stages.interleaved.io.writers.tabular import InterleavedParquetWriter

pipeline = Pipeline(name="pdf_to_interleaved")

# 1. Parse PDFs into interleaved rows
pipeline.add_stage(
    NemotronParsePDFReader(
        manifest_path="./pdfs.jsonl",
        pdf_dir="/data/pdfs",
        backend="vllm",
        pdfs_per_task=10,
        max_pages=50,
        inference_batch_size=4,
    )
)

# 2. Write interleaved Parquet
pipeline.add_stage(InterleavedParquetWriter(output_dir="./parsed_pdfs"))

executor = XennaExecutor()
pipeline.run(executor)

For executor options and configuration, refer to Execution Backends.

Example: CC-MAIN PDF Dump

Parse a Common Crawl PDF dump from its zip hierarchy:

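from nemo_curator.stages.interleaved.pdf.nemotron_parse import NemotronParsePDFReader

# Read PDFs from the CC-MAIN zip hierarchy; each manifest entry names its
# PDFs in the "cc_pdf_file_names" field.
reader = NemotronParsePDFReader(
    manifest_path="./cc_pdfs.jsonl",
    zip_base_dir="/data/CC-MAIN-2021-31-PDF-UNTRUNCATED",
    backend="vllm",
    file_names_field="cc_pdf_file_names",
    pdfs_per_task=20,
)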

Example: JSONL-Encoded PDFs

Parse a JSONL-encoded dataset (e.g., GitHub-hosted PDFs where each line contains the bytes):

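from nemo_curator.stages.interleaved.pdf.nemotron_parse import NemotronParsePDFReader

# Read PDF bytes from JSONL files under jsonl_base_dir.
reader = NemotronParsePDFReader(
    manifest_path="./github_pdfs.jsonl",
    jsonl_base_dir="/data/github_pdfs",
    backend="vllm",
)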

Parameters

Parameter	Type	Default	Description
`manifest_path`	str | None	`None`	JSONL manifest listing PDF entries.
`pdf_dir`	str | None	`None`	Directory containing `.pdf` files.
`zip_base_dir`	str | None	`None`	Root directory of CC-MAIN PDF zip hierarchy.
`jsonl_base_dir`	str | None	`None`	Root directory of JSONL-encoded PDF datasets.
`model_path`	str	(default model)	Local path or HF repo ID for the Nemotron-Parse weights.
`backend`	str	`"vllm"`	Inference backend (`vllm` or `hf`).
`pdfs_per_task`	int	`10`	Number of PDFs grouped into each `FileGroupTask`.
`max_pdfs`	int | None	`None`	Hard cap on total PDFs processed (debug aid).
`dpi`	int	`300`	Render DPI for PDF pages.
`max_pages`	int	`50`	Maximum pages rendered per PDF; longer PDFs are truncated.
`inference_batch_size`	int	`4`	vLLM/HF batch size.
`max_num_seqs`	int	`64`	Maximum concurrent vLLM sequences.
`text_in_pic`	bool	`False`	When `True`, treat embedded text within rendered images as part of the text content.
`enforce_eager`	bool	`False`	Disable vLLM compilation for compatibility with restricted environments.
`min_crop_px`	int	`10`	Minimum dimension (pixels) for cropped image regions.
`dataset_name`	str	`"pdf_dataset"`	Logical dataset label written to output rows.
`file_name_field`	str	`"file_name"`	Manifest field naming a single PDF file.
`file_names_field`	str	`"cc_pdf_file_names"`	Manifest field naming a list of PDF files (CC-MAIN layout).
`url_field`	str	`"url"`	Manifest field for the source URL passthrough.

Output Format

Each output row represents a single item (text, image, or metadata) from a parsed PDF page. Rows sharing a sample_id belong to the same document. Example output JSON:

{
  "sample_id": "doc_42",
  "position": 0,
  "modality": "text",
  "text_content": "# Introduction\n\nThis paper investigates...",
  "binary_content": null,
  "source_files": ["pdf_42.pdf"],
  "url": "https://example.com/pdf_42.pdf"
}
{
  "sample_id": "doc_42",
  "position": 1,
  "modality": "image",
  "text_content": null,
  "binary_content": "<bytes>",
  "source_files": ["pdf_42.pdf"]
}
{
  "sample_id": "doc_42",
  "position": 2,
  "modality": "text",
  "text_content": "Figure 1 shows the architecture...",
  "binary_content": null,
  "source_files": ["pdf_42.pdf"]
}

Output Schema

Column	Type	Description
`sample_id`	string	PDF identifier; rows sharing a `sample_id` belong to the same document.
`position`	int	Zero-based item position within the sample, used to reconstruct ordering.
`modality`	string	One of `text`, `image`, or `metadata`.
`text_content`	string | null	Text payload for `text` and `metadata` rows.
`binary_content`	bytes | null	Image payload for `image` rows.
`source_files`	list[string]	Source PDF files that produced this row (for lineage tracking).

The output is directly compatible with Interleaved IO readers and writers — the schema matches INTERLEAVED_SCHEMA exactly.
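
To rebuild documents from these rows, group by sample_id and sort by position. The sketch below works on plain dictionaries with the columns listed above; the helper names group_rows and render_text, and the "<image>" placeholder, are illustrative choices and not part of the library.

```python
from collections import defaultdict


def group_rows(rows):
    """Group rows by sample_id and order each group by position."""
    docs = defaultdict(list)
    for row in rows:
        docs[row["sample_id"]].append(row)
    return {
        sample_id: sorted(items, key=lambda r: r["position"])
        for sample_id, items in docs.items()
    }


def render_text(items):
    """Join text rows in reading order, marking where images occur."""
    parts = []
    for row in items:
        if row["modality"] == "text":
            parts.append(row["text_content"])
        elif row["modality"] == "image":
            parts.append("<image>")  # placeholder for the binary payload
        # metadata rows are skipped in this plain-text view
    return "\n".join(parts)
```

The same grouping applies after loading the Parquet output into a DataFrame, where it becomes a group-by on sample_id followed by a sort on position.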

Render Timeout

The preprocess stage enforces its render timeout with a forked child process (multiprocessing with the fork start method) instead of signal.SIGALRM; the limit is _RENDER_TIMEOUT_S = 60 seconds by default. This is required because Xenna runs stage workers inside Ray actor processes on non-main threads, where installing a SIGALRM handler raises ValueError: signal only works in main thread. The forked child inherits the PDF bytes via copy-on-write and is killed if it exceeds the timeout, which reliably escapes any hung C-extension code inside pypdfium2.

You don’t need to configure this; it works automatically. If you find legitimate PDFs that take longer than 60 seconds to render, adjust the _RENDER_TIMEOUT_S constant in nemo_curator/stages/interleaved/pdf/nemotron_parse/preprocess.py.
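
For readers curious about the technique, a fork-based timeout can be sketched with the standard library alone. This is a simplified illustration and not the code in preprocess.py: run_with_timeout and _child are hypothetical names, the fork start method is POSIX-only, and the result must be picklable to travel back over the pipe.

```python
import multiprocessing as mp


def _child(conn, fn, args):
    """Run fn in the forked child and send its outcome back over the pipe."""
    try:
        conn.send(("ok", fn(*args)))
    except Exception as exc:
        conn.send(("err", repr(exc)))
    finally:
        conn.close()


def run_with_timeout(fn, args=(), timeout_s=60.0):
    """Run fn(*args) in a forked child; kill the child if it exceeds timeout_s."""
    ctx = mp.get_context("fork")  # child inherits parent memory copy-on-write
    recv_end, send_end = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_child, args=(send_end, fn, args))
    proc.start()
    send_end.close()  # the parent keeps only the receiving end
    try:
        if not recv_end.poll(timeout_s):
            raise TimeoutError(f"call exceeded {timeout_s} s")
        try:
            status, payload = recv_end.recv()
        except EOFError:
            raise RuntimeError("child exited without returning a result") from None
    finally:
        if proc.is_alive():
            proc.kill()  # SIGKILL also stops code stuck inside a C extension
        proc.join()
        recv_end.close()
    if status == "err":
        raise RuntimeError(payload)
    return payload
```

Unlike a signal handler, this approach works from any thread, because the parent only waits on a pipe and the hung work lives in a separate process that can be killed outright.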

Benchmarking

A standalone benchmark script ships at benchmarking/scripts/nemotron_parse_pdf_benchmark.py. Use it to measure throughput on representative datasets before scaling to your full corpus.

Best Practices

Use vLLM unless you can’t: the vllm backend is substantially faster than hf. Only fall back to hf for debugging or in environments where vLLM is unavailable.
Cap max_pages for outliers: very long PDFs (1000+ pages) can dominate runtime. The default of 50 pages covers most academic papers and articles; raise it to 200 or more for book-length sources.
Tune pdfs_per_task for parallelism: smaller values (5–10) parallelize better across many GPUs; larger values (20–50) reduce per-task overhead on smaller clusters.
Set enforce_eager=True in restricted environments: vLLM’s torch.compile path can fail on certain hosts. Disabling compilation trades throughput for compatibility.
Pair with interleaved filters: PDF parsing produces noisy output. Chain with the Interleaved Filters (blur, CLIP score) to drop low-quality samples before training.
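
To see how pdfs_per_task affects parallelism, consider how a manifest is split into tasks. The sketch below assumes simple fixed-size chunking in manifest order; the actual packing logic in PDFPartitioningStage may differ, and pack_tasks is a hypothetical helper.

```python
def pack_tasks(entries, pdfs_per_task=10):
    """Split manifest entries into consecutive groups of at most pdfs_per_task."""
    if pdfs_per_task < 1:
        raise ValueError("pdfs_per_task must be at least 1")
    return [
        entries[start:start + pdfs_per_task]
        for start in range(0, len(entries), pdfs_per_task)
    ]


# With 100 PDFs, smaller groups yield more tasks to spread across GPUs,
# while larger groups yield fewer tasks and less per-task overhead.
manifest = [f"doc_{i}.pdf" for i in range(100)]
many_small_tasks = pack_tasks(manifest, pdfs_per_task=5)
few_large_tasks = pack_tasks(manifest, pdfs_per_task=50)
```

Under this assumption, 100 PDFs become 20 tasks at pdfs_per_task=5 but only 2 tasks at pdfs_per_task=50, so a cluster with more than two GPUs would sit partly idle with the larger setting.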

Interleaved IO — readers and writers that consume the Parquet output of this pipeline.
Interleaved Filters — sample-level filters to apply after parsing.
Common Crawl — companion source for web-scale PDF input via CC-MAIN dumps.