Nemotron-Parse PDF Pipeline
Convert PDF datasets into interleaved Parquet output using NVIDIA’s Nemotron-Parse vision-language model. Unlike traditional text-only PDF parsers, Nemotron-Parse extracts text, images, and reading order in one pass — producing rows directly compatible with the interleaved dataset format.
How it Works
NemotronParsePDFReader is a composite stage that expands into four underlying sub-stages:
- PDFPartitioningStage: reads a JSONL manifest of PDF entries and packs them into FileGroupTask objects.
- PDFPreprocessStage: extracts PDF bytes from the configured source and renders pages to images, with a scale-to-fit safeguard against OOM on large pages.
- NemotronParseInferenceStage: runs Nemotron-Parse via vLLM (recommended) or Hugging Face Transformers, with text_in_pic and enforce_eager flags and free-port retry on collisions.
- NemotronParsePostprocessStage: parses model output, aligns images and captions, crops images, and emits the final interleaved rows.
The output is interleaved Parquet ready to be filtered with Interleaved Filters and written to MINT-1T-style WebDataset shards.
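As a rough illustration of the partitioning step, the sketch below reads a JSONL manifest and packs its entries into fixed-size groups. It is not the library's implementation: pack_manifest is a hypothetical helper, and plain Python lists stand in for FileGroupTask objects. The pdfs_per_task and max_pdfs arguments mirror the parameters documented on this page.

```python
import json
from pathlib import Path


def pack_manifest(manifest_path, pdfs_per_task=10, max_pdfs=None):
    """Read a JSONL manifest and group its entries into lists of pdfs_per_task.

    Hypothetical helper: plain lists stand in for FileGroupTask objects.
    """
    if pdfs_per_task < 1:
        raise ValueError("pdfs_per_task must be at least 1")
    entries = []
    with Path(manifest_path).open(encoding="utf-8") as fh:
        for line in fh:
            if max_pdfs is not None and len(entries) >= max_pdfs:
                break  # optional hard cap on the number of PDFs
            line = line.strip()
            if not line:
                continue  # tolerate blank lines in the manifest
            entries.append(json.loads(line))
    # Slice the flat entry list into consecutive groups.
    return [entries[i:i + pdfs_per_task] for i in range(0, len(entries), pdfs_per_task)]
```

Under these assumptions, a manifest with 25 entries and pdfs_per_task=10 yields three groups of 10, 10, and 5 entries.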
Before You Start
Choose your PDF source and confirm the prerequisites:
- GPU: Required. Nemotron-Parse runs on GPU via vLLM (recommended) or Hugging Face Transformers.
- vLLM: Strongly recommended for throughput. Falls back to HF Transformers if backend="hf" is set.
- pypdfium2: Required Python dependency for PDF rendering. Installed automatically with the interleaved_cpu or interleaved_cuda12 extras (e.g., uv sync --extra interleaved_cuda12).
- Manifest: A JSONL file listing the PDFs to process. Each line should specify the PDF location relative to the source directory you choose.
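If you do not have a manifest yet, one way to build it for a plain directory of PDFs is sketched below. The write_manifest helper and its base_url argument are hypothetical. The "file_name" and "url" keys are assumed to match the default file_name_field and url_field values, and whether the reader expects additional fields is not covered here.

```python
import json
from pathlib import Path


def write_manifest(pdf_dir, manifest_path, base_url=None):
    """Write one JSON object per PDF found under pdf_dir; return the count."""
    pdf_dir = Path(pdf_dir)
    count = 0
    with Path(manifest_path).open("w", encoding="utf-8") as fh:
        for pdf in sorted(pdf_dir.rglob("*.pdf")):
            # Store the location relative to the source directory.
            rel = pdf.relative_to(pdf_dir).as_posix()
            entry = {"file_name": rel}
            if base_url is not None:
                entry["url"] = f"{base_url.rstrip('/')}/{rel}"
            fh.write(json.dumps(entry) + "\n")
            count += 1
    return count
```

For example, write_manifest("/data/pdfs", "./pdfs.jsonl") would produce a manifest that pairs with pdf_dir="/data/pdfs" in the reader.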
Choosing a PDF Source
Pass exactly one of pdf_dir, zip_base_dir, or jsonl_base_dir so the preprocess stage knows where to find the PDF bytes:
| Parameter | Source Layout | When to Use |
|---|---|---|
| pdf_dir | A directory of .pdf files | Local or mounted directories of standalone PDFs |
| zip_base_dir | A CC-MAIN-2021-31-PDF-UNTRUNCATED zip hierarchy | Common Crawl PDF dumps |
| jsonl_base_dir | JSONL-encoded PDF datasets where each line carries the PDF bytes | GitHub-hosted PDF datasets, custom JSONL collections |
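The "exactly one source" rule can be checked up front in your own driver code before constructing the reader. The following is a minimal sketch; resolve_pdf_source is a hypothetical helper, not part of the library, and the reader may perform its own validation.

```python
def resolve_pdf_source(pdf_dir=None, zip_base_dir=None, jsonl_base_dir=None):
    """Return (parameter_name, path) for the single source that was set."""
    candidates = {
        "pdf_dir": pdf_dir,
        "zip_base_dir": zip_base_dir,
        "jsonl_base_dir": jsonl_base_dir,
    }
    chosen = [(name, path) for name, path in candidates.items() if path is not None]
    if len(chosen) != 1:
        got = ", ".join(name for name, _ in chosen) or "none"
        raise ValueError(f"Pass exactly one of {', '.join(candidates)}; got: {got}")
    return chosen[0]
```

Calling resolve_pdf_source(pdf_dir="/data/pdfs") returns the pair ("pdf_dir", "/data/pdfs"), while passing zero or several sources raises ValueError.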
Backend Selection
| Backend | When to Use |
|---|---|
| vllm (recommended) | High-throughput GPU inference with batching. Set enforce_eager=True if you hit compilation issues. |
| hf | Hugging Face Transformers fallback when vLLM is unavailable or for debugging. |
The inference stage retries on port collisions when binding the vLLM server, so multi-replica deployments on the same node coexist cleanly.
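The general pattern behind free-port retry can be sketched as follows. This is an illustration only, not the inference stage's code: start_server is a hypothetical callable that launches a server on the given port and raises OSError if the port is already in use, and max_attempts is an assumed knob.

```python
import errno
import socket


def pick_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS choose a free port
        return sock.getsockname()[1]


def start_with_port_retry(start_server, max_attempts=5, host="127.0.0.1"):
    """Call start_server(port), retrying with a new port if it is already taken."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        port = pick_free_port(host)
        try:
            return port, start_server(port)
        except OSError as exc:
            # Another process may grab the port between the probe and the bind.
            if exc.errno != errno.EADDRINUSE or attempt == max_attempts:
                raise
```

The probe socket is closed before the server binds, so a race with another replica is still possible; the retry loop is what absorbs that collision.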
Usage
A minimal end-to-end pipeline that reads PDFs from a directory and writes interleaved Parquet:
```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.interleaved.pdf.nemotron_parse import NemotronParsePDFReader
from nemo_curator.stages.interleaved.io.writers.tabular import InterleavedParquetWriter

pipeline = Pipeline(name="pdf_to_interleaved")

# 1. Parse PDFs into interleaved rows
pipeline.add_stage(
    NemotronParsePDFReader(
        manifest_path="./pdfs.jsonl",
        pdf_dir="/data/pdfs",
        backend="vllm",
        pdfs_per_task=10,
        max_pages=50,
        inference_batch_size=4,
    )
)

# 2. Write interleaved Parquet
pipeline.add_stage(InterleavedParquetWriter(output_dir="./parsed_pdfs"))

executor = XennaExecutor()
pipeline.run(executor)
```
For executor options and configuration, refer to Execution Backends.
Example: CC-MAIN PDF Dump
Parse a Common Crawl PDF dump from its zip hierarchy:
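```python
from nemo_curator.stages.interleaved.pdf.nemotron_parse import NemotronParsePDFReader

NemotronParsePDFReader(
    manifest_path="./cc_pdfs.jsonl",
    zip_base_dir="/data/CC-MAIN-2021-31-PDF-UNTRUNCATED",
    backend="vllm",
    file_names_field="cc_pdf_file_names",
    pdfs_per_task=20,
)
```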
Example: JSONL-Encoded PDFs
Parse a JSONL-encoded dataset (e.g., GitHub-hosted PDFs where each line contains the bytes):
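```python
from nemo_curator.stages.interleaved.pdf.nemotron_parse import NemotronParsePDFReader

NemotronParsePDFReader(
    manifest_path="./github_pdfs.jsonl",
    jsonl_base_dir="/data/github_pdfs",
    backend="vllm",
)
```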
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| manifest_path | str \| None | None | JSONL manifest listing PDF entries. |
| pdf_dir | str \| None | None | Directory containing .pdf files. |
| zip_base_dir | str \| None | None | Root directory of CC-MAIN PDF zip hierarchy. |
| jsonl_base_dir | str \| None | None | Root directory of JSONL-encoded PDF datasets. |
| model_path | str | (default model) | Local path or HF repo ID for the Nemotron-Parse weights. |
| backend | str | "vllm" | Inference backend (vllm or hf). |
| pdfs_per_task | int | 10 | Number of PDFs grouped into each FileGroupTask. |
| max_pdfs | int \| None | None | Hard cap on total PDFs processed (debug aid). |
| dpi | int | 300 | Render DPI for PDF pages. |
| max_pages | int | 50 | Maximum pages rendered per PDF; longer PDFs are truncated. |
| inference_batch_size | int | 4 | vLLM/HF batch size. |
| max_num_seqs | int | 64 | Maximum concurrent vLLM sequences. |
| text_in_pic | bool | False | When True, treat embedded text within rendered images as part of the text content. |
| enforce_eager | bool | False | Disable vLLM compilation for compatibility with restricted environments. |
| min_crop_px | int | 10 | Minimum dimension (pixels) for cropped image regions. |
| dataset_name | str | "pdf_dataset" | Logical dataset label written to output rows. |
| file_name_field | str | "file_name" | Manifest field naming a single PDF file. |
| file_names_field | str | "cc_pdf_file_names" | Manifest field naming a list of PDF files (CC-MAIN layout). |
| url_field | str | "url" | Manifest field for the source URL passthrough. |
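To get a feel for what the dpi setting implies, recall that PDF page sizes are measured in points (1/72 inch), so a target DPI corresponds to a scale factor of dpi / 72. The sketch below works through that arithmetic and adds a simple scale-to-fit cap. The render_scale helper and its max_side_px argument are hypothetical; the actual safeguard in the preprocess stage may use a different limit or rule.

```python
def render_scale(width_pt, height_pt, dpi=300, max_side_px=None):
    """Return (scale, width_px, height_px) for rendering one page.

    PDF sizes are in points (1/72 inch), so the scale for a target DPI is
    dpi / 72. max_side_px is a hypothetical cap on the longest rendered side.
    """
    scale = dpi / 72.0
    if max_side_px is not None:
        longest_px = max(width_pt, height_pt) * scale
        if longest_px > max_side_px:
            # Shrink uniformly so the longest side fits within the cap.
            scale *= max_side_px / longest_px
    return scale, round(width_pt * scale), round(height_pt * scale)
```

A US Letter page (612 x 792 points) at 300 DPI renders to 2550 x 3300 pixels, while a very large page is scaled down so that neither side exceeds the cap, which bounds the memory used per rendered page.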
Output Format
Each output row represents a single item (text, image, or metadata) from a parsed PDF page. Rows sharing a sample_id belong to the same document. Example output JSON:
```json
{
  "sample_id": "doc_42",
  "position": 0,
  "modality": "text",
  "text_content": "# Introduction\n\nThis paper investigates...",
  "binary_content": null,
  "source_files": ["pdf_42.pdf"],
  "url": "https://example.com/pdf_42.pdf"
}
{
  "sample_id": "doc_42",
  "position": 1,
  "modality": "image",
  "text_content": null,
  "binary_content": "<bytes>",
  "source_files": ["pdf_42.pdf"]
}
{
  "sample_id": "doc_42",
  "position": 2,
  "modality": "text",
  "text_content": "Figure 1 shows the architecture...",
  "binary_content": null,
  "source_files": ["pdf_42.pdf"]
}
```
Output Schema
| Column | Type | Description |
|---|---|---|
| sample_id | string | PDF identifier; rows sharing a sample_id belong to the same document. |
| position | int | Zero-based item position within the sample, used to reconstruct ordering. |
| modality | string | One of text, image, or metadata. |
| text_content | string \| null | Text payload for text and metadata rows. |
| binary_content | bytes \| null | Image payload for image rows. |
| source_files | list[string] | Source PDF files that produced this row (for lineage tracking). |
The output is directly compatible with Interleaved IO readers and writers — the schema matches INTERLEAVED_SCHEMA exactly.
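Because each row carries a sample_id and a position, documents can be reassembled by grouping on the first and sorting on the second. The sketch below does this over plain dictionaries shaped like the example rows above; group_rows_by_sample and document_text are hypothetical helpers, not library functions.

```python
from collections import defaultdict


def group_rows_by_sample(rows):
    """Group interleaved rows by sample_id and sort each group by position."""
    samples = defaultdict(list)
    for row in rows:
        samples[row["sample_id"]].append(row)
    return {
        sample_id: sorted(items, key=lambda r: r["position"])
        for sample_id, items in samples.items()
    }


def document_text(items):
    """Join the text rows of one sample in reading order."""
    return "\n\n".join(
        r["text_content"]
        for r in items
        if r["modality"] == "text" and r["text_content"]
    )
```

Assuming the output files are ordinary Parquet, rows loaded with a DataFrame library and converted to a list of dictionaries can be passed to group_rows_by_sample directly.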
Render Timeout
Instead of signal.SIGALRM, the preprocess stage uses a multiprocessing fork-based timeout (_RENDER_TIMEOUT_S = 60 by default). This is required because Xenna runs stage workers inside Ray actor processes on non-main threads, where SIGALRM raises ValueError: signal only works in main thread. The forked child inherits the PDF bytes via copy-on-write and is killed if it exceeds the timeout, reliably escaping any hung C-extension code inside pypdfium2.
You don’t need to configure this; it works automatically. If you find legitimate PDFs that take longer than 60 seconds to render, the _RENDER_TIMEOUT_S constant is defined in nemo_curator/stages/interleaved/pdf/nemotron_parse/preprocess.py.
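The general technique can be sketched with the standard library alone. This is not the code in preprocess.py: run_with_timeout and _run_in_child are hypothetical names, and the sketch assumes a POSIX host where the fork start method is available.

```python
import multiprocessing as mp


def _run_in_child(conn, func, args):
    """Child entry point: send back ("ok", result) or ("err", message)."""
    try:
        conn.send(("ok", func(*args)))
    except Exception as exc:  # report failures instead of dying silently
        conn.send(("err", repr(exc)))
    finally:
        conn.close()


def run_with_timeout(func, args=(), timeout_s=60.0):
    """Run func(*args) in a forked child and kill it if it exceeds timeout_s."""
    ctx = mp.get_context("fork")  # child sees the parent's memory copy-on-write
    parent_conn, child_conn = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_run_in_child, args=(child_conn, func, args))
    proc.start()
    child_conn.close()  # the parent only reads; the child holds the write end
    try:
        if not parent_conn.poll(timeout_s):
            raise TimeoutError(f"call exceeded {timeout_s} s")
        try:
            status, payload = parent_conn.recv()
        except EOFError:
            raise RuntimeError("child exited without returning a result") from None
    finally:
        if proc.is_alive():
            proc.kill()  # SIGKILL also stops code stuck inside a C extension
        proc.join()
        parent_conn.close()
    if status == "err":
        raise RuntimeError(payload)
    return payload
```

Unlike a signal-based alarm, this works from any thread, because the timeout is enforced by the parent waiting on a pipe rather than by a signal handler in the main thread.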
Benchmarking
A standalone benchmark script ships at benchmarking/scripts/nemotron_parse_pdf_benchmark.py. Use it to measure throughput on representative datasets before scaling to your full corpus.
Best Practices
- Use vLLM unless you can’t: the vllm backend is substantially faster than hf. Only fall back to hf for debugging or in environments where vLLM is unavailable.
- Cap max_pages for outliers: very long PDFs (1000+ pages) can dominate runtime. The default 50 pages handles most academic papers and articles; raise to 200+ for book-length sources.
- Tune pdfs_per_task for parallelism: smaller values (5–10) parallelize better across many GPUs; larger values (20–50) reduce per-task overhead on smaller clusters.
- Set enforce_eager=True in restricted environments: vLLM’s torch.compile path can fail on certain hosts. Disabling compilation trades throughput for compatibility.
- Pair with interleaved filters: PDF parsing produces noisy output. Chain with the Interleaved Filters (blur, CLIP score) to drop low-quality samples before training.
Related Topics
- Interleaved IO — readers and writers that consume the Parquet output of this pipeline.
- Interleaved Filters — sample-level filters to apply after parsing.
- Common Crawl — companion source for web-scale PDF input via CC-MAIN dumps.