Interleaved Filters
Drop low-quality samples from an InterleavedBatch before downstream training or further processing. Four filter stages are available, each targeting a different quality signal.
Understanding the Filters
What Each Filter Does
| Filter | Targets | Cost | Default Threshold |
|---|---|---|---|
| InterleavedBlurFilterStage | Out-of-focus / motion-blurred images | Cheap (CPU) | score_threshold=100.0 |
| InterleavedQRCodeFilterStage | Promotional / contact-info imagery | Cheap (CPU) | score_threshold=0.05 |
| InterleavedImageToTextRatioFilterStage | Image-dump or text-dump samples | Cheap (CPU) | No filtering by default |
| InterleavedCLIPScoreFilterStage | Misaligned image-text pairs | Expensive (GPU, ~20 GB) | min_score=0.15 |
Recommended Filter Order
Chain cheap filters first to reduce the number of samples expensive filters have to score. A typical order:
Blur → QR-code → Image-to-Text Ratio → CLIP Score
The CLIP filter dominates cost. Putting it last means it only runs against samples that survived all other checks.
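To make the cost argument concrete, the sketch below estimates how many samples reach the GPU stage after the CPU filters have run. The pass rates are made-up placeholders for illustration, not measurements; substitute rates observed on your own data.

```python
def samples_reaching_clip(total: int, cpu_pass_rates: list[float]) -> int:
    """Estimate how many samples survive the CPU filters and reach CLIP."""
    remaining = total
    for rate in cpu_pass_rates:
        # Each filter keeps only a fraction of what the previous one passed on.
        remaining = round(remaining * rate)
    return remaining


# Hypothetical pass rates for the blur, QR-code, and image-to-text ratio filters.
HYPOTHETICAL_PASS_RATES = [0.90, 0.98, 0.80]

reaching_clip = samples_reaching_clip(1_000_000, HYPOTHETICAL_PASS_RATES)
print(f"{reaching_clip} of 1000000 samples still need CLIP scoring")
```

Every sample removed by a CPU filter is one fewer GPU forward pass, so even modest drop rates early in the chain reduce the CLIP workload.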
InterleavedBlurFilterStage
Computes the Laplacian variance of each image (via OpenCV) and drops images below score_threshold. Lower variance means a flatter, blurrier image.
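The metric itself is easy to reproduce. With OpenCV, the usual idiom is `cv2.Laplacian(gray, cv2.CV_64F).var()`. The dependency-free sketch below shows the same idea on a grayscale image given as a list of rows. It is an illustration, not the stage's implementation: the helper names are hypothetical, and because it skips border pixels its values can differ slightly from OpenCV's.

```python
from statistics import pvariance


def laplacian_variance(gray: list[list[int]]) -> float:
    """Variance of the 4-neighbour Laplacian response over interior pixels."""
    height, width = len(gray), len(gray[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # 3x3 Laplacian kernel: the four direct neighbours minus 4x the centre.
            lap = (
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
            responses.append(lap)
    return pvariance(responses)


def keep_image(gray: list[list[int]], score_threshold: float = 100.0) -> bool:
    """Keep the image when its score reaches the minimum threshold."""
    return laplacian_variance(gray) >= score_threshold


# A uniform image has no edges; a checkerboard is all edges.
flat = [[128] * 6 for _ in range(6)]
checker = [[255 if (x + y) % 2 == 0 else 0 for x in range(6)] for y in range(6)]

print(keep_image(flat), keep_image(checker))
```

The flat image scores zero and is dropped, while the high-contrast checkerboard scores far above the default threshold and is kept.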
Threshold Guidelines
| score_threshold | Effect |
|---|---|
| 50 | Permissive — drops only severely blurred images |
| 100 (default) | Balanced — typical for general curation |
| 200 | Strict — keeps only sharp, high-detail images |
| 500+ | Very strict — useful for studio-quality requirements |
Usage
from nemo_curator.stages.interleaved.filter.blur_filter import InterleavedBlurFilterStage
pipeline.add_stage(InterleavedBlurFilterStage(score_threshold=100.0))
| Parameter | Type | Default | Description |
|---|---|---|---|
| score_threshold | float | 100.0 | Minimum Laplacian variance to keep an image. Higher = sharper-only output. |
InterleavedQRCodeFilterStage
Detects QR codes in each image (OpenCV) and drops samples whose largest QR-code bounding box exceeds score_threshold as a fraction of the total image area. Useful for stripping promotional or contact-info imagery from web crawls.
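The area-fraction check can be sketched as follows. The input is assumed to be a list of four-corner quadrilaterals, one per detected code, which is the kind of output `cv2.QRCodeDetector().detectMulti` produces. The helper names are hypothetical and the stage's exact logic may differ.

```python
Point = tuple[float, float]


def largest_qr_area_fraction(
    quads: list[list[Point]], width: int, height: int
) -> float:
    """Largest axis-aligned QR bounding-box area as a fraction of image area."""
    largest = 0.0
    for quad in quads:
        xs = [x for x, _ in quad]
        ys = [y for _, y in quad]
        box_area = (max(xs) - min(xs)) * (max(ys) - min(ys))
        largest = max(largest, box_area / (width * height))
    return largest


def should_drop(
    quads: list[list[Point]], width: int, height: int, score_threshold: float = 0.05
) -> bool:
    """Drop when the largest QR code exceeds the allowed area fraction."""
    return largest_qr_area_fraction(quads, width, height) > score_threshold


# One 200x300 px code in a 1000x800 px image covers 7.5% of the area.
detections = [[(100, 100), (300, 100), (300, 400), (100, 400)]]
print(should_drop(detections, width=1000, height=800))
```

An image with no detections yields a fraction of 0.0 and is never dropped by this check.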
Threshold Guidelines
| score_threshold | Effect |
|---|---|
| 0.01 | Very strict: drops images where a QR code covers more than 1% of the area |
| 0.05 (default) | Balanced: drops images where a QR code covers more than 5% of the area |
| 0.1+ | Permissive: drops only images where a QR code covers more than 10% of the area |
Usage
from nemo_curator.stages.interleaved.filter.qrcode_filter import InterleavedQRCodeFilterStage
pipeline.add_stage(InterleavedQRCodeFilterStage(score_threshold=0.05))
| Parameter | Type | Default | Description |
|---|---|---|---|
| score_threshold | float | 0.05 | Maximum allowed QR-bounding-box area as a fraction of total image area. Lower = stricter. |
InterleavedImageToTextRatioFilterStage
Computes the per-sample ratio of image count to text word count and drops samples outside a configurable min_ratio/max_ratio window. Useful for excluding image-dump samples (no text) and text-heavy samples with few or no images.
Range Guidelines
For mixed image-text pretraining data:
| Use Case | min_ratio | max_ratio |
|---|---|---|
| Balanced multimodal | 0.001 | 0.1 |
| Image-rich (caption-style) | 0.01 | 0.5 |
| Text-rich (article-style) | 0.0001 | 0.01 |
A min_ratio=0.001 means “at least one image per 1000 text words.” A max_ratio=0.1 means “no more than one image per 10 text words.”
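A minimal sketch of the ratio check, to make the window semantics concrete. Whitespace word splitting, inclusive bounds, and the handling of samples with no text are assumptions made for this illustration, and the function names are hypothetical.

```python
import math


def image_to_text_ratio(num_images: int, text: str) -> float:
    """Images per whitespace-separated word."""
    num_words = len(text.split())
    if num_words == 0:
        # Assumption: an image-only sample gets an infinite ratio, so any
        # finite max_ratio rejects it; a completely empty sample gets 0.0.
        return math.inf if num_images > 0 else 0.0
    return num_images / num_words


def in_ratio_window(
    ratio: float, min_ratio: float = 0.0, max_ratio: float = math.inf
) -> bool:
    """True when the ratio lies inside the (assumed inclusive) window."""
    return min_ratio <= ratio <= max_ratio


article = "word " * 1000  # 1000 words
ratio = image_to_text_ratio(2, article)  # 2 images per 1000 words
print(ratio, in_ratio_window(ratio, min_ratio=0.001, max_ratio=0.1))
```

With the defaults (`min_ratio=0.0`, `max_ratio=inf`) every sample passes, which matches the "no filtering by default" behavior listed in the overview table.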
Usage
from nemo_curator.stages.interleaved.filter.image_to_text_ratio_filter import (
    InterleavedImageToTextRatioFilterStage,
)

pipeline.add_stage(
    InterleavedImageToTextRatioFilterStage(min_ratio=0.001, max_ratio=0.1)
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| min_ratio | float | 0.0 | Minimum images-per-word ratio to keep. |
| max_ratio | float | inf | Maximum images-per-word ratio to keep. |
InterleavedCLIPScoreFilterStage
Uses a CLIP image-text encoder to score each (image, text) pair by cosine similarity, and drops samples whose alignment is below min_score. Ensures the textual content of each sample actually describes the image.
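The scoring step reduces to a cosine similarity between two embedding vectors. The sketch below uses tiny hand-written vectors in place of real CLIP embeddings, which typically have hundreds of dimensions. The function names are hypothetical and the encoder itself is out of scope here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unaligned rather than dividing by zero
    return dot / (norm_a * norm_b)


def keep_pair(
    image_emb: list[float], text_emb: list[float], min_score: float = 0.15
) -> bool:
    """Keep the pair when its similarity reaches the minimum score."""
    return cosine_similarity(image_emb, text_emb) >= min_score


# Stand-in embeddings: one well-aligned pair and one orthogonal pair.
aligned = keep_pair([3.0, 4.0], [4.0, 3.0])
orthogonal = keep_pair([1.0, 0.0], [0.0, 1.0])
print(aligned, orthogonal)
```

Because cosine similarity normalizes both vectors, the score depends only on direction, not on embedding magnitude.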
Threshold Guidelines
CLIP scores depend on the model variant; the table below assumes the default model:
| min_score | Effect |
|---|---|
| 0.10 | Permissive — keeps loosely aligned pairs (web crawl baseline) |
| 0.15 (default) | Balanced — drops mostly unrelated text-image pairs |
| 0.20 | Stricter — high-quality alignment for caption-style data |
| 0.30+ | Very strict — for caption datasets with manual review |
Usage
from nemo_curator.stages.interleaved.filter.clip_score_filter import (
    InterleavedCLIPScoreFilterStage,
)

pipeline.add_stage(
    InterleavedCLIPScoreFilterStage(
        model_dir="/models/clip",
        min_score=0.15,
    )
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| model_dir | str \| None | None | Local CLIP model directory. When None, the default model is downloaded on first use. |
| min_score | float | 0.15 | Minimum cosine similarity to keep a sample. |
The default resource allocation reserves gpu_memory_gb=20.0. Tune this on the stage’s Resources for smaller or larger CLIP variants.
Complete Filtering Pipeline
A pipeline that stacks all four filters in cost order, then writes the output in WebDataset (WDS) format:
from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.interleaved.io.reader import InterleavedParquetReader
from nemo_curator.stages.interleaved.io.writers.webdataset import (
    InterleavedWebdatasetWriterStage,
)
from nemo_curator.stages.interleaved.filter.blur_filter import InterleavedBlurFilterStage
from nemo_curator.stages.interleaved.filter.qrcode_filter import InterleavedQRCodeFilterStage
from nemo_curator.stages.interleaved.filter.image_to_text_ratio_filter import (
    InterleavedImageToTextRatioFilterStage,
)
from nemo_curator.stages.interleaved.filter.clip_score_filter import (
    InterleavedCLIPScoreFilterStage,
)

pipeline = Pipeline(name="interleaved_quality_filters")

# 1. Read interleaved Parquet
pipeline.add_stage(
    InterleavedParquetReader(file_paths="s3://bucket/raw/*.parquet")
)

# 2. Cheap filters first (CPU only)
pipeline.add_stage(InterleavedBlurFilterStage(score_threshold=100.0))
pipeline.add_stage(InterleavedQRCodeFilterStage(score_threshold=0.05))
pipeline.add_stage(
    InterleavedImageToTextRatioFilterStage(min_ratio=0.001, max_ratio=0.1)
)

# 3. Expensive CLIP filter last (GPU)
pipeline.add_stage(
    InterleavedCLIPScoreFilterStage(
        model_dir="/models/clip",
        min_score=0.20,  # stricter than the 0.15 default
    )
)

# 4. Write filtered output
pipeline.add_stage(InterleavedWebdatasetWriterStage(output_dir="./curated"))

executor = XennaExecutor()
pipeline.run(executor)
Best Practices
- Filter cheap to expensive: blur → QR-code → image-to-text ratio → CLIP. Each early filter reduces what the next has to score.
- Inspect score distributions before tightening thresholds: run each filter with permissive thresholds first, dump scores to a manifest, plot the distributions, and pick thresholds from percentiles. Defaults are starting points, not final answers.
- Don’t use CLIP without a budget: CLIP scoring is the most expensive filter by orders of magnitude. Scoring a dataset with millions of samples is still feasible, but plan GPU compute accordingly.
- Don’t mistake complementary filters for redundant ones: blur and CLIP score both penalize bad images, but in different ways. Blur catches optical issues; CLIP catches semantic mismatch. Use both, but tune their thresholds separately.
- Mind the image-to-text ratio for data composition: this filter is the easiest one to misuse. Set the wrong max_ratio and you’ll silently drop most caption-style data. Inspect a sample of dropped vs. kept data first.
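The percentile-based threshold selection recommended above can be sketched with the standard library alone. The scores here are synthetic stand-ins for values dumped from a permissive filter run, and `threshold_at_percentile` is a hypothetical helper, not part of the library.

```python
from statistics import quantiles


def threshold_at_percentile(scores: list[float], pct: int) -> float:
    """Return the score below which roughly `pct` percent of samples fall."""
    if not 1 <= pct <= 99:
        raise ValueError("pct must be between 1 and 99")
    # n=100 yields 99 cut points; cut point i-1 is the i-th percentile.
    cuts = quantiles(scores, n=100, method="inclusive")
    return cuts[pct - 1]


# Synthetic scores standing in for a dumped manifest of blur scores.
scores = [float(s) for s in range(101)]

# For a "keep if score >= threshold" filter, the 10th percentile as the
# threshold drops roughly the lowest-scoring 10% of samples.
threshold = threshold_at_percentile(scores, 10)
print(threshold)
```

Choosing the threshold from a percentile ties it to the fraction of data you are willing to lose, which is easier to reason about than a raw score value.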
Related Topics
- Interleaved IO: readers and writers that produce and consume the InterleavedBatch format these filters operate on.
- Nemotron-Parse PDF Pipeline: one source of interleaved data; pair with these filters for end-to-end PDF curation.