Fuzzy Duplicate Removal

Find and remove near-duplicate documents that differ only by small edits or reformatting, using MinHash and Locality Sensitive Hashing (LSH). This approach efficiently identifies candidate pairs above a similarity threshold, at scale, on GPUs.

For other approaches, refer to Deduplication.

How It Works

Fuzzy deduplication uses MinHash and LSH to find near-duplicate content:

Computes MinHash signatures over character n-grams
Uses Locality Sensitive Hashing (LSH) to find candidate matches
Builds a graph of duplicate relationships
Identifies groups of near-duplicate documents

This method is ideal for detecting documents with minor differences such as formatting changes, typos, or small edits, where documents share a high degree of overlapping content.
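The four steps above can be illustrated with a small CPU-only sketch. This is not the library's GPU implementation: the helper names (`char_ngram_set`, `minhash_signature`, `find_duplicate_groups`), the use of salted BLAKE2b as the hash family, and the union-find grouping are all assumptions chosen to keep the example self-contained.

```python
import hashlib
from collections import defaultdict


def char_ngram_set(text: str, n: int) -> set[str]:
    """Return the set of character n-grams (the whole text if shorter than n)."""
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def minhash_signature(text: str, n: int, num_hashes: int) -> list[int]:
    """Step 1: keep the minimum hash value per seeded hash function."""
    grams = [g.encode("utf-8") for g in char_ngram_set(text, n)]
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # illustrative hash family, one salt per seed
        signature.append(min(
            int.from_bytes(hashlib.blake2b(g, digest_size=8, salt=salt).digest(), "little")
            for g in grams
        ))
    return signature


def find_duplicate_groups(docs: dict[str, str], ngram_size: int = 24,
                          num_bands: int = 20, minhashes_per_band: int = 13) -> list[set[str]]:
    num_hashes = num_bands * minhashes_per_band
    signatures = {d: minhash_signature(t, ngram_size, num_hashes) for d, t in docs.items()}

    # Step 2: LSH. Documents whose signature slice is identical in any band share a bucket.
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for band in range(num_bands):
            start = band * minhashes_per_band
            buckets[(band, tuple(sig[start:start + minhashes_per_band]))].append(doc_id)

    # Step 3: treat shared buckets as graph edges, merged here with union-find.
    parent = {doc_id: doc_id for doc_id in docs}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for members in buckets.values():
        for other in members[1:]:
            parent[find(other)] = find(members[0])

    # Step 4: connected components with more than one member are duplicate groups.
    groups = defaultdict(set)
    for doc_id in docs:
        groups[find(doc_id)].add(doc_id)
    return [g for g in groups.values() if len(g) > 1]


docs = {
    "a": "The quick brown fox jumps over the lazy dog",
    "b": "The quick brown fox jumps over the lazy dog.",
    "c": "Pack my box with five dozen liquor jugs",
}
# A small n-gram size is used only because these demo strings are short.
print(find_duplicate_groups(docs, ngram_size=5))
```

With these inputs, documents "a" and "b" differ by a single character and would be expected to land in one group, while "c" shares no 5-grams with them and stays on its own.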

Before You Start

Prerequisites:

Ray cluster with GPU support (required for distributed processing)
Stable document identifiers for removal (either existing IDs or IDs generated by the workflow and removal stages)

Quick Start

The following example first identifies fuzzy duplicates, then removes them:

from nemo_curator.core.client import RayClient
from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

ray_client = RayClient()
ray_client.start()

# Step 1: Identify duplicates
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    input_path="input_data/",
    cache_path="./cache",
    output_path="./results",
    text_field="text",
    perform_removal=False,
    input_filetype="parquet",
    char_ngrams=24,
    num_bands=20,
    minhashes_per_band=13
)
result = fuzzy_workflow.run()
# result.metadata contains: total_time, num_duplicates, minhash_time, lsh_time, connected_components_pipeline_time, id_generator_path

# Step 2: Remove duplicates
removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path="input_data/",
    ids_to_remove_path="./results/FuzzyDuplicateIds",
    output_path="./deduplicated",
    input_filetype="parquet",
    input_id_field="_curator_dedup_id",
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path="./results/fuzzy_id_generator.json"
)
result = removal_workflow.run()
# result.metadata contains: total_time, num_duplicates_removed

Configuration

Configure fuzzy deduplication using these key parameters:

Parameter	Type	Default	Description
`input_path`	str | list[str]	None	Path(s) to input files or directories
`cache_path`	str	Required	Directory to cache intermediate results
`output_path`	str	Required	Directory to write duplicate IDs and ID generator
`text_field`	str	"text"	Name of the text field in input data
`char_ngrams`	int	24	Character n-gram size for MinHash (recommended: >= 20)
`num_bands`	int	20	Number of LSH bands (affects similarity threshold)
`minhashes_per_band`	int	13	Number of hashes per LSH band
`bands_per_iteration`	int	5	Bands processed per iteration (memory tuning)
`use_64_bit_hash`	bool	False	Use 64-bit hash (more memory, fewer collisions)
`seed`	int	42	Random seed for MinHash permutations
`input_filetype`	str	"parquet"	Input file format ("parquet" or "jsonl")
`input_blocksize`	str | int	"1GiB"	Size of input blocks for processing
`lsh_num_output_partitions`	int | None	None	Total number of partitions to write during the LSH shuffle. If `None`, the partition count is chosen automatically as the closest power of 2 <= the number of input tasks.
`lsh_rmm_pool_size`	int | "auto" | None	"auto"	Size of the RMM GPU memory pool in bytes for the LSH stage. `"auto"` sets the pool to 90% of free GPU memory. `None` sets the pool to 50% of free GPU memory and allows expansion.
`lsh_spill_memory_limit`	int | "auto" | None	"auto"	Device memory limit in bytes for spilling to host during the LSH stage. `"auto"` sets the limit to 80% of the RMM pool size. `None` disables spilling.
`perform_removal`	bool	False	Reserved; must remain `False`. Fuzzy removal is performed with `TextDuplicatesRemovalWorkflow`.
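To make the automatic defaults in the table concrete, the sketch below reproduces the arithmetic they describe. The function names are hypothetical and the integer rounding is an assumption; the library may round differently.

```python
def default_lsh_partitions(num_input_tasks: int) -> int:
    """Largest power of two that is <= num_input_tasks (lsh_num_output_partitions=None)."""
    if num_input_tasks < 1:
        raise ValueError("num_input_tasks must be at least 1")
    return 1 << (num_input_tasks.bit_length() - 1)


def auto_lsh_memory(free_gpu_bytes: int) -> tuple[int, int]:
    """Return (pool_bytes, spill_limit_bytes) for the "auto" settings.

    Pool is 90% of free GPU memory; the spill limit is 80% of the pool.
    Rounding down to whole bytes is an assumption of this sketch.
    """
    pool = free_gpu_bytes * 90 // 100
    spill = pool * 80 // 100
    return pool, spill


print(default_lsh_partitions(100))      # 100 input tasks
print(auto_lsh_memory(40 * 1024**3))    # 40 GiB of free GPU memory
```

Working through the numbers this way can help decide whether to pass explicit integer values for `lsh_rmm_pool_size` and `lsh_spill_memory_limit` on smaller GPUs.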

Similarity Threshold

Control matching strictness with num_bands and minhashes_per_band:

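More strict matching (higher similarity required): Decrease num_bands or increase minhashes_per_band
Less strict matching (lower similarity required): Increase num_bands or decrease minhashes_per_band

Two documents become a candidate pair when all hashes in at least one band match. Adding bands gives a pair more chances to collide, while adding hashes per band makes each collision harder. Under the standard LSH banding estimate, the similarity threshold is approximately (1/num_bands)^(1/minhashes_per_band). The default (num_bands=20, minhashes_per_band=13) gives a threshold of roughly 0.79 and provides a balanced trade-off between recall and precision for many datasets. Because LSH is probabilistic, pairs with similarity close to the threshold may or may not be detected.

Custom Similarity Threshold

# Example: stricter matching (fewer pairs detected, threshold roughly 0.83)
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    input_path="input_data/",
    cache_path="./cache",
    output_path="./results",
    num_bands=15,           # Fewer bands = stricter matching
    minhashes_per_band=15,  # More hashes per band = stricter matching
)

# Example: less strict matching (more pairs detected, threshold roughly 0.72)
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    input_path="input_data/",
    cache_path="./cache",
    output_path="./results",
    num_bands=25,           # More bands = less strict matching
    minhashes_per_band=10,  # Fewer hashes per band = less strict matching
)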
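To see how num_bands and minhashes_per_band shape detection, the sketch below evaluates the standard LSH banding formulas. A pair with Jaccard similarity s matches a single band with probability s^r and becomes a candidate with probability 1 - (1 - s^r)^b, where b is the number of bands and r the hashes per band. The function names are illustrative and not part of the library.

```python
def candidate_probability(similarity: float, num_bands: int, minhashes_per_band: int) -> float:
    """Probability that a pair with the given Jaccard similarity shares at least one LSH bucket."""
    return 1.0 - (1.0 - similarity ** minhashes_per_band) ** num_bands


def approximate_threshold(num_bands: int, minhashes_per_band: int) -> float:
    """Similarity near which the detection probability rises most steeply: (1/b)^(1/r)."""
    return (1.0 / num_bands) ** (1.0 / minhashes_per_band)


for bands, rows in [(20, 13), (25, 10), (15, 15)]:
    threshold = approximate_threshold(bands, rows)
    at_80 = candidate_probability(0.80, bands, rows)
    print(f"num_bands={bands}, minhashes_per_band={rows}: "
          f"threshold ~ {threshold:.3f}, P(candidate | s=0.80) = {at_80:.3f}")
```

Evaluating a few candidate settings this way before a full run gives a rough sense of which similarity range each configuration targets.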

Removing Duplicates

After identifying duplicates, use TextDuplicatesRemovalWorkflow to remove them:

from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path="/path/to/input/data",
    ids_to_remove_path="/path/to/output/FuzzyDuplicateIds",
    output_path="/path/to/deduplicated",
    input_filetype="parquet",
    input_id_field="_curator_dedup_id",
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path="/path/to/output/fuzzy_id_generator.json"  # Required if IDs were auto-assigned
)
result = removal_workflow.run()

ID Field Configuration

When IDs were auto-assigned:

id_generator_path is required
Ensures consistent ID mapping between identification and removal stages
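Conceptually, removal is a filter on the ID field: any record whose ID appears in the duplicate-ID list is dropped. The sketch below shows that idea on in-memory records. It is a simplified illustration with a hypothetical `remove_duplicates` helper, not how `TextDuplicatesRemovalWorkflow` is implemented.

```python
def remove_duplicates(records: list[dict], ids_to_remove: list[int],
                      id_field: str = "_curator_dedup_id") -> list[dict]:
    """Keep only records whose ID is not in the removal list."""
    drop = set(ids_to_remove)  # set lookup keeps the filter linear in the number of records
    return [record for record in records if record[id_field] not in drop]


records = [
    {"_curator_dedup_id": 0, "text": "original document"},
    {"_curator_dedup_id": 1, "text": "original document (reformatted)"},
    {"_curator_dedup_id": 2, "text": "unrelated document"},
]
kept = remove_duplicates(records, ids_to_remove=[1])
print([record["_curator_dedup_id"] for record in kept])
```

The filter only works if the IDs seen at removal time are the same ones assigned during identification, which is why the ID generator file must be passed along when IDs were auto-assigned.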

Output Format

The fuzzy deduplication process produces the following directory structure:

cache_path/
├── MinHashStage/                    # MinHash signatures
│   └── *.parquet
├── LSHStage/                        # LSH buckets
│   └── *.parquet
├── BucketsToEdges/                  # Graph edges
│   └── *.parquet
└── ConnectedComponents/             # Connected components
    └── *.parquet

output_path/
├── FuzzyDuplicateIds/               # Duplicate identification results
│   └── *.parquet                    # Parquet files with document IDs to remove
└── fuzzy_id_generator.json          # ID generator mapping (if IDs were auto-assigned)

File Formats

The workflow produces these output files:

Duplicate IDs (FuzzyDuplicateIds/*.parquet):
- Contains document IDs to remove
- Format: Parquet files with column: ["_curator_dedup_id"]
- Important: Contains only the IDs of documents to remove, not the full document content

ID Generator (fuzzy_id_generator.json):
- JSON file containing ID generator state
- Required for removal workflow when IDs were auto-assigned
- Ensures consistent ID mapping across workflow stages

Cache Files (cache_path/):
- Intermediate results for debugging and analysis
- Can be reused if re-running with different parameters
- Clear cache between runs if parameters change significantly
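Before starting the removal step, it can be useful to confirm that the identification outputs described above are present. The following is a minimal standard-library sketch; `check_identification_outputs` is a hypothetical helper, and it only checks the file and directory names listed in the output structure.

```python
from pathlib import Path


def check_identification_outputs(output_path: str) -> dict:
    """Report which expected identification outputs exist under output_path."""
    root = Path(output_path)
    ids_dir = root / "FuzzyDuplicateIds"
    parquet_files = sorted(p.name for p in ids_dir.glob("*.parquet")) if ids_dir.is_dir() else []
    return {
        "duplicate_id_files": parquet_files,
        "has_id_generator": (root / "fuzzy_id_generator.json").is_file(),
    }


status = check_identification_outputs("./results")
if not status["duplicate_id_files"]:
    print("No duplicate ID files found; removal has nothing to read.")
if not status["has_id_generator"]:
    print("No ID generator file found; it is needed if IDs were auto-assigned.")
```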

Performance Considerations

Performance characteristics:

GPU-accelerated MinHash and LSH operations
Scales across multiple GPUs and nodes using Ray
bands_per_iteration controls memory usage
Intermediate results are cached for efficiency

GPU requirements:

NVIDIA GPU with CUDA support
Ray cluster with GPU workers

Performance tuning:

Memory: Adjust bands_per_iteration (lower = less memory, more iterations)
GPU memory (LSH): Use lsh_rmm_pool_size to control GPU memory allocation and lsh_spill_memory_limit to tune host-spilling behavior during the LSH stage. Reducing the pool size or lowering the spill threshold can prevent out-of-memory errors on smaller GPUs.
Shuffle partitions: Set lsh_num_output_partitions to control the number of output partitions during the LSH shuffle. More partitions reduce per-partition memory but increase I/O overhead.
Accuracy: Use char_ngrams >= 20 to reduce false positives
Best practices: Clear cache between runs, use input_blocksize="1GiB"

Note: Performance depends on hardware configuration, dataset characteristics, and parameter choices such as bands_per_iteration, char_ngrams, and input_blocksize.
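As a rough illustration of the bands_per_iteration trade-off, the sketch below splits the bands into consecutive batches. It assumes bands are processed in contiguous groups of at most bands_per_iteration, which is an inference from the parameter description rather than a documented detail; `band_batches` is a hypothetical helper.

```python
def band_batches(num_bands: int, bands_per_iteration: int) -> list[tuple[int, int]]:
    """Return (start, end) band index ranges, end exclusive, one per LSH iteration."""
    if num_bands < 1 or bands_per_iteration < 1:
        raise ValueError("num_bands and bands_per_iteration must be positive")
    return [
        (start, min(start + bands_per_iteration, num_bands))
        for start in range(0, num_bands, bands_per_iteration)
    ]


# Defaults from the configuration table: 20 bands, 5 bands per iteration.
print(len(band_batches(20, 5)), band_batches(20, 5))
# A lower value means more, smaller iterations.
print(len(band_batches(20, 2)))
```

Under this assumption, halving bands_per_iteration roughly doubles the number of passes while reducing how many bands are held in memory at once.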

For comparison with other deduplication methods and guidance on when to use fuzzy deduplication, refer to the Deduplication overview.