Deduplication
Remove duplicate and near-duplicate documents from text datasets using NeMo Curator’s GPU-accelerated deduplication workflows. Removing duplicates prevents overrepresentation of repeated content in language model training.
NeMo Curator provides three deduplication approaches: exact matching (MD5 hashing), fuzzy matching (MinHash + LSH), and semantic matching (embeddings). All methods are GPU-accelerated and integrate with the data processing pipeline.
How It Works
NeMo Curator provides three deduplication approaches, each optimized for different duplicate types:
Exact
- Method: MD5 hashing
- Detects: Character-for-character identical documents
- Speed: Fastest
Computes MD5 hashes for each document’s text content and groups documents with identical hashes. Best for removing exact copies.
Code Example
from nemo_curator.core.client import RayClient
from nemo_curator.stages.deduplication.exact.workflow import ExactDeduplicationWorkflow

ray_client = RayClient()
ray_client.start()

exact_workflow = ExactDeduplicationWorkflow(
    input_path="/path/to/input/data",
    output_path="/path/to/output",
    text_field="text",
    perform_removal=False,  # Identification only
    assign_id=True,
    input_filetype="parquet",
)
result = exact_workflow.run()
# result.metadata contains: total_time, num_duplicates, identification_time, id_generator_path

For removal, use TextDuplicatesRemovalWorkflow with the generated duplicate IDs. See Exact Duplicate Removal for details.
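To illustrate the idea behind exact matching independently of NeMo Curator, the following minimal sketch hashes each document's text with MD5 and groups document IDs by digest. The function name find_exact_duplicates and the keep-the-first-copy policy are assumptions made for illustration; this is not the library's GPU implementation.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(docs):
    """Return IDs of documents whose text repeats an earlier document.

    `docs` is an iterable of (doc_id, text) pairs. Hypothetical helper:
    the first document seen in each hash group is kept.
    """
    groups = defaultdict(list)
    for doc_id, text in docs:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        groups[digest].append(doc_id)

    duplicates = []
    for ids in groups.values():
        duplicates.extend(ids[1:])  # everything after the first copy
    return sorted(duplicates)


docs = [
    (0, "The quick brown fox."),
    (1, "The quick brown fox."),
    (2, "The quick brown fox!"),  # differs by one character, so not a match
    (3, "The quick brown fox."),
]
duplicate_ids = find_exact_duplicates(docs)
print(duplicate_ids)
```

Document 2 differs by a single character and is therefore not grouped with the others, which is the gap that fuzzy deduplication is meant to close.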
Fuzzy
- Method: MinHash + Locality Sensitive Hashing (LSH)
- Detects: Near-duplicates with minor edits (~80% similarity)
- Speed: Fast
Generates MinHash signatures and uses LSH to find similar documents. Best for detecting documents with small formatting differences or typos.
Code Example
from nemo_curator.core.client import RayClient
from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow

ray_client = RayClient()
ray_client.start()

fuzzy_workflow = FuzzyDeduplicationWorkflow(
    input_path="/path/to/input/data",
    cache_path="/path/to/cache",
    output_path="/path/to/output",
    text_field="text",
    perform_removal=False,  # Identification only
    input_blocksize="1GiB",
    seed=42,
    char_ngrams=24,
    num_bands=20,
    minhashes_per_band=13,
)
result = fuzzy_workflow.run()
# result.metadata contains: total_time, num_duplicates, minhash_time, lsh_time, connected_components_pipeline_time, id_generator_path

For removal, use TextDuplicatesRemovalWorkflow with the generated duplicate IDs. See Fuzzy Duplicate Removal for details.
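The relationship between num_bands, minhashes_per_band, and the ~80% similarity figure can be sketched in plain Python. The example below is a CPU-only illustration of MinHash signatures and LSH banding, not NeMo Curator's implementation: the helper names are hypothetical, blake2b with a per-seed salt stands in for the hash family, and the threshold uses the standard approximation (1 / bands) ** (1 / rows).

```python
import hashlib


def char_shingles(text, n=24):
    """Set of overlapping character n-grams (the whole text if shorter than n)."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def minhash_signature(shingles, num_hashes):
    """One minimum per seeded hash function; the blake2b salt acts as the seed."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest()
            for s in shingles
        ))
    return signature


def shares_a_band(sig_a, sig_b, num_bands, rows_per_band):
    """LSH candidate test: True if any band of the two signatures is identical."""
    for band in range(num_bands):
        start = band * rows_per_band
        end = start + rows_per_band
        if sig_a[start:end] == sig_b[start:end]:
            return True
    return False


def candidate_probability(similarity, num_bands, rows_per_band):
    """Probability that two documents with this Jaccard similarity share a band."""
    return 1.0 - (1.0 - similarity ** rows_per_band) ** num_bands


num_bands, rows_per_band = 20, 13  # values from the workflow example
num_hashes = num_bands * rows_per_band

# Approximate similarity at which the candidate probability rises most steeply.
threshold = (1.0 / num_bands) ** (1.0 / rows_per_band)
print(f"approximate threshold: {threshold:.3f}")

text_a = "NeMo Curator removes near-duplicate documents from large text datasets."
text_b = "A completely unrelated sentence about gardening and tomato plants."
sig_a = minhash_signature(char_shingles(text_a), num_hashes)
sig_a_copy = minhash_signature(char_shingles(text_a), num_hashes)
sig_b = minhash_signature(char_shingles(text_b), num_hashes)

print(shares_a_band(sig_a, sig_a_copy, num_bands, rows_per_band))  # identical text
print(shares_a_band(sig_a, sig_b, num_bands, rows_per_band))       # unrelated text
```

With 20 bands of 13 hashes each, the approximate threshold comes out near 0.79, which is consistent with the ~80% similarity quoted for this configuration. More bands or fewer hashes per band lower the threshold and produce more candidate pairs.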
Semantic
- Method: Embeddings + clustering + pairwise similarity
- Detects: Semantically similar content (paraphrases, translations)
- Speed: Moderate
Generates embeddings using transformer models, clusters them, and computes pairwise similarities. Best for meaning-based deduplication.
Code Example
from nemo_curator.stages.text.deduplication.semantic import TextSemanticDeduplicationWorkflow

text_workflow = TextSemanticDeduplicationWorkflow(
    input_path="/path/to/input/data",
    output_path="/path/to/output",
    cache_path="/path/to/cache",
    text_field="text",
    n_clusters=100,
    eps=0.01,  # Similarity threshold
    perform_removal=True,  # Complete deduplication
)
result = text_workflow.run()
# result.metadata contains: total_time, num_duplicates, num_duplicates_removed

Note: Two workflows are available:
- TextSemanticDeduplicationWorkflow: For raw text with automatic embedding generation
- SemanticDeduplicationWorkflow: For pre-computed embeddings
See Semantic Deduplication for details.
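The pairwise-similarity step can be pictured with a small pure-Python sketch. It assumes a SemDeDup-style rule in which a document is flagged when its cosine similarity to an earlier document in the same cluster exceeds 1 - eps; that interpretation of eps, the helper names, and the two-dimensional toy vectors are all assumptions for illustration rather than the workflow's actual code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_duplicates(embeddings, eps):
    """Flag later documents whose similarity to a kept document exceeds 1 - eps.

    `embeddings` maps doc_id -> vector for the documents of ONE cluster.
    Assumed rule: the first document of each similar pair is kept.
    """
    ids = list(embeddings)
    flagged = set()
    for i, id_a in enumerate(ids):
        if id_a in flagged:
            continue  # already removed, so it cannot "absorb" other documents
        for id_b in ids[i + 1:]:
            if cosine_similarity(embeddings[id_a], embeddings[id_b]) > 1.0 - eps:
                flagged.add(id_b)
    return sorted(flagged)


cluster = {
    "a": [1.0, 0.0],
    "b": [0.999, 0.01],  # nearly parallel to "a"
    "c": [0.0, 1.0],     # orthogonal to "a"
}
print(semantic_duplicates(cluster, eps=0.01))
```

Clustering first keeps this comparison tractable: pairs are only compared within a cluster, so the quadratic cost applies per cluster rather than to the whole dataset.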
Advanced: Step-by-Step Semantic Deduplication
For fine-grained control, break semantic deduplication into separate stages:
from nemo_curator.stages.deduplication.id_generator import create_id_generator_actor
from nemo_curator.stages.text.embedders.vllm import VLLMEmbeddingModelStage
from nemo_curator.stages.deduplication.semantic import SemanticDeduplicationWorkflow
# Import paths below are assumed; verify them against your installed version
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import ParquetReader
from nemo_curator.stages.text.io.writer import ParquetWriter

# input_path, embedding_output_path, and semantic_workflow_path are placeholders you define

# 1. Create ID generator
create_id_generator_actor()

# 2. Generate embeddings separately (using vLLM)
embedding_pipeline = Pipeline(
    stages=[
        ParquetReader(file_paths=input_path, _generate_ids=True),
        VLLMEmbeddingModelStage(
            model_identifier="google/embeddinggemma-300m",
            text_field="text",
            embedding_field="embeddings",
        ),
        ParquetWriter(path=embedding_output_path, fields=["_curator_dedup_id", "embeddings"]),
    ]
)
embedding_out = embedding_pipeline.run()

# 3. Run clustering and pairwise similarity
semantic_workflow = SemanticDeduplicationWorkflow(
    input_path=embedding_output_path,
    output_path=semantic_workflow_path,
    n_clusters=100,
    id_field="_curator_dedup_id",
    embedding_field="embeddings",
    eps=None,  # Skip duplicate identification for analysis
)
result = semantic_workflow.run()

# 4. Analyze results and choose eps parameter
# 5. Identify and remove duplicates

This approach enables analysis of intermediate results and fine-grained control.
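One way to approach step 4 is to count how many documents each candidate eps would flag before committing to a value. The sketch below assumes you have already extracted, for every document, its maximum cosine similarity to another document in its cluster; the list max_similarities is made-up data, and how those values are read from the workflow output is not shown here.

```python
def count_flagged(max_similarities, eps):
    """Number of documents whose nearest-neighbor similarity exceeds 1 - eps."""
    return sum(1 for sim in max_similarities if sim > 1.0 - eps)


# Hypothetical per-document maximum cosine similarities within their clusters.
max_similarities = [0.9995, 0.97, 0.92, 0.85, 0.60, 0.30]

for eps in (0.001, 0.01, 0.05, 0.1, 0.2):
    flagged = count_flagged(max_similarities, eps)
    print(f"eps={eps}: {flagged} of {len(max_similarities)} documents flagged")
```

A small eps removes only near-identical content, while a larger eps trades recall of paraphrases against the risk of discarding documents that are merely related.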
Deduplication Methods
Choose a deduplication method based on your needs:
- Exact Duplicate Removal: Identify and remove character-for-character duplicates using MD5 hashing
- Fuzzy Duplicate Removal: Identify and remove near-duplicates using MinHash and LSH similarity
- Semantic Deduplication: Remove semantically similar documents using embeddings
Common Operations
Document IDs
Duplicate removal workflows require stable document identifiers. Choose one approach:
- Use AddId to add IDs at the start of your pipeline
- Use reader-based ID generation (_generate_ids, _assign_ids) backed by the ID Generator actor for stable integer IDs
- Use existing IDs if your documents already have unique identifiers
Some workflows write an ID generator state file (*_id_generator.json) for later removal when IDs are auto-assigned.
Removing Duplicates
Use TextDuplicatesRemovalWorkflow to apply duplicate IDs to your original dataset. Works with IDs from exact, fuzzy, or semantic deduplication.
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path="/path/to/input",
    ids_to_remove_path="/path/to/duplicates",  # ExactDuplicateIds/, FuzzyDuplicateIds/, or duplicates/
    output_path="/path/to/clean",
    input_filetype="parquet",
    input_id_field="_curator_dedup_id",
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path="/path/to/id_generator.json",  # Required when IDs were auto-assigned
)
result = removal_workflow.run()
# result.metadata contains: total_time, num_duplicates_removed
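Conceptually, removal is an anti-join between the original dataset and the list of duplicate IDs. The following in-memory sketch shows that operation on plain dictionaries; remove_duplicates is a hypothetical helper for illustration and says nothing about how the workflow executes the join at scale.

```python
def remove_duplicates(records, ids_to_remove, id_field="_curator_dedup_id"):
    """Keep only the records whose ID is absent from the duplicate ID list."""
    ids_to_remove = set(ids_to_remove)  # set membership keeps the filter linear
    return [record for record in records if record[id_field] not in ids_to_remove]


records = [
    {"_curator_dedup_id": 0, "text": "first copy"},
    {"_curator_dedup_id": 1, "text": "first copy"},
    {"_curator_dedup_id": 2, "text": "something else"},
]
clean = remove_duplicates(records, ids_to_remove=[1])
print([record["_curator_dedup_id"] for record in clean])
```

The join only works if the IDs in the dataset and in the duplicate list come from the same assignment, which is why the ID generator state matters when IDs were auto-assigned.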
ID Field Configuration
When assign_id=True (IDs auto-assigned):
- Duplicate IDs file contains _curator_dedup_id column
- Set ids_to_remove_duplicate_id_field="_curator_dedup_id"
- id_generator_path is required
When assign_id=False (using existing IDs):
- Duplicate IDs file contains the column specified by id_field (e.g., "id")
- Set ids_to_remove_duplicate_id_field to match your id_field value
- id_generator_path not required
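The two cases above can be captured in a small helper that builds the ID-related keyword arguments for the removal workflow. This is a sketch: removal_id_kwargs is a hypothetical function, and setting input_id_field to the same column as id_field in the existing-ID case is an assumption rather than something stated above.

```python
def removal_id_kwargs(assign_id, id_field=None, id_generator_path=None):
    """Build ID-related keyword arguments for the removal step (hypothetical helper)."""
    if assign_id:
        # IDs were auto-assigned, so the generator state is needed to reproduce them.
        if id_generator_path is None:
            raise ValueError("id_generator_path is required when IDs were auto-assigned")
        return {
            "input_id_field": "_curator_dedup_id",
            "ids_to_remove_duplicate_id_field": "_curator_dedup_id",
            "id_generator_path": id_generator_path,
        }
    if id_field is None:
        raise ValueError("id_field is required when using existing IDs")
    return {
        "input_id_field": id_field,  # assumption: dataset and duplicate list share the column
        "ids_to_remove_duplicate_id_field": id_field,
    }


auto_kwargs = removal_id_kwargs(True, id_generator_path="/path/to/id_generator.json")
existing_kwargs = removal_id_kwargs(False, id_field="id")
print(auto_kwargs)
print(existing_kwargs)
```

The resulting dictionary could then be unpacked into the workflow constructor alongside the path arguments.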
Workflow Results
All deduplication workflows return a WorkflowRunResult object with timing and duplicate count metadata:
from nemo_curator.pipeline.workflow import WorkflowRunResult
result = exact_workflow.run()
print(result.metadata) # {"total_time": 42.1, "num_duplicates": 1500, ...}
Available metadata varies by workflow. Common keys include total_time and num_duplicates.
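Because the available keys differ between workflows, code that reports on results is safer when it checks for each key instead of assuming it exists. The sketch below uses a plain dictionary as a stand-in for result.metadata; summarize_metadata is a hypothetical helper.

```python
def summarize_metadata(metadata):
    """Format the common keys, tolerating keys that a workflow does not report."""
    parts = []
    for key in ("total_time", "num_duplicates", "num_duplicates_removed"):
        if key in metadata:
            parts.append(f"{key}={metadata[key]}")
    return ", ".join(parts) if parts else "no common metadata keys present"


# Plain dict standing in for result.metadata from an identification-only run.
example_metadata = {"total_time": 42.1, "num_duplicates": 1500}
summary = summarize_metadata(example_metadata)
print(summary)
```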
Outputs and Artifacts
Each deduplication method produces specific output files and directories:
| Method | Duplicate IDs Location | ID Generator File | Deduplicated Output |
|---|---|---|---|
| Exact | ExactDuplicateIds/ (parquet) | exact_id_generator.json (if assign_id=True) | Via TextDuplicatesRemovalWorkflow |
| Fuzzy | FuzzyDuplicateIds/ (parquet) | fuzzy_id_generator.json (if IDs auto-assigned) | Via TextDuplicatesRemovalWorkflow |
| Semantic | output_path/duplicates/ (parquet) | N/A | output_path/deduplicated/ (if perform_removal=True) |
Column names:
- _curator_dedup_id when assign_id=True or IDs are auto-assigned
- Matches id_field parameter when assign_id=False
Choosing a Deduplication Method
Compare deduplication methods to select the best approach for your dataset:
| Method | Best For | Speed | Duplicate Types | GPU Required |
|---|---|---|---|---|
| Exact | Identical copies | Very fast | Character-for-character matches | Required |
| Fuzzy | Near-duplicates with small changes | Fast | Minor edits, reformatting (~80% similarity) | Required |
| Semantic | Similar meaning, different words | Moderate | Paraphrases, translations, rewrites | Required |
Quick Decision Guide
Use this guide to quickly select the right method:
- Start with Exact if you have numerous identical documents or need the fastest speed
- Use Fuzzy if you need to catch near-duplicates with minor formatting differences
- Use Semantic for meaning-based deduplication on large, diverse datasets
When to Use Each Method
Exact Deduplication:
- Removing identical copies of documents
- Fast initial deduplication pass
- Datasets with numerous exact duplicates
- When speed is more important than detecting near-duplicates
Fuzzy Deduplication:
- Removing near-duplicate documents with minor formatting differences
- Detecting documents with small edits or typos
- Fast deduplication when exact matching misses numerous duplicates
- When speed is important but some near-duplicate detection is needed
Semantic Deduplication:
- Removing semantically similar content (paraphrases, translations)
- Large, diverse web-scale datasets
- When meaning-based deduplication is more important than speed
- Advanced use cases requiring embedding-based similarity detection
Combining Methods
You can combine deduplication methods for comprehensive duplicate removal:
- Exact → Fuzzy → Semantic: Start with fastest methods, then apply more sophisticated methods
- Exact → Semantic: Use exact for quick wins, then semantic for meaning-based duplicates
- Fuzzy → Semantic: Use fuzzy for near-duplicates, then semantic for paraphrases
Run each method independently, then combine duplicate IDs before removal.
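Combining the results amounts to taking the union of the duplicate ID lists. The sketch below assumes every method was run against the same document identifiers, so that an ID means the same document in each list; combine_duplicate_ids and the sample IDs are hypothetical.

```python
def combine_duplicate_ids(*id_collections):
    """Union of duplicate IDs reported by several deduplication runs."""
    combined = set()
    for ids in id_collections:
        combined.update(ids)
    return sorted(combined)


exact_ids = [3, 7]
fuzzy_ids = [7, 12]
semantic_ids = [12, 20, 21]

all_ids = combine_duplicate_ids(exact_ids, fuzzy_ids, semantic_ids)
print(all_ids)
```

Taking the union first means the dataset is rewritten once rather than once per method.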
For detailed implementation guides, see Exact Duplicate Removal, Fuzzy Duplicate Removal, and Semantic Deduplication.
Performance Considerations
GPU Acceleration
All deduplication workflows require GPU acceleration:
- Exact: Ray backend with GPU support for MD5 hashing operations
- Fuzzy: Ray backend with GPU support for MinHash computation and LSH operations
- Semantic: GPU required for embedding generation (transformer models), K-means clustering, and pairwise similarity computation
GPU acceleration provides significant speedup for large datasets through parallel processing.
Hardware Requirements
- GPU: Required for all workflows (Ray with GPU support for exact/fuzzy, GPU for semantic)
- Memory: GPU memory requirements scale with dataset size, batch sizes, and embedding dimensions
- Executors: Can use various executors (XennaExecutor, RayDataExecutor) with GPU support
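A back-of-the-envelope estimate shows how embedding storage scales with dataset size and embedding dimension. The figures below are hypothetical (100 million documents, 768-dimensional float32 vectors), and the estimate covers only the raw embedding matrix, not model weights or intermediate buffers.

```python
def embedding_memory_gib(num_documents, embedding_dim, bytes_per_value=4):
    """Size in GiB of a dense embedding matrix (float32 by default)."""
    return num_documents * embedding_dim * bytes_per_value / 1024 ** 3


# Hypothetical dataset: 100 million documents, 768-dimensional float32 embeddings.
estimate = embedding_memory_gib(100_000_000, 768)
print(f"embeddings: {estimate:.1f} GiB")
print(f"with 2x headroom: {2 * estimate:.1f} GiB")
```

Halving the precision to 2 bytes per value halves the estimate, which is one reason the embedding format matters when sizing a cluster.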
Backend Setup
For optimal performance with large datasets, configure Ray backend:
from nemo_curator.core.client import RayClient

client = RayClient(
    num_cpus=64,  # Adjust based on available cores
    num_gpus=4,  # Choose enough GPUs that total GPU memory is roughly 2x the size of the embeddings
)
client.start()
try:
    result = workflow.run()  # workflow: any deduplication workflow constructed as shown earlier
finally:
    client.stop()

For TB-scale datasets, consider distributed GPU clusters with Ray.
ID Generator for Large-Scale Operations
For large-scale duplicate removal, persist the ID Generator for consistent document tracking:
from nemo_curator.stages.deduplication.id_generator import (
    create_id_generator_actor,
    write_id_generator_to_disk,
    kill_id_generator_actor,
)
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

create_id_generator_actor()
# ... run the pipeline or workflow that assigns IDs here ...
id_generator_path = "semantic_id_generator.json"
write_id_generator_to_disk(id_generator_path)
kill_id_generator_actor()

# Use saved ID generator in removal workflow
# (input_path, duplicates_path, and output_path are placeholders you define)
removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path=input_path,
    ids_to_remove_path=duplicates_path,
    output_path=output_path,
    id_generator_path=id_generator_path,
    # ... other parameters
)

The ID Generator ensures consistent IDs across workflow stages.
Next Steps
Ready to use deduplication?
- New to deduplication: Start with Exact Duplicate Removal for the fastest approach
- Need near-duplicate detection: See Fuzzy Duplicate Removal for MinHash-based matching
- Require semantic matching: Explore Semantic Deduplication for meaning-based deduplication
For hands-on guidance: See Text Curation Tutorials for step-by-step examples.