Classifier-Based Filtering
Classifier-based filtering uses machine learning models to differentiate between high-quality and low-quality documents. NVIDIA NeMo Curator implements an approach similar to the one described in Brown et al., 2020, which trains a binary skip-gram classifier to distinguish between curated high-quality data and lower-quality data.
How It Works
Classifier-based filtering learns the characteristics of high-quality documents from training data, unlike heuristic filtering, which relies on predefined rules and thresholds. This approach is particularly effective when:
- You have a reference dataset of known high-quality documents
- The distinction between high and low quality is complex or subtle
- You want to filter based on domain-specific characteristics
NVIDIA NeMo Curator uses fastText for implementing classifier-based filtering, which offers excellent performance and scalability for text classification tasks.
The classifier-based filtering process involves:
- Preparing training data by sampling from high-quality and low-quality datasets
- Training a binary skip-gram classifier using fastText
- Using the trained model to score documents in your dataset
- Filtering documents based on the classifier scores, optionally using Pareto-based sampling
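The first two steps above can be sketched in plain Python. fastText's supervised mode reads one example per line, with the label written as a `__label__` prefix. The helper name `write_fasttext_training_file`, the `__label__lq` label, and the sample documents are illustrative assumptions rather than NeMo Curator APIs; only the `__label__hq` label appears elsewhere on this page.

```python
import random
import tempfile
from pathlib import Path


def write_fasttext_training_file(high_quality, low_quality, out_path, seed=42):
    """Write a shuffled fastText supervised training file (hypothetical helper).

    Each output line has the form "<label> <text>", which is the format
    fastText's supervised mode expects.
    """
    examples = [("__label__hq", doc) for doc in high_quality]
    examples += [("__label__lq", doc) for doc in low_quality]  # "lq" is an assumed label name

    # Shuffle so that the two classes are interleaved in the file.
    random.Random(seed).shuffle(examples)

    with open(out_path, "w", encoding="utf-8") as f:
        for label, text in examples:
            # fastText treats each line as one example, so collapse newlines and extra spaces.
            f.write(f"{label} {' '.join(text.split())}\n")
    return len(examples)


# Tiny made-up samples, for illustration only.
hq_docs = ["A well edited\nencyclopedia paragraph.", "A peer-reviewed abstract."]
lq_docs = ["click here!!! buy now", "lorem ipsum spam spam"]

train_path = Path(tempfile.mkdtemp()) / "quality.train"
n_written = write_fasttext_training_file(hq_docs, lq_docs, train_path)
lines = train_path.read_text(encoding="utf-8").splitlines()
```

If the `fasttext` Python package is installed, a file in this format can be passed to `fasttext.train_supervised(input=str(train_path))`, and the resulting model saved with `model.save_model("quality_classifier.bin")`.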
Usage
NeMo Curator provides two approaches for quality assessment:
- Classification: Use QualityClassifier to add quality predictions and optionally filter during classification
- Filtering: Use FastTextQualityFilter with ScoreFilter for document-level filtering with Pareto sampling
DeBERTa Quality Classification
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.classifiers import QualityClassifier
# Create pipeline with DeBERTa quality classifier
pipeline = Pipeline(name="deberta_quality_pipeline")
# Add stages
read_stage = JsonlReader("input_data/")
classify_stage = QualityClassifier(
    filter_by=["High"],  # Keep only high-quality documents
    model_inference_batch_size=256,
    max_chars=6000,  # Default value
)
write_stage = JsonlWriter("high_quality_output/")
pipeline.add_stage(read_stage)
pipeline.add_stage(classify_stage)
pipeline.add_stage(write_stage)
# Execute pipeline
results = pipeline.run()
FastText Quality Filter
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.filters import ScoreFilter
from nemo_curator.stages.text.filters.fasttext import FastTextQualityFilter
# Create pipeline with FastText filter (requires pre-trained model)
pipeline = Pipeline(name="fasttext_quality_pipeline")
# Add stages
read_stage = JsonlReader("input_data/")
filter_stage = ScoreFilter(
    FastTextQualityFilter(
        model_path="./quality_classifier.bin",  # Path to your fastText model
        label="__label__hq",  # High quality label
        alpha=3,  # Pareto distribution alpha parameter
        seed=42,  # Random seed for reproducibility
    ),
    text_field="text",
    score_field="quality_score",
)
write_stage = JsonlWriter("high_quality_output/")
pipeline.add_stage(read_stage)
pipeline.add_stage(filter_stage)
pipeline.add_stage(write_stage)
# Execute pipeline
results = pipeline.run()
Configuration
You can configure quality classifiers and filters with different parameters:
from nemo_curator.stages.text.classifiers import QualityClassifier
from nemo_curator.stages.text.filters.fasttext import FastTextQualityFilter
# DeBERTa quality classifier configurations
basic_deberta_classifier = QualityClassifier(
    filter_by=["High"],  # Keep only high-quality documents
    model_inference_batch_size=256,
    max_chars=6000,  # Default value
)
# More inclusive DeBERTa classifier
inclusive_deberta_classifier = QualityClassifier(
    filter_by=["Medium", "High"],  # Keep medium and high-quality documents
    model_inference_batch_size=128,
    max_chars=6000,
)
# FastText quality filter configurations
basic_fasttext_filter = FastTextQualityFilter(
    model_path="./quality_classifier.bin",
    label="__label__hq",  # High quality label
    alpha=3,  # Pareto distribution alpha parameter
    seed=42,  # Random seed for reproducibility
)
# More selective FastText filter
selective_fasttext_filter = FastTextQualityFilter(
    model_path="./quality_classifier.bin",
    label="__label__hq",
    alpha=5,  # Higher alpha for stricter filtering
    seed=42,
)
Quality Classifier and Filter Parameters
QualityClassifier (DeBERTa)
The QualityClassifier accepts the following parameters:
- filter_by (list, default=None): Quality levels to keep (options: "Low", "Medium", "High")
- model_inference_batch_size (int, default=256): Batch size for inference
- max_chars (int, default=6000): Maximum number of characters per document for processing
- label_field (str, default="quality_pred"): Name of the prediction column
- text_field (str, default="text"): Name of the text field in the input data
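The plain-Python sketch below illustrates what `max_chars`, `label_field`, and `filter_by` mean for a batch of records. It does not call NeMo Curator: `toy_predict` is a hypothetical stand-in for the DeBERTa model, and the behavior of `filter_by=None` (label every document, drop none) is an assumption inferred from the default rather than confirmed here.

```python
def toy_predict(text):
    """Hypothetical stand-in for the model: longer inputs score as higher quality."""
    if len(text) >= 40:
        return "High"
    if len(text) >= 15:
        return "Medium"
    return "Low"


def classify_and_filter(records, filter_by=None, max_chars=6000,
                        label_field="quality_pred", text_field="text"):
    """Add a prediction column, then keep only the requested quality levels."""
    kept = []
    for record in records:
        # Only the first max_chars characters are passed to the model.
        truncated = record[text_field][:max_chars]
        labeled = {**record, label_field: toy_predict(truncated)}
        # filter_by=None is assumed to mean "label everything, drop nothing".
        if filter_by is None or labeled[label_field] in filter_by:
            kept.append(labeled)
    return kept


records = [
    {"text": "short"},
    {"text": "a medium length line"},
    {"text": "a considerably longer document with many more words in it"},
]

all_labeled = classify_and_filter(records)
high_only = classify_and_filter(records, filter_by=["High"])
medium_and_high = classify_and_filter(records, filter_by=["Medium", "High"])
```

Note that a small `max_chars` changes what the model sees: in this toy setup, truncating every document to 10 characters makes all of them look "Low", so `filter_by=["High"]` would then keep nothing.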
FastTextQualityFilter
The FastTextQualityFilter accepts the following parameters:
- model_path (str, required): Path to the trained fastText model file
- label (str, default="__label__hq"): The label for high-quality documents
- alpha (float, default=3): Alpha parameter for Pareto distribution sampling
- seed (int, default=42): Random seed for reproducible sampling
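To build intuition for `alpha`, the sketch below implements the Pareto-based rule described in Brown et al., 2020: a document is kept when a Pareto-distributed draw exceeds `1 - score`. It assumes the score is the classifier's high-quality probability in [0, 1] and that the draw follows the Lomax form of the Pareto distribution (as NumPy's `random.pareto` does). Whether `FastTextQualityFilter` applies exactly this rule is an assumption here, and the function names are illustrative.

```python
import random


def keep_probability(score, alpha):
    """Closed-form chance of keeping a document with the given score.

    For a Lomax-distributed draw X, P(X > x) = (1 + x) ** -alpha, so with
    x = 1 - score the keep probability is (2 - score) ** -alpha.
    """
    return (2.0 - score) ** -alpha


def keep_document(score, alpha, rng):
    """Sample the keep/drop decision for one document."""
    # random.paretovariate returns values >= 1; subtracting 1 gives a Lomax draw.
    draw = rng.paretovariate(alpha) - 1.0
    return draw > 1.0 - score


rng = random.Random(42)
for alpha in (3, 5):
    for score in (0.1, 0.5, 0.9):
        kept = sum(keep_document(score, alpha, rng) for _ in range(20_000))
        print(f"alpha={alpha} score={score}: "
              f"expected {keep_probability(score, alpha):.3f}, "
              f"observed {kept / 20_000:.3f}")
```

Under these assumptions, a document with score 1.0 is kept with probability 1, while a document with score 0.0 is kept with probability `2 ** -alpha` (12.5% for alpha=3, about 3.1% for alpha=5). This is why a higher alpha filters more strictly while still letting some lower-scored documents through.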
Best Practices
For effective classifier-based filtering:
- Model selection: Start with the DeBERTa quality classifier for general use cases; consider fastText for high-throughput scenarios
- Validation: Manually review a sample of filtered results to confirm effectiveness
- Quality level tuning: Adjust filter_by levels (DeBERTa) or alpha values (fastText) based on your quality requirements
- Batch size optimization: Tune model_inference_batch_size for DeBERTa models based on your available memory
- Combination with heuristics: Consider using heuristic filters as a pre-filter to improve efficiency
- Domain adaptation: For specialized corpora, consider training custom models using domain-specific data
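For the validation practice, one lightweight option is to draw a reproducible random sample from the filtered output and read it by hand. The sketch below uses reservoir sampling so the output never has to fit in memory. The function name and the demo file are illustrative, and it assumes the writer produces `.jsonl` files with one JSON record per line; in practice you would point it at a directory such as `high_quality_output/`.

```python
import json
import random
import tempfile
from pathlib import Path


def sample_for_review(jsonl_dir, k=20, seed=42):
    """Reservoir-sample up to k records from every *.jsonl file in a directory."""
    rng = random.Random(seed)
    sample = []
    seen = 0
    for path in sorted(Path(jsonl_dir).glob("*.jsonl")):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue  # Skip blank lines
                record = json.loads(line)
                seen += 1
                if len(sample) < k:
                    sample.append(record)
                else:
                    # Replace an existing entry with probability k / seen.
                    j = rng.randrange(seen)
                    if j < k:
                        sample[j] = record
    return sample


# Demo on a small made-up file.
demo_dir = Path(tempfile.mkdtemp())
with open(demo_dir / "part-0.jsonl", "w", encoding="utf-8") as f:
    for i in range(100):
        f.write(json.dumps({"id": i, "text": f"document {i}"}) + "\n")

review_batch = sample_for_review(demo_dir, k=5)
for record in review_batch:
    print(record["id"], record["text"][:80])
```

Running the same sampler on the unfiltered input and comparing the two batches side by side gives a quick sense of what the filter removed.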