Quality Assessment for Audio Data
Filter audio samples by quality using transcription accuracy metrics, duration analysis, and custom quality measures to build high-quality speech datasets for ASR training.
How it Works
Audio quality assessment in NeMo Curator focuses on speech-specific metrics that correlate with training data quality:
- Transcription Accuracy: Word Error Rate (WER) and Character Error Rate (CER) between ground truth and ASR predictions
- Duration Analysis: Audio length validation and speech rate calculations
- Value-based Filtering: Configurable filtering using comparison operators
Quality Metrics
Word Error Rate (WER)
The primary metric for assessing transcription quality:
from nemo_curator.stages.audio.metrics.get_wer import GetPairwiseWerStage
# Calculate WER for each audio sample
wer_stage = GetPairwiseWerStage(
    text_key="text",            # Ground truth transcription
    pred_text_key="pred_text",  # ASR prediction
    wer_key="wer",              # Output WER field
)
WER counts the word-level edits (substitutions, deletions, and insertions) needed to turn the predicted transcription into the ground truth, expressed as a percentage of the ground-truth word count. Lower is better:
- WER = 0%: Perfect transcription match
- WER = 25%: Good quality (1 in 4 words incorrect)
- WER = 50%: Moderate quality
- WER > 75%: Poor quality (consider filtering)
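For reference, the thresholds above assume the standard textbook definition of WER, written out below. Here S, D, and I are the numbers of substituted, deleted, and inserted words, and N is the number of words in the ground truth. This is the general formula, not a quotation of the NeMo Curator source:

```latex
\mathrm{WER} = \frac{S + D + I}{N} \times 100\%
```

Because insertions are counted against N, WER can exceed 100% when the prediction is much longer than the ground truth.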
Character Error Rate (CER)
CER is a more granular accuracy measurement at the character level. The get_cer() function is a utility for calculating CER programmatically:
from nemo_curator.stages.audio.metrics.get_wer import get_cer
# Calculate CER between two strings (for testing/validation)
cer_value = get_cer("hello world", "helo world") # Returns 9.09
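To make the difference between the two metrics concrete, here is a minimal pure-Python sketch that computes both from the same edit-distance routine. The helpers `edit_distance` and `error_rate` are hypothetical names written for this illustration and are not part of NeMo Curator; the sketch assumes a non-empty reference and rounds to two decimal places.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence, hyp: Sequence) -> float:
    """Edit distance as a percentage of the reference length (ref must be non-empty)."""
    return round(100.0 * edit_distance(ref, hyp) / len(ref), 2)


reference, prediction = "hello world", "helo world"
wer = error_rate(reference.split(), prediction.split())  # word level
cer = error_rate(reference, prediction)                  # character level
print(wer, cer)  # 50.0 9.09
```

A single dropped letter makes one of the two words wrong (WER 50%) but affects only one of eleven characters (CER 9.09%), which is why CER is the gentler signal for near-miss transcriptions.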
Speech Rate Metrics
NeMo Curator provides utility functions for analyzing speaking speed and content density. These functions are designed for use in custom processing stages:
from nemo_curator.stages.audio.metrics.get_wer import get_wordrate, get_charrate
# Calculate words per second
word_rate = get_wordrate("hello world example", 2.5) # 1.2 words/second
# Calculate characters per second
char_rate = get_charrate("hello world", 2.0) # 5.5 chars/second
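One way such rates might be used in a custom stage is to flag entries whose speaking speed is implausible, which often points to a transcript that does not match its audio. The sketch below reimplements the two rates in plain Python so it runs without NeMo Curator; the entries, the `plausible_rate` field, and the `MIN_WPS`/`MAX_WPS` bounds are illustrative assumptions, not library defaults.

```python
def word_rate(text: str, duration: float) -> float:
    """Words per second, rounded to two decimal places."""
    return round(len(text.split()) / duration, 2)


def char_rate(text: str, duration: float) -> float:
    """Characters per second (spaces included), rounded to two decimal places."""
    return round(len(text) / duration, 2)


# Hypothetical manifest entries: transcript text plus duration in seconds.
entries = [
    {"text": "hello world example", "duration": 2.5},
    {"text": "hi", "duration": 12.0},
]

MIN_WPS, MAX_WPS = 0.5, 5.0  # illustrative bounds only

for entry in entries:
    entry["word_rate"] = word_rate(entry["text"], entry["duration"])
    entry["char_rate"] = char_rate(entry["text"], entry["duration"])
    # Flag entries whose speaking speed falls outside the assumed bounds.
    entry["plausible_rate"] = MIN_WPS <= entry["word_rate"] <= MAX_WPS

for entry in entries:
    print(entry["text"], entry["word_rate"], entry["plausible_rate"])
```

The second entry (one word over twelve seconds) falls below the lower bound and would be a candidate for removal or manual review.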
Filtering Strategies
WER-based Filtering
Filter audio samples based on transcription accuracy:
from nemo_curator.stages.audio.common import PreserveByValueStage
# Keep samples with WER <= 30% (high quality)
high_quality_filter = PreserveByValueStage(
    input_value_key="wer",
    target_value=30.0,
    operator="le",  # less than or equal
)
# Drop very poor samples: keep only entries with WER < 80%
poor_quality_filter = PreserveByValueStage(
    input_value_key="wer",
    target_value=80.0,
    operator="lt",  # less than
)
Duration-based Filtering
Filter by audio length to remove samples that are too short or too long:
from nemo_curator.stages.audio.common import PreserveByValueStage
# Keep samples between 1 and 30 seconds (apply both stages)
duration_min_filter = PreserveByValueStage(
    input_value_key="duration",
    target_value=1.0,
    operator="ge",  # greater than or equal
)
duration_max_filter = PreserveByValueStage(
    input_value_key="duration",
    target_value=30.0,
    operator="le",  # less than or equal
)
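The duration value being filtered is the audio length in seconds. For an uncompressed WAV file this is the frame count divided by the sample rate, which the sketch below computes with Python's standard `wave` module. This only illustrates the quantity; it is not a claim about how GetAudioDurationStage reads audio, and `wav_duration_seconds` is a hypothetical helper. The silent demo file is generated purely so the example is self-contained.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file: number of frames divided by sample rate."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


# Build a 2-second silent mono file purely for demonstration.
sample_rate = 16000
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)  # 16-bit samples
    wav_file.setframerate(sample_rate)
    wav_file.writeframes(b"\x00\x00" * sample_rate * 2)

duration = wav_duration_seconds(demo_path)
keep = 1.0 <= duration <= 30.0  # same bounds as the two filters above
print(duration, keep)  # 2.0 True
```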
Combined Quality Filtering
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.audio.metrics.get_wer import GetPairwiseWerStage
from nemo_curator.stages.audio.common import GetAudioDurationStage, PreserveByValueStage
# Create multi-stage quality pipeline
quality_pipeline = Pipeline(name="audio_quality_assessment")
# Calculate all metrics
quality_pipeline.add_stage(GetPairwiseWerStage())
quality_pipeline.add_stage(GetAudioDurationStage(
    audio_filepath_key="audio_filepath",
    duration_key="duration",
))
# Apply filters in sequence
filters = [
    PreserveByValueStage(input_value_key="wer", target_value=50.0, operator="le"),       # WER <= 50%
    PreserveByValueStage(input_value_key="duration", target_value=1.0, operator="ge"),   # Duration >= 1s
    PreserveByValueStage(input_value_key="duration", target_value=20.0, operator="le"),  # Duration <= 20s
]
for filter_stage in filters:
    quality_pipeline.add_stage(filter_stage)
Operator Options
The PreserveByValueStage supports several comparison operators:
| Operator | Description | Example Use Case |
|---|---|---|
"eq" | Equal to | Exact duration matching |
"ne" | Not equal to | Exclude specific values |
"lt" | Less than | Max thresholds |
"le" | Less than or equal | Quality thresholds |
"gt" | Greater than | Min thresholds |
"ge" | Greater than or equal | Min requirements |
Complete Quality Assessment Pipeline
The following example runs the full workflow end to end: it loads a dataset, runs ASR inference, computes quality metrics, applies the filters, and exports the results:
from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest import CreateInitialManifestFleursStage
from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage
from nemo_curator.stages.audio.metrics.get_wer import GetPairwiseWerStage
from nemo_curator.stages.audio.common import GetAudioDurationStage, PreserveByValueStage
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.resources import Resources
# Create complete quality assessment pipeline
pipeline = Pipeline(name="audio_quality_assessment")
# 1. Load data
pipeline.add_stage(CreateInitialManifestFleursStage(
    lang="hy_am",
    split="dev",
    raw_data_dir="./audio_data",
).with_(batch_size=4))
# 2. ASR inference
pipeline.add_stage(InferenceAsrNemoStage(
    model_name="nvidia/stt_hy_fastconformer_hybrid_large_pc",
).with_(resources=Resources(gpus=1.0)))
# 3. Calculate quality metrics
pipeline.add_stage(GetPairwiseWerStage())
pipeline.add_stage(GetAudioDurationStage(
    audio_filepath_key="audio_filepath",
    duration_key="duration",
))
# 4. Apply quality filters
pipeline.add_stage(PreserveByValueStage(
    input_value_key="wer",
    target_value=75.0,
    operator="le",  # Keep WER <= 75%
))
pipeline.add_stage(PreserveByValueStage(
    input_value_key="duration",
    target_value=1.0,
    operator="ge",  # Keep duration >= 1s
))
pipeline.add_stage(PreserveByValueStage(
    input_value_key="duration",
    target_value=30.0,
    operator="le",  # Keep duration <= 30s
))
# 5. Export high-quality results
pipeline.add_stage(AudioToDocumentStage())
pipeline.add_stage(JsonlWriter(path="./high_quality_audio"))
# Execute pipeline
executor = XennaExecutor()
pipeline.run(executor)
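After a run, it can be useful to check that the exported data actually respects the thresholds. The sketch below assumes the writer produces one or more `.jsonl` files in the output directory and that each record carries `wer` and `duration` fields; both are assumptions about the output layout, and `summarize` is a hypothetical helper. The demo runs on a small synthetic file so it is self-contained.

```python
import glob
import json
import os
import statistics
import tempfile


def summarize(output_dir: str) -> dict:
    """Collect WER and duration statistics from every *.jsonl file in a directory."""
    records = []
    for path in sorted(glob.glob(os.path.join(output_dir, "*.jsonl"))):
        with open(path, encoding="utf-8") as handle:
            records.extend(json.loads(line) for line in handle if line.strip())
    # statistics.mean raises on empty input, so an empty directory fails loudly.
    return {
        "count": len(records),
        "mean_wer": round(statistics.mean(r["wer"] for r in records), 2),
        "max_wer": max(r["wer"] for r in records),
        "total_hours": round(sum(r["duration"] for r in records) / 3600, 4),
    }


# Demo on synthetic records; for real output, pass the writer's path instead.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "part-0.jsonl"), "w", encoding="utf-8") as handle:
    handle.write(json.dumps({"wer": 10.0, "duration": 3.0}) + "\n")
    handle.write(json.dumps({"wer": 40.0, "duration": 6.0}) + "\n")

stats = summarize(demo_dir)
print(stats)
```

If `max_wer` exceeds the configured threshold, or `count` is unexpectedly small, the filter configuration or field names are worth rechecking before training on the data.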
Related Topics
- WER Filtering - Detailed guide to Word Error Rate filtering
- Duration Filtering - Audio length and speech rate filtering
- Audio Analysis - Audio file analysis and validation