Audio Quality Filtering
Curate raw audio into clean, single-speaker training segments using a configurable suite of preprocessing, segmentation, and filtering stages. Use these stages individually for targeted curation, or compose them through AudioDataFilterStage for an end-to-end pipeline driven by a single YAML config.
How it Works
A typical pipeline composes the following stages in order:
- Mono conversion normalizes channels and sample rate.
- Voice activity detection (VAD) splits each file into speech segments.
- Band filter drops segments that are not full-band (or not narrow-band, depending on the configured target).
- UTMOS drops segments whose predicted Mean Opinion Score (MOS) falls below a configured threshold.
- SIGMOS filters segments by per-dimension quality scores (noise, overall, signal, coloration, discontinuity, loudness, reverb).
- Segment concatenation merges surviving segments back together with configurable silence between them.
- Speaker separation diarizes the concatenated audio and fans out one task per speaker.
- Per-speaker filters rerun VAD/Band/UTMOS/SIGMOS on each speaker’s audio independently.
- Timestamp mapping projects final segment boundaries back to positions in the original input file.
Each stage is independently usable. Use AudioDataFilterStage to compose all of them with a single YAML config, or assemble a custom subset stage-by-stage.
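The last step in the list, timestamp mapping, can be illustrated with a small standalone sketch. It does not use or reproduce the library's TimestampMapperStage. The `Segment` class and both helper functions are hypothetical names invented for this illustration, and the sketch assumes the kept segments were concatenated in order with a fixed silence gap between them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical record of a kept segment, in seconds of the original file."""
    orig_start: float
    orig_end: float


def build_offsets(segments, silence_sec):
    """Return the start time of each segment on the concatenated timeline."""
    offsets = []
    cursor = 0.0
    for seg in segments:
        offsets.append(cursor)
        # Advance past this segment plus the silence inserted after it.
        cursor += (seg.orig_end - seg.orig_start) + silence_sec
    return offsets


def map_to_original(t_concat, segments, offsets):
    """Map a concatenated-timeline time back to the original file.

    Returns None when the time falls inside inserted silence or past the end.
    """
    for seg, offset in zip(segments, offsets):
        duration = seg.orig_end - seg.orig_start
        if offset <= t_concat <= offset + duration:
            return seg.orig_start + (t_concat - offset)
    return None


# Two segments survived filtering; they are joined with 0.5 s of silence.
kept = [Segment(1.0, 3.0), Segment(10.0, 12.5)]
offsets = build_offsets(kept, silence_sec=0.5)
```

With these inputs, a boundary at 3.0 s on the concatenated timeline lies 0.5 s into the second segment, so it maps to 10.5 s in the original file. A time of 2.25 s falls inside the inserted silence and has no original position.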
Pipeline Stages
- Channel normalization, segment merging, and original-file timestamp mapping. Tags: mono-conversion, concatenation, timestamp-mapper.
- VAD Segmentation: Split audio into speech segments using Silero VAD. Tags: silero, fan-out, configurable.
- Band Filter: Classify and filter audio by spectral bandwidth. Tags: full-band, narrow-band, sklearn.
- UTMOS Filter: Filter by predicted Mean Opinion Score using utmos22_strong. Tags: mos, torch-hub, no-reference.
- SIGMOS Filter: Filter by seven independent perceptual-quality dimensions. Tags: onnx, multi-dimensional, configurable.
- Speaker Separation: Diarize with offline or streaming SortFormer and fan out per speaker. Tags: sortformer, streaming, diarization.
- AudioDataFilterStage Composite: Single composite stage that decomposes into the full filtering pipeline from a YAML config. Tags: composite, yaml-config, end-to-end.
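To give a feel for what "spectral bandwidth" means in the Band Filter entry, here is a deliberately simple sketch. It is not the stage's method: the sklearn tag suggests the stage relies on a trained classifier, whereas this sketch swaps in a plain energy-ratio heuristic. The function names, the 8 kHz cutoff, and the 1% ratio are all assumptions chosen for illustration, and the naive DFT is only practical for short frames.

```python
import cmath
import math


def high_band_energy_ratio(samples, sample_rate, cutoff_hz=8000.0):
    """Fraction of spectral energy at or above cutoff_hz (naive DFT, stdlib only)."""
    n = len(samples)
    total = 0.0
    high = 0.0
    # A real signal is fully described by bins 1..n//2; bin 0 (DC) is skipped.
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples)
        )
        energy = abs(coeff) ** 2
        total += energy
        if k * sample_rate / n >= cutoff_hz:
            high += energy
    return high / total if total > 0 else 0.0


def classify_band(samples, sample_rate, cutoff_hz=8000.0, min_ratio=0.01):
    """Label a frame 'full_band' if enough energy sits above the cutoff."""
    ratio = high_band_energy_ratio(samples, sample_rate, cutoff_hz)
    return "full_band" if ratio >= min_ratio else "narrow_band"


def tone(freq_hz, sample_rate, n):
    """Generate n samples of a unit-amplitude sine wave."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


SR, N = 48000, 480  # one 10 ms frame at 48 kHz
low_only = tone(1000, SR, N)
wideband = [a + b for a, b in zip(tone(1000, SR, N), tone(12000, SR, N))]

print(classify_band(low_only, SR))
print(classify_band(wideband, SR))
```

The first frame contains only a 1 kHz tone and is labeled narrow-band; the second adds a 12 kHz component of equal amplitude, which puts about half of the energy above the cutoff. Real speech is far less clean than these synthetic tones, which is one reason a learned classifier is a more robust choice.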
Quick Example
A complete VAD + UTMOS + SIGMOS + speaker separation pipeline assembled stage-by-stage:
from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.audio.preprocessing.mono_conversion import MonoConversionStage
from nemo_curator.stages.audio.segmentation.vad_segmentation import VADSegmentationStage
from nemo_curator.stages.audio.filtering.utmos import UTMOSFilterStage
from nemo_curator.stages.audio.filtering.sigmos import SIGMOSFilterStage
from nemo_curator.stages.audio.preprocessing.concatenation import SegmentConcatenationStage
from nemo_curator.stages.audio.segmentation.speaker_separation import SpeakerSeparationStage
from nemo_curator.stages.audio.postprocessing.timestamp_mapper import TimestampMapperStage
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage
from nemo_curator.stages.text.io.writer import JsonlWriter

pipeline = Pipeline(name="audio_quality_filtering")

# 1. Normalize channels and sample rate
pipeline.add_stage(MonoConversionStage(output_sample_rate=48000))

# 2. Split into speech segments
pipeline.add_stage(VADSegmentationStage(min_duration_sec=2.0, threshold=0.5))

# 3. Filter by perceptual quality (drop segments with MOS < 3.5)
pipeline.add_stage(UTMOSFilterStage(mos_threshold=3.5))

# 4. Filter by SIGMOS noise + overall thresholds
pipeline.add_stage(SIGMOSFilterStage(noise_threshold=4.0, ovrl_threshold=3.5))

# 5. Concatenate surviving segments
pipeline.add_stage(SegmentConcatenationStage(silence_duration_sec=0.5))

# 6. Diarize and fan out per speaker
pipeline.add_stage(SpeakerSeparationStage())

# 7. Map final boundaries back to original file timestamps
pipeline.add_stage(TimestampMapperStage())

# 8. Export
pipeline.add_stage(AudioToDocumentStage())
pipeline.add_stage(JsonlWriter(path="./curated_audio"))

executor = XennaExecutor()
pipeline.run(executor)
For a YAML-driven equivalent, use AudioDataFilterStage — it expands into the same pipeline from a single configuration file.
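The threshold arguments in steps 3 and 4 are easier to reason about with a concrete case. The following standalone sketch does not call the library. It assumes a segment is kept only when every configured dimension scores at or above its threshold and that dimensions without a threshold are not checked; the actual stages may treat edge cases differently. The `passes_thresholds` helper, the score keys, and the score values are all made up for illustration.

```python
def passes_thresholds(scores, thresholds):
    """Keep a segment only if every configured dimension meets its minimum.

    Dimensions whose threshold is None are treated as unconfigured and skipped.
    """
    for name, minimum in thresholds.items():
        if minimum is None:
            continue
        if scores[name] < minimum:
            return False
    return True


# Hypothetical per-segment scores on a 1-5 scale (higher is better).
segments = [
    {"id": "a", "scores": {"noise": 4.3, "ovrl": 3.8}},
    {"id": "b", "scores": {"noise": 3.6, "ovrl": 3.9}},  # fails noise
    {"id": "c", "scores": {"noise": 4.5, "ovrl": 3.1}},  # fails overall
]

# Mirrors noise_threshold=4.0 and ovrl_threshold=3.5 from the example above.
thresholds = {"noise": 4.0, "ovrl": 3.5}

kept_ids = [s["id"] for s in segments if passes_thresholds(s["scores"], thresholds)]
print(kept_ids)
```

Under these assumptions only segment "a" survives, because a single failing dimension is enough to drop a segment. Tightening one threshold therefore can only shrink the kept set, which is worth keeping in mind when tuning several dimensions at once.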
Related Topics
- ReadSpeech Tutorial: end-to-end walkthrough of AudioDataFilterStage on the DNS Challenge ReadSpeech dataset.
- Quality Assessment: WER and duration filters for ASR-based curation.
- Audio Concepts: audio task model, manifests, and pipeline architecture.