Audio Quality Filtering

Curate raw audio into clean, single-speaker training segments using a configurable suite of preprocessing, segmentation, and filtering stages. Use these stages individually for targeted curation, or compose them through AudioDataFilterStage for an end-to-end pipeline driven by a single YAML config.

How it Works

A typical pipeline composes the following stages in order:

1. Mono conversion normalizes channels and sample rate.
2. Voice activity detection (VAD) splits each file into speech segments.
3. Band filter drops segments that are not full-band (or not narrow-band, depending on the configured target).
4. UTMOS filter drops segments that score below a perceived-quality threshold.
5. SIGMOS filter drops segments that fall below per-dimension quality thresholds (noise, overall, signal, coloration, discontinuity, loudness, reverb).
6. Segment concatenation merges surviving segments back together with configurable silence between them.
7. Speaker separation diarizes the concatenated audio and fans out one task per speaker.
8. Per-speaker filters rerun VAD/Band/UTMOS/SIGMOS on each speaker’s audio independently.
9. Timestamp mapping projects final segment boundaries back to positions in the original input file.

Each stage is independently usable. Use AudioDataFilterStage to compose all of them with a single YAML config, or assemble a custom subset stage-by-stage.
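The segment concatenation and timestamp mapping steps can be hard to picture, so here is a minimal sketch of the bookkeeping they imply. It is not the library's implementation: the `Segment` class and the `concat_offsets` and `map_to_original` helpers are hypothetical names introduced for illustration, and it assumes a fixed silence gap between kept segments.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds in the original file
    end: float    # seconds in the original file


def concat_offsets(segments, silence_sec):
    """Lay kept segments end to end, separated by silence_sec of silence.

    Returns a list of (concat_start, concat_end, segment) tuples.
    """
    layout = []
    cursor = 0.0
    for seg in segments:
        duration = seg.end - seg.start
        layout.append((cursor, cursor + duration, seg))
        cursor += duration + silence_sec
    return layout


def map_to_original(t, layout):
    """Map a time in the concatenated audio back to the original file.

    Returns None when t falls inside inserted silence.
    """
    for concat_start, concat_end, seg in layout:
        if concat_start <= t <= concat_end:
            return seg.start + (t - concat_start)
    return None


kept_segments = [Segment(1.0, 3.0), Segment(10.0, 12.5)]
layout = concat_offsets(kept_segments, silence_sec=0.5)

print(map_to_original(1.0, layout))   # 2.0 (inside the first segment)
print(map_to_original(3.0, layout))   # 10.5 (inside the second segment)
print(map_to_original(2.25, layout))  # None (inside the inserted silence)
```

Keeping the layout alongside the concatenated audio is what lets boundaries found after diarization be reported in the original file's time base.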

Pipeline Stages

Preprocessing Stages

Channel normalization, segment merging, and original-file timestamp mapping.
Tags: mono-conversion, concatenation, timestamp-mapper

VAD Segmentation

Split audio into speech segments using Silero VAD.
Tags: silero, fan-out, configurable
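To illustrate what a detection threshold and a minimum segment duration do, the sketch below groups per-frame speech probabilities into segments. It is a simplified stand-in, not Silero VAD's actual algorithm: the `speech_segments` helper and the probability list are hypothetical, and the parameter names merely mirror the `threshold` and `min_duration_sec` arguments used in the quick example on this page.

```python
def speech_segments(probs, frame_sec, threshold=0.5, min_duration_sec=2.0):
    """Group consecutive frames with prob >= threshold into (start, end) spans.

    Spans are returned in seconds; spans shorter than min_duration_sec are dropped.
    """
    spans = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i                      # speech begins
        elif p < threshold and start is not None:
            spans.append((start, i))       # speech ends before frame i
            start = None
    if start is not None:                  # speech runs to the end of the file
        spans.append((start, len(probs)))
    return [
        (s * frame_sec, e * frame_sec)
        for s, e in spans
        if (e - s) * frame_sec >= min_duration_sec
    ]


# Hypothetical per-frame speech probabilities, one frame per second.
probs = [0.1, 0.9, 0.8, 0.7, 0.2, 0.9, 0.1, 0.6, 0.9, 0.95]
vad_result = speech_segments(probs, frame_sec=1.0, threshold=0.5, min_duration_sec=2.0)
print(vad_result)  # [(1.0, 4.0), (7.0, 10.0)]
```

The one-frame burst at second 5 clears the threshold but is discarded by the minimum duration, which is the behavior a minimum-length setting is meant to provide.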

Band Filter

Classify and filter audio by spectral bandwidth.
Tags: full-band, narrow-band, sklearn

UTMOS Filter

Filter by predicted Mean Opinion Score using utmos22_strong.
Tags: mos, torch-hub, no-reference

SIGMOS Filter

Filter by seven independent perceptual-quality dimensions.
Tags: onnx, multi-dimensional, configurable
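Per-dimension filtering amounts to checking each configured threshold independently. The sketch below shows that logic under two assumptions: every score is a MOS-style value where higher is better, and dimensions without a configured threshold are ignored. The `passes_thresholds` helper, the dictionary keys, and the sample scores are hypothetical; the keys only echo the `noise_threshold` and `ovrl_threshold` arguments from the quick example.

```python
def passes_thresholds(scores, thresholds):
    """Return True when every configured dimension meets or exceeds its minimum."""
    for dim, minimum in thresholds.items():
        if minimum is None:
            continue  # no threshold configured for this dimension
        if scores[dim] < minimum:
            return False
    return True


# Hypothetical thresholds: only noise and overall are constrained.
thresholds = {"noise": 4.0, "ovrl": 3.5, "reverb": None}

# Hypothetical per-segment scores.
scored_segments = [
    {"id": "a", "scores": {"noise": 4.2, "ovrl": 3.9, "reverb": 2.0}},
    {"id": "b", "scores": {"noise": 3.1, "ovrl": 4.0, "reverb": 4.5}},
]

kept_ids = [s["id"] for s in scored_segments if passes_thresholds(s["scores"], thresholds)]
print(kept_ids)  # ['a']
```

Segment "b" has the better overall and reverb scores but fails on noise alone, which is the practical difference between per-dimension filtering and a single aggregate score.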

Speaker Separation

Diarize with offline or streaming SortFormer and fan out per speaker.
Tags: sortformer, streaming, diarization
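The fan-out step can be pictured as grouping diarization turns by speaker label, so that each speaker becomes its own downstream task. This is a rough sketch only: the `(speaker, start, end)` tuple format, the `fan_out_by_speaker` helper, and the task dictionary layout are assumptions made for illustration, not the output format of SortFormer or of the stage itself.

```python
from collections import defaultdict


def fan_out_by_speaker(turns):
    """Group (speaker, start, end) turns into one task dict per speaker."""
    by_speaker = defaultdict(list)
    for speaker, start, end in turns:
        by_speaker[speaker].append((start, end))
    # Sort speakers and their segments so the output order is deterministic.
    return [
        {"speaker": speaker, "segments": sorted(segments)}
        for speaker, segments in sorted(by_speaker.items())
    ]


# Hypothetical diarization output, in seconds.
turns = [("spk_1", 4.0, 6.5), ("spk_0", 0.0, 3.5), ("spk_0", 7.0, 9.0)]
speaker_tasks = fan_out_by_speaker(turns)

for task in speaker_tasks:
    print(task["speaker"], task["segments"])
```

Each resulting task can then be filtered independently, which is what makes the per-speaker rerun of the quality filters possible.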

AudioDataFilterStage Composite

Single composite stage that decomposes into the full filtering pipeline from a YAML config.
Tags: composite, yaml-config, end-to-end

Quick Example

A complete VAD + UTMOS + SIGMOS + speaker separation pipeline assembled stage-by-stage:

from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.audio.preprocessing.mono_conversion import MonoConversionStage
from nemo_curator.stages.audio.segmentation.vad_segmentation import VADSegmentationStage
from nemo_curator.stages.audio.filtering.utmos import UTMOSFilterStage
from nemo_curator.stages.audio.filtering.sigmos import SIGMOSFilterStage
from nemo_curator.stages.audio.preprocessing.concatenation import SegmentConcatenationStage
from nemo_curator.stages.audio.segmentation.speaker_separation import SpeakerSeparationStage
from nemo_curator.stages.audio.postprocessing.timestamp_mapper import TimestampMapperStage
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage
from nemo_curator.stages.text.io.writer import JsonlWriter

pipeline = Pipeline(name="audio_quality_filtering")

# 1. Normalize channels and sample rate
pipeline.add_stage(MonoConversionStage(output_sample_rate=48000))

# 2. Split into speech segments
pipeline.add_stage(VADSegmentationStage(min_duration_sec=2.0, threshold=0.5))

# 3. Filter by perceptual quality (drop segments with MOS < 3.5)
pipeline.add_stage(UTMOSFilterStage(mos_threshold=3.5))

# 4. Filter by SIGMOS noise + overall thresholds
pipeline.add_stage(SIGMOSFilterStage(noise_threshold=4.0, ovrl_threshold=3.5))

# 5. Concatenate surviving segments
pipeline.add_stage(SegmentConcatenationStage(silence_duration_sec=0.5))

# 6. Diarize and fan out per speaker
pipeline.add_stage(SpeakerSeparationStage())

# 7. Map final boundaries back to original file timestamps
pipeline.add_stage(TimestampMapperStage())

# 8. Export
pipeline.add_stage(AudioToDocumentStage())
pipeline.add_stage(JsonlWriter(path="./curated_audio"))

executor = XennaExecutor()
pipeline.run(executor)

For a YAML-driven equivalent, use AudioDataFilterStage, which expands into the same pipeline from a single configuration file.

Related Topics

ReadSpeech Tutorial: end-to-end walkthrough of AudioDataFilterStage on the DNS Challenge ReadSpeech dataset.
Quality Assessment: WER and duration filters for ASR-based curation.
Audio Concepts: audio task model, manifests, and pipeline architecture.
