UTMOS Filter
Filter audio segments based on their predicted Mean Opinion Score (MOS) using the utmos22_strong model. UTMOS is the primary perceptual-quality predictor in the audio quality-filtering pipeline.
Understanding UTMOS
What MOS Measures
Mean Opinion Score is a 1.0–5.0 perceptual-quality scale originally defined for human listening tests. UTMOS is a learned no-reference predictor that produces an MOS estimate directly from waveform input — no clean reference signal required, unlike PESQ or POLQA.
| MOS Range | Quality Level | Recommended Use |
|---|---|---|
| 4.0–5.0 | Excellent | High-quality TTS / voice cloning training data |
| 3.5–4.0 | Good | General ASR / TTS training (default threshold range) |
| 3.0–3.5 | Acceptable | Permissive thresholds for large web-scraped datasets |
| 2.0–3.0 | Poor | Review required; usually filtered out |
| < 2.0 | Bad | Strong candidate for removal |
A common starting point is mos_threshold=3.5, which drops obviously distorted, noisy, or clipped audio while keeping most usable training material.
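The table above can be read as a simple lookup. The following is a minimal sketch, not part of the library: the function name `mos_quality_band` is hypothetical, and because the table's ranges share their endpoints, the sketch assumes each lower bound is inclusive (so 3.5 counts as "Good").

```python
def mos_quality_band(mos: float) -> str:
    """Map a predicted MOS to the quality level listed in the table.

    Assumption: lower bounds are inclusive, upper bounds exclusive.
    """
    if mos >= 4.0:
        return "Excellent"
    if mos >= 3.5:
        return "Good"
    if mos >= 3.0:
        return "Acceptable"
    if mos >= 2.0:
        return "Poor"
    return "Bad"


for score in (4.2, 3.5, 3.2, 1.7):
    print(score, mos_quality_band(score))
```

With mos_threshold=3.5, only segments in the "Good" and "Excellent" bands survive.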
When to Use UTMOS vs SIGMOS
- UTMOS produces a single composite quality score. Use it as the first, cheap filter to drop obviously bad audio.
- SIGMOS produces seven independent dimension scores (noise, signal, reverb, etc.). Use it after UTMOS for fine-grained control over which kinds of degradation to allow.
In a typical pipeline both are stacked: UTMOS first as a coarse cut, SIGMOS second to enforce specific quality requirements.
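The coarse-then-fine ordering can be illustrated without either model. The sketch below is a generic cascade in plain Python, not the library's implementation: the scorer functions are stand-ins that read precomputed values, and the field names `mos` and `noise` are hypothetical. It shows that the finer check only runs on segments that survive the coarse cut.

```python
from typing import Callable, Dict, Iterable, Iterator

Segment = Dict[str, float]


def cascade_filter(
    segments: Iterable[Segment],
    coarse_score: Callable[[Segment], float],
    fine_check: Callable[[Segment], bool],
    coarse_threshold: float = 3.5,
) -> Iterator[Segment]:
    """Yield segments that pass a coarse score cut and then a finer check."""
    for segment in segments:
        if coarse_score(segment) < coarse_threshold:
            continue  # dropped by the coarse cut; the fine check never runs
        if fine_check(segment):
            yield segment


# Stand-in scorers reading precomputed values (hypothetical field names).
fine_calls = []


def composite_score(segment: Segment) -> float:
    return segment["mos"]


def low_noise(segment: Segment) -> bool:
    fine_calls.append(segment["id"])  # record which segments reach this stage
    return segment["noise"] >= 4.0


segments = [
    {"id": 1, "mos": 2.8, "noise": 4.5},
    {"id": 2, "mos": 3.9, "noise": 3.2},
    {"id": 3, "mos": 4.1, "noise": 4.4},
]
kept = list(cascade_filter(segments, composite_score, low_noise))
print([s["id"] for s in kept])  # [3]
print(fine_calls)               # [2, 3]
```

Segment 1 never reaches the finer check, which is where the saving comes from when the second stage is the more expensive one.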
Basic UTMOS Filtering
Step 1: Configure the Stage
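from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.audio.filtering.utmos import UTMOSFilterStage
pipeline = Pipeline(name="utmos_filtering")
# Keep segments whose predicted MOS is at least 3.5
utmos = UTMOSFilterStage(mos_threshold=3.5)
pipeline.add_stage(utmos)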
The stage accepts either an in-memory waveform (waveform + sample_rate) or a path (audio_filepath). Multi-channel input is automatically converted to mono, and any sample rate is resampled to 16 kHz before scoring.
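To make the two preprocessing steps concrete, here is a deliberately naive sketch in plain Python. It is not the stage's implementation: the function names are hypothetical, the downmix is a plain per-sample average, and the resampler uses linear interpolation with no anti-aliasing filter, which a production resampler would include.

```python
from typing import List, Sequence


def downmix_to_mono(channels: Sequence[Sequence[float]]) -> List[float]:
    """Average equally long channels sample by sample."""
    if not channels:
        raise ValueError("at least one channel is required")
    length = len(channels[0])
    if any(len(channel) != length for channel in channels):
        raise ValueError("all channels must have the same length")
    return [sum(frame) / len(channels) for frame in zip(*channels)]


def resample_linear(
    samples: Sequence[float], src_rate: int, dst_rate: int = 16000
) -> List[float]:
    """Naive linear-interpolation resampler (illustration only)."""
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate  # source samples advanced per output sample
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = min(i * step, last)
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


stereo = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
print(downmix_to_mono(stereo))  # [0.5, 0.5, 1.0]

one_ms_at_48k = [float(i) for i in range(48)]
print(len(resample_linear(one_ms_at_48k, src_rate=48000)))  # 16
```

The practical consequence is that the score always reflects a mono, 16 kHz view of the audio, regardless of how the segment is stored.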
Step 2: Inspect the MOS Distribution Before Filtering
For unfamiliar datasets, run UTMOS in score-only mode first by setting mos_threshold=None:
# Score every segment without filtering
pipeline.add_stage(UTMOSFilterStage(mos_threshold=None))
Export the resulting manifest with AudioToDocumentStage + JsonlWriter, then plot the utmos_mos distribution (in pandas, numpy, or your preferred tool) before choosing a real threshold. This avoids over-filtering datasets that are systematically lower-quality than UTMOS’s training distribution.
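One way to read the exported scores with only the standard library is sketched below. It assumes each JSONL line is a JSON object carrying a `utmos_mos` field, as described above; the helper names and the inline sample records are hypothetical, and in practice the lines would come from the files the writer produced.

```python
import json
from statistics import quantiles
from typing import Dict, Iterable, List, Sequence


def load_scores(lines: Iterable[str], field: str = "utmos_mos") -> List[float]:
    """Collect the score field from JSONL lines, skipping blanks and missing values."""
    scores = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        value = json.loads(line).get(field)
        if value is not None:
            scores.append(float(value))
    return scores


def retention_by_threshold(
    scores: Sequence[float], thresholds: Sequence[float] = (3.0, 3.5, 4.0)
) -> Dict[float, float]:
    """Fraction of segments that would be kept at each candidate threshold."""
    if not scores:
        raise ValueError("no scores to summarize")
    total = len(scores)
    return {t: sum(s >= t for s in scores) / total for t in thresholds}


# Hypothetical manifest lines standing in for an exported file.
sample_lines = [
    '{"audio_filepath": "a.wav", "utmos_mos": 4.2}',
    '{"audio_filepath": "b.wav", "utmos_mos": 3.6}',
    '{"audio_filepath": "c.wav", "utmos_mos": 3.1}',
    '{"audio_filepath": "d.wav", "utmos_mos": 2.4}',
]
scores = load_scores(sample_lines)
retention = retention_by_threshold(scores)
print(retention)  # {3.0: 0.75, 3.5: 0.5, 4.0: 0.25}
print(quantiles(scores, n=4))  # quartile cut points of the score distribution
```

A retention table like this shows directly how much data each candidate threshold would cost before any segment is actually dropped.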
Step 3: Apply the Tuned Threshold
pipeline.add_stage(UTMOSFilterStage(mos_threshold=3.5))
Segments with predicted MOS below mos_threshold are dropped; segments at or above the threshold pass through unchanged.
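The keep-or-drop rule, including the score-only mode, reduces to a single comparison. The following sketch restates the behavior described above in plain Python; `passes_mos_threshold` is a hypothetical helper, not a function exposed by the stage.

```python
from typing import Optional


def passes_mos_threshold(mos: float, mos_threshold: Optional[float] = 3.5) -> bool:
    """Return True when a segment would be kept.

    A threshold of None means score-only mode: nothing is dropped.
    """
    if mos_threshold is None:
        return True
    return mos >= mos_threshold  # a score exactly at the threshold is kept


print(passes_mos_threshold(3.5))        # True
print(passes_mos_threshold(3.49))       # False
print(passes_mos_threshold(1.2, None))  # True
```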
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| mos_threshold | float or None | 3.5 | Minimum MOS to keep. Set to None to score without filtering (useful for distribution analysis). |
| sample_rate | int | 16000 | Target sample rate for UTMOS inference. The model is trained at 16 kHz; do not change unless you have a custom checkpoint. |
The default resource allocation is Resources(cpus=1.0, gpus=0.5). UTMOS is small; fractional-GPU allocation lets it share a device with other inference stages.
Behavior Notes
- Model fetch: the model is downloaded via torch.hub from tarepan/SpeechMOS:v1.2.0 on first use.
- Offline environments: if torch.hub access is unavailable, the stage logs the error and passes the input through unchanged. Pre-cache the model in an air-gapped environment by setting the TORCH_HOME environment variable.
- Multi-channel handling: stereo and multi-channel input is converted to mono internally before scoring; you do not need to insert MonoConversionStage solely for UTMOS.
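A possible way to pre-populate the cache on a machine that does have network access is sketched below. It rests on two assumptions: that the hub entry point is named `utmos22_strong`, matching the model name used on this page, and that the stage reads from the same torch.hub cache. The helper names and the cache location are hypothetical.

```python
import os
from pathlib import Path


def configure_torch_home(cache_dir: str) -> Path:
    """Create the cache directory and point TORCH_HOME at it."""
    path = Path(cache_dir).expanduser().resolve()
    path.mkdir(parents=True, exist_ok=True)
    os.environ["TORCH_HOME"] = str(path)
    return path


def precache_utmos(cache_dir: str):
    """Download the hub repo and weights into cache_dir (requires network)."""
    configure_torch_home(cache_dir)
    import torch  # imported after TORCH_HOME is set

    # Assumption: "utmos22_strong" is the entry point exposed by the repo.
    return torch.hub.load(
        "tarepan/SpeechMOS:v1.2.0", "utmos22_strong", trust_repo=True
    )
```

Calling `precache_utmos("/path/to/torch_cache")` once on a connected machine, copying that directory to the air-gapped host, and exporting TORCH_HOME to the copied location should let torch.hub find the files locally instead of reaching for the network.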
Domain-Specific Tuning
Voice Cloning / TTS
TTS training quality is sensitive to background noise, breath sounds, and clipping. Use a strict threshold:
UTMOSFilterStage(mos_threshold=4.0)
General ASR
ASR is more robust to mild quality degradation than TTS, so the default threshold works well:
UTMOSFilterStage(mos_threshold=3.5)
Web-Scraped Audio (Permissive)
Web crawls often have systematically lower audio quality. Lowering the threshold preserves more data; pair with stricter SIGMOS thresholds for targeted dimensions:
UTMOSFilterStage(mos_threshold=3.0)
# Then SIGMOSFilterStage(noise_threshold=4.0, ovrl_threshold=3.0) downstream
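If several pipelines share these settings, the three starting points can live in one place. This is only a convenience sketch: the preset keys and the `threshold_for` helper are hypothetical, and the values are the ones suggested in this section, to be revised against your own score distribution.

```python
# Starting thresholds from this section, keyed by intended use (hypothetical keys).
MOS_THRESHOLD_PRESETS = {
    "tts": 4.0,  # voice cloning / TTS: strict
    "asr": 3.5,  # general ASR: default
    "web": 3.0,  # web-scraped audio: permissive
}


def threshold_for(use_case: str) -> float:
    """Look up the suggested starting threshold for a use case."""
    try:
        return MOS_THRESHOLD_PRESETS[use_case.lower()]
    except KeyError:
        known = ", ".join(sorted(MOS_THRESHOLD_PRESETS))
        raise ValueError(
            f"unknown use case {use_case!r}; expected one of: {known}"
        ) from None


print(threshold_for("TTS"))  # 4.0
```

The result can then be passed straight through, for example UTMOSFilterStage(mos_threshold=threshold_for("asr")).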
Complete UTMOS Pipeline Example
from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.audio.preprocessing.mono_conversion import MonoConversionStage
from nemo_curator.stages.audio.segmentation.vad_segmentation import VADSegmentationStage
from nemo_curator.stages.audio.filtering.utmos import UTMOSFilterStage
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage
from nemo_curator.stages.text.io.writer import JsonlWriter
pipeline = Pipeline(name="utmos_filtering")
# 1. Normalize input
pipeline.add_stage(MonoConversionStage(output_sample_rate=48000))
# 2. Segment into speech chunks
pipeline.add_stage(VADSegmentationStage(min_duration_sec=2.0))
# 3. Filter by UTMOS (drop MOS < 3.5)
pipeline.add_stage(UTMOSFilterStage(mos_threshold=3.5))
# 4. Export filtered manifest
pipeline.add_stage(AudioToDocumentStage())
pipeline.add_stage(JsonlWriter(path="./utmos_filtered"))
executor = XennaExecutor()
pipeline.run(executor)
Best Practices
- Inspect before filtering: always run with mos_threshold=None first on a representative sample. Pick the threshold from the actual distribution, not from the table above.
- Stack UTMOS before SIGMOS: UTMOS is cheaper than SIGMOS (single score vs seven dimensions). Run UTMOS first as a coarse cut, then SIGMOS for fine-grained dimension filtering.
- Match threshold to downstream model: TTS (4.0+), ASR (3.5), permissive curation (3.0). The expected use of the data dictates the threshold.
- Don’t change sample_rate: the UTMOS model is trained at 16 kHz. Override only with a custom checkpoint trained at a different rate.
Related Topics
- SIGMOS Filter: independent perceptual-quality dimensions; commonly stacked after UTMOS.
- VAD Segmentation: typical upstream stage producing the segments UTMOS scores.
- AudioDataFilterStage Composite: bundles UTMOS into the standard pipeline.