NVIDIANeMo Curator
Menu

VAD Segmentation

Split a continuous audio waveform into discrete speech segments using Silero VAD. Voice activity detection is the first transformative step in most audio curation pipelines — every downstream filter operates on segments, not whole files.

Understanding Voice Activity Detection

What VAD Does

A VAD model classifies short audio frames (typically 30 ms) as either speech or non-speech. VADSegmentationStage runs the classifier across the input waveform, groups contiguous speech frames into segments, and emits one AudioTask per detected segment. Each emitted task carries:

  • start_ms / end_ms — segment boundaries in the original file
  • segment_num — sequential index of the segment
  • duration_sec — segment length
  • waveform — the segment’s torch tensor
  • A PyDub AudioSegment for visualization or export

This fan-out behavior means downstream stages score each segment independently, and individual bad segments can be dropped without losing the rest of the file.

Threshold Guidelines

Silero VAD produces a confidence score from 0.0 to 1.0 for each frame. The threshold parameter controls how confident the model must be before classifying a frame as speech. The following table provides starting points; tune based on your dataset:

ThresholdSpeech RecallUse Case
0.3High (lenient)Noisy field recordings, quiet speakers, podcast audio
0.5Balanced (default)General-purpose curation, clean studio audio
0.7StrictHigh-precision curation; recall a single speaker against background chatter

Lower thresholds keep more borderline audio (potentially more false positives); higher thresholds keep only confident speech (potentially missing quieter passages).

Segment Length

min_duration_sec and max_duration_sec define the acceptable segment-length window:

  • Segments shorter than min_duration_sec are dropped (typically too short to contain useful content).
  • Segments longer than max_duration_sec are split (downstream models often have context-length limits).

Typical training-segment durations: 2–60 seconds for ASR, 5–30 seconds for TTS, 10–120 seconds for ALM.

Basic VAD Segmentation

Step 1: Configure the Stage

from nemo_curator.stages.audio.segmentation.vad_segmentation import VADSegmentationStage

vad = VADSegmentationStage(
    min_duration_sec=2.0,
    max_duration_sec=60.0,
    threshold=0.5,
    speech_pad_ms=300,
)

pipeline.add_stage(vad)

The default resource allocation is Resources(cpus=1.0, gpus=0.0). Silero VAD is small enough that fractional-GPU sharing works well; set gpus=0.1 on the stage’s Resources if you have GPU headroom.

Prerequisites: each input AudioTask must carry a waveform and sample_rate field. Place a MonoConversionStage upstream to guarantee both.

Step 2: Tune the Padding

speech_pad_ms adds padding to either side of each detected segment. Without padding, segments often clip the leading and trailing phonemes of an utterance. Defaults:

VADSegmentationStage(
    speech_pad_ms=300,  # 300 ms each side — preserves natural breathing and onsets
)

For training data going to TTS, increase to 400–500 ms to keep more natural prosody. For ASR transcription work, the default 300 ms is usually sufficient.

Parameters

ParameterTypeDefaultDescription
min_interval_msint500Minimum silence interval (in ms) between distinct speech segments before they merge.
min_duration_secfloat2.0Drop segments shorter than this duration.
max_duration_secfloat60.0Cap segment duration; longer segments are split.
thresholdfloat0.5Silero VAD confidence threshold (0.0–1.0). See Threshold Guidelines.
speech_pad_msint300Padding (in ms) added to either side of each detected segment.
waveform_keystr"waveform"Input task key containing the waveform tensor.
sample_rate_keystr"sample_rate"Input task key containing the sample rate.
nestedboolFalseWhen True, segments inherit the parent task’s mappings field for nested re-segmentation (used in per-speaker filter chains).

Domain-Specific Tuning

Conversational Audio

Conversational speech includes overlapping speakers, disfluencies, and back-channel responses (“uh-huh”). Keep more of it:

VADSegmentationStage(
    min_duration_sec=1.0,    # keep short backchannels
    threshold=0.4,           # lenient — accept marginal speech
    speech_pad_ms=400,       # extra padding for natural turns
)

Studio / Read-Speech Audio

Studio audio has cleanly separated utterances and minimal noise. Use stricter parameters to avoid splitting on internal pauses:

VADSegmentationStage(
    min_duration_sec=2.0,
    threshold=0.6,           # strict — only confident speech
    min_interval_ms=800,     # don't split on short pauses inside a sentence
)

Long-Form Audio (Podcasts, Audiobooks)

Long files benefit from larger max_duration_sec so individual sentences and paragraphs aren’t fragmented:

VADSegmentationStage(
    min_duration_sec=3.0,
    max_duration_sec=30.0,   # natural sentence-paragraph length
    threshold=0.5,
)

Complete VAD Pipeline Example

A minimum-viable pipeline that loads audio, segments it, and writes a manifest of segments:

from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.audio.preprocessing.mono_conversion import MonoConversionStage
from nemo_curator.stages.audio.segmentation.vad_segmentation import VADSegmentationStage
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage
from nemo_curator.stages.text.io.writer import JsonlWriter

pipeline = Pipeline(name="vad_segmentation")

# 1. Normalize input audio
pipeline.add_stage(MonoConversionStage(output_sample_rate=48000))

# 2. Segment into speech chunks
pipeline.add_stage(
    VADSegmentationStage(
        min_duration_sec=2.0,
        max_duration_sec=60.0,
        threshold=0.5,
        speech_pad_ms=300,
    )
)

# 3. Export the segment manifest
pipeline.add_stage(AudioToDocumentStage())
pipeline.add_stage(JsonlWriter(path="./segments"))

executor = XennaExecutor()
pipeline.run(executor)

Best Practices

  • Start with defaults: threshold=0.5 and min_duration_sec=2.0 cover most use cases. Tune only after inspecting a sample of output segments.
  • Pair with quality filters: VAD alone keeps anything that sounds like speech, including low-quality, noisy, or distorted segments. Chain a UTMOS filter and/or SIGMOS filter after VAD to drop low-quality segments.
  • Use nested=True for per-speaker pipelines: when running VAD again on speaker-separated audio downstream of Speaker Separation, set nested=True so the inner VAD inherits the outer segment’s mappings chain — required for correct timestamp resolution at the end.
  • Inspect distributions before filtering aggressively: export a sample manifest with VAD only, then plot the distribution of duration_sec across segments. Use that distribution to choose realistic min_duration_sec and max_duration_sec for your data.