Speaker Separation

Diarize multi-speaker audio and fan out one task per detected speaker so that downstream stages can score each speaker’s audio independently. NeMo Curator ships two diarization stages built on NVIDIA’s SortFormer family. Both target up to 4 speakers per file; choose based on whether your workload is offline batch curation or streaming/online.

Understanding Diarization

What Diarization Does

Diarization answers “who spoke when?” — it segments an audio stream into per-speaker regions, identifying that speaker A talks 0.0–3.5s, speaker B talks 3.5–7.0s, speaker A returns at 7.0–9.0s, and so on. The output is one AudioTask per speaker, each containing only that speaker’s audio.

This unlocks per-speaker filtering: pipelines can rerun VAD, UTMOS, SIGMOS, and the band filter separately on each speaker, dropping individual low-quality speakers without losing the rest of the recording.
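
To make the "who spoke when" idea concrete, the following is a minimal sketch in plain Python that groups the example turns above by speaker and slices a waveform accordingly. The `Turn` tuple, the `split_by_speaker` helper, and the toy 10 Hz sample rate are illustrative assumptions, not NeMo Curator APIs; the actual stages work on torch tensors carried by AudioTask objects.

```python
from collections import defaultdict
from typing import NamedTuple


class Turn(NamedTuple):
    """One diarized region (hypothetical structure, for illustration only)."""
    start: float  # seconds
    end: float    # seconds
    speaker: str


def split_by_speaker(
    samples: list[float], sample_rate: int, turns: list[Turn]
) -> dict[str, list[float]]:
    """Concatenate each speaker's regions into one per-speaker sample list."""
    per_speaker: dict[str, list[float]] = defaultdict(list)
    for turn in sorted(turns, key=lambda t: t.start):
        lo = round(turn.start * sample_rate)
        hi = round(turn.end * sample_rate)
        per_speaker[turn.speaker].extend(samples[lo:hi])
    return dict(per_speaker)


# The timeline from the text: A 0.0-3.5 s, B 3.5-7.0 s, A 7.0-9.0 s.
turns = [Turn(0.0, 3.5, "A"), Turn(3.5, 7.0, "B"), Turn(7.0, 9.0, "A")]
sample_rate = 10  # toy rate so the numbers stay readable
samples = [float(i) for i in range(9 * sample_rate)]

audio = split_by_speaker(samples, sample_rate, turns)
durations = {spk: len(chunk) / sample_rate for spk, chunk in audio.items()}
print(durations)
```

Each value in `audio` plays the role of one per-speaker waveform, which is what lets a later filter score speaker A and speaker B separately.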

Choosing a Stage

Stage	Model	Best For
`SpeakerSeparationStage`	`nvidia/diar_sortformer_4spk-v1` (offline)	Bulk offline curation. Used inside `AudioDataFilterStage`. Higher accuracy because it sees the whole utterance.
`InferenceSortformerStage`	`nvidia/diar_streaming_sortformer_4spk-v2.1` (streaming)	Online/chunked workloads with bounded latency. Supports RTTM output for downstream tools.

For most curation pipelines, SpeakerSeparationStage (offline) is the right choice. Use the streaming variant only when you need bounded latency or RTTM output.

Offline Speaker Separation

Step 1: Configure the Stage

from nemo_curator.stages.audio.segmentation.speaker_separation import SpeakerSeparationStage

speaker_sep = SpeakerSeparationStage(
    model_path="nvidia/diar_sortformer_4spk-v1",
    exclude_overlaps=True,
    min_duration=0.8,
    gap_threshold=0.1,
    buffer_time=0.5,
)
# `pipeline` is an existing nemo_curator.pipeline.Pipeline instance
pipeline.add_stage(speaker_sep)

The stage produces a fan-out list of AudioTask objects, one per detected speaker, each carrying:

speaker_id — speaker identifier (0, 1, 2, …)
num_speakers — total speakers found in this file
duration_sec — duration of this speaker’s audio
waveform — that speaker’s torch tensor with overlapping regions removed (when exclude_overlaps=True)

GPU is required (Resources(cpus=1.0, gpus=1.0) by default).
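
The overlap removal mentioned for the `waveform` field can be pictured as interval subtraction: each speaker keeps only the time ranges where nobody else is talking. The sketch below is an illustration of that idea under assumed inputs (lists of `(start, end)` tuples in seconds); `subtract_overlaps` is a hypothetical helper, not the stage's actual implementation.

```python
Interval = tuple[float, float]


def subtract_overlaps(own: list[Interval], others: list[Interval]) -> list[Interval]:
    """Return the parts of `own` that are not covered by any interval in `others`."""
    result: list[Interval] = []
    for start, end in own:
        cursor = start  # left edge of the part of this interval still unprocessed
        for o_start, o_end in sorted(others):
            if o_end <= cursor or o_start >= end:
                continue  # no intersection with what is left of this interval
            if o_start > cursor:
                result.append((cursor, o_start))  # clean part before the overlap
            cursor = max(cursor, o_end)  # skip past the overlapped region
            if cursor >= end:
                break
        if cursor < end:
            result.append((cursor, end))  # clean tail after the last overlap
    return result


# Hypothetical diarization output: speaker 1 talks over both of speaker 0's turns.
speaker_0 = [(0.0, 4.0), (6.0, 9.0)]
speaker_1 = [(3.0, 6.5)]

clean_0 = subtract_overlaps(speaker_0, speaker_1)
clean_1 = subtract_overlaps(speaker_1, speaker_0)
print(clean_0, clean_1)
```

With `exclude_overlaps=False`, the equivalent of this step would be skipped and each speaker would keep the full original intervals.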

Step 2: Tune Overlap and Gap Handling

Speaker overlap regions (where multiple speakers talk simultaneously) and short gaps between same-speaker turns affect output quality:

Parameter	Effect
`exclude_overlaps=True` (default)	Drops overlapping regions. Better for clean per-speaker training data.
`exclude_overlaps=False`	Keeps overlapping regions in each speaker’s audio. Useful when you want to preserve natural conversation.
`gap_threshold=0.1` (default)	Merges same-speaker turns separated by less than 100 ms. Increase to 0.3–0.5 for more aggressive merging when the diarization output is fragmented.
`min_duration=0.8` (default)	Drops speakers whose total audio is shorter than 0.8 seconds. Filters out spurious speaker detections.
`buffer_time=0.5` (default)	Buffer (in seconds) added around each merged speaker segment to avoid clipping turn boundaries.
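
The interaction of `gap_threshold`, `min_duration`, and `buffer_time` can be sketched in a few lines of plain Python. This is only an illustration of the behavior described in the table: the helper names, the `(start, end)` tuple layout, and the order in which the steps are applied are assumptions, and the stage's internal logic may differ.

```python
Segment = tuple[float, float]


def merge_turns(turns: list[Segment], gap_threshold: float = 0.1) -> list[Segment]:
    """Merge same-speaker turns whose gap is below `gap_threshold` seconds."""
    merged: list[Segment] = []
    for start, end in sorted(turns):
        if merged and start - merged[-1][1] < gap_threshold:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def pad_segments(
    segments: list[Segment], buffer_time: float = 0.5, audio_len: float = float("inf")
) -> list[Segment]:
    """Extend each segment by `buffer_time` on both sides, clamped to the file."""
    return [(max(0.0, s - buffer_time), min(audio_len, e + buffer_time)) for s, e in segments]


def keep_speakers(
    turns_by_speaker: dict[int, list[Segment]],
    min_duration: float = 0.8,
    gap_threshold: float = 0.1,
) -> dict[int, list[Segment]]:
    """Merge each speaker's turns, then drop speakers with too little total audio."""
    kept: dict[int, list[Segment]] = {}
    for speaker, turns in turns_by_speaker.items():
        merged = merge_turns(turns, gap_threshold)
        total = sum(end - start for start, end in merged)
        if total >= min_duration:
            kept[speaker] = merged
    return kept


# Hypothetical diarization output for a 6-second file.
turns_by_speaker = {
    0: [(1.0, 2.0), (2.05, 3.0), (5.0, 5.5)],  # 50 ms gap between the first two turns
    1: [(3.5, 4.0)],                            # only 0.5 s in total
}

kept = keep_speakers(turns_by_speaker)
padded = pad_segments(kept[0], buffer_time=0.5, audio_len=6.0)
print(kept, padded)
```

In this toy input, speaker 1 falls below the 0.8-second minimum and is dropped, while speaker 0's first two turns are merged because the gap between them is under 100 ms.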

`SpeakerSeparationStage` Parameters

Parameter	Type	Default	Description
`model_path`	str	`"nvidia/diar_sortformer_4spk-v1"`	Hugging Face model ID or path to a `.nemo` checkpoint.
`exclude_overlaps`	bool	`True`	Drop regions where multiple speakers overlap.
`min_duration`	float	`0.8`	Minimum per-speaker segment duration (seconds).
`gap_threshold`	float	`0.1`	Maximum gap (seconds) between adjacent same-speaker segments for them to be merged.
`buffer_time`	float	`0.5`	Buffer (seconds) added around each merged speaker segment.

Streaming Speaker Diarization

When to Use Streaming

The streaming variant (InferenceSortformerStage) is purpose-built for two use cases:

Online / chunked workloads — bounded latency requirements that can’t tolerate waiting for the full utterance.
RTTM output — downstream tooling (Kaldi, ESPnet, evaluation harnesses) consumes RTTM-format diarization output.
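
For readers unfamiliar with RTTM, each line describes one speaker turn using ten space-separated fields: type, file ID, channel, onset, duration, and speaker name, with `<NA>` in the unused slots. The sketch below formats and parses such lines; `to_rttm_line` and `parse_rttm_line` are hypothetical helpers, and the file ID and speaker label shown are made-up examples rather than the exact names the stage writes.

```python
def to_rttm_line(file_id: str, start: float, end: float, speaker: str, channel: int = 1) -> str:
    """Format one speaker turn as an RTTM SPEAKER line (onset and duration in seconds)."""
    duration = end - start
    return (
        f"SPEAKER {file_id} {channel} {start:.3f} {duration:.3f} "
        f"<NA> <NA> {speaker} <NA> <NA>"
    )


def parse_rttm_line(line: str) -> dict:
    """Parse one RTTM SPEAKER line back into a start/end/speaker record."""
    fields = line.split()
    if len(fields) != 10 or fields[0] != "SPEAKER":
        raise ValueError(f"not an RTTM SPEAKER line: {line!r}")
    onset, duration = float(fields[3]), float(fields[4])
    return {
        "file_id": fields[1],
        "start": onset,
        "end": onset + duration,
        "speaker": fields[7],
    }


line = to_rttm_line("call_001", 3.5, 7.0, "speaker_1")
turn = parse_rttm_line(line)
print(line)
print(turn)
```

Because the format stores onset and duration rather than start and end, the parser reconstructs the end time by addition.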

For pure offline curation, SpeakerSeparationStage is faster and more accurate.

Step 1: Configure the Stage

from nemo_curator.stages.audio.inference.speaker_diarization.sortformer import InferenceSortformerStage

streaming = InferenceSortformerStage(
    model_name="nvidia/diar_streaming_sortformer_4spk-v2.1",
    rttm_out_dir="./rttm",
    chunk_len=340,            # 340 frames x 80 ms = 27.2 s per chunk; ~30.4 s latency with the 40-frame right context
    inference_batch_size=1,
)
# `pipeline` is an existing nemo_curator.pipeline.Pipeline instance
pipeline.add_stage(streaming)

This stage does not fan out per speaker — instead it writes a diar_segments list onto the input AudioTask. Use it as a metadata-enriching stage; downstream code consumes the diar_segments field directly.
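
As a rough idea of what downstream code might do with that field, the sketch below computes per-speaker talk time from a manifest entry. The segment schema is an assumption made for illustration (a list of dicts with `start`, `end`, and `speaker` keys); check the actual structure the stage emits before relying on it. `talk_time` is a hypothetical helper.

```python
from collections import defaultdict


def talk_time(diar_segments: list[dict]) -> dict[str, float]:
    """Sum the duration of every segment per speaker label."""
    totals: dict[str, float] = defaultdict(float)
    for seg in diar_segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)


# Hypothetical manifest entry after the streaming stage has run.
# The layout of each segment is assumed, not taken from the library.
entry = {
    "audio_filepath": "call_001.wav",
    "diar_segments": [
        {"start": 0.0, "end": 3.5, "speaker": "speaker_0"},
        {"start": 3.5, "end": 7.0, "speaker": "speaker_1"},
        {"start": 7.0, "end": 9.0, "speaker": "speaker_0"},
    ],
}

totals = talk_time(entry["diar_segments"])
dominant = max(totals, key=totals.get)
print(totals, dominant)
```

A filter built this way can, for example, skip files where no single speaker accounts for enough audio, without ever splitting the waveform.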

Step 2: Tune Latency

chunk_len controls latency vs accuracy:

`chunk_len`	Latency	Accuracy
100 (8 s chunks)	Low	Lower (less context)
340 (default, 27.2 s chunks; ~30.4 s latency with the default right context)	Medium	Good
600 (48 s chunks)	High	Best
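
The frame arithmetic behind these figures is easy to reproduce. The sketch below converts `chunk_len` to seconds at 80 ms per frame and, as an assumption inferred from the ~30.4 s figure quoted for the default settings, estimates latency as chunk length plus right context. Both helper names are hypothetical.

```python
FRAME_MS = 80  # one frame is 80 ms, per the parameter table


def chunk_seconds(chunk_len: int) -> float:
    """Duration of one streaming chunk in seconds."""
    return chunk_len * FRAME_MS / 1000


def approx_latency_seconds(chunk_len: int, chunk_right_context: int = 40) -> float:
    """Assumed latency estimate: chunk plus right-context frames, in seconds."""
    return (chunk_len + chunk_right_context) * FRAME_MS / 1000


for chunk_len in (100, 340, 600):
    print(chunk_len, chunk_seconds(chunk_len), approx_latency_seconds(chunk_len))
```

For the default `chunk_len=340` this gives 27.2 s of audio per chunk and 30.4 s once the 40 right-context frames are included.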

Streaming-mode evaluation on CallHome-eng0 (139 files) at the default settings: 6.2% macro DER, 6.0% weighted DER at a 0.25-second collar.
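
The two DER figures differ in how files are aggregated. The sketch below assumes the common convention, which the text above does not spell out: per-file DER is (missed speech + false alarm + speaker confusion) divided by reference speech time, macro DER is the unweighted mean of per-file values, and weighted DER pools error time and speech time across files. The numbers are invented for illustration and are unrelated to the CallHome results.

```python
from dataclasses import dataclass


@dataclass
class FileErrors:
    """Per-file error times in seconds (hypothetical values)."""
    missed: float
    false_alarm: float
    confusion: float
    reference_speech: float

    @property
    def error_time(self) -> float:
        return self.missed + self.false_alarm + self.confusion

    @property
    def der(self) -> float:
        return self.error_time / self.reference_speech


def macro_der(files: list[FileErrors]) -> float:
    """Unweighted mean of per-file DER: every file counts equally."""
    return sum(f.der for f in files) / len(files)


def weighted_der(files: list[FileErrors]) -> float:
    """Pooled DER: long files contribute more than short ones."""
    return sum(f.error_time for f in files) / sum(f.reference_speech for f in files)


files = [
    FileErrors(missed=1.0, false_alarm=0.5, confusion=0.5, reference_speech=100.0),
    FileErrors(missed=2.0, false_alarm=1.0, confusion=2.0, reference_speech=50.0),
]

print(f"macro DER:    {macro_der(files):.3f}")
print(f"weighted DER: {weighted_der(files):.3f}")
```

When short files are harder than long ones, as in this toy input, the macro figure comes out higher than the weighted one.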

`InferenceSortformerStage` Parameters

Parameter	Type	Default	Description
`model_name`	str	`"nvidia/diar_streaming_sortformer_4spk-v2.1"`	Hugging Face model ID.
`model_path`	str | None	`None`	Local `.nemo` checkpoint; overrides `model_name` when set.
`cache_dir`	str | None	`None`	Cache dir for downloaded model weights.
`filepath_key`	str	`"audio_filepath"`	Manifest key with the audio path.
`diar_segments_key`	str	`"diar_segments"`	Output manifest key for the diarization segment list.
`rttm_out_dir`	str | None	`None`	Optional directory to write per-file RTTM.
`chunk_len`	int	`340`	Streaming chunk size in 80 ms frames.
`chunk_left_context`	int	`1`	Left-context frames retained between chunks.
`chunk_right_context`	int	`40`	Right-context frames retained between chunks.
`fifo_len`	int	`40`	FIFO queue size in frames.
`spkcache_update_period`	int	`300`	Speaker-cache update period in frames.
`spkcache_len`	int	`188`	Speaker-cache size in frames.
`inference_batch_size`	int	`1`	Batch size passed to `diarize()`.

Default resource allocation: Resources(cpus=1.0, gpu_memory_gb=8.0).

Complete Speaker Separation Pipeline

A pipeline that diarizes, then runs per-speaker quality filters:

from nemo_curator.pipeline import Pipeline
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.stages.audio.preprocessing.mono_conversion import MonoConversionStage
from nemo_curator.stages.audio.segmentation.vad_segmentation import VADSegmentationStage
from nemo_curator.stages.audio.preprocessing.concatenation import SegmentConcatenationStage
from nemo_curator.stages.audio.segmentation.speaker_separation import SpeakerSeparationStage
from nemo_curator.stages.audio.filtering.utmos import UTMOSFilterStage
from nemo_curator.stages.audio.postprocessing.timestamp_mapper import TimestampMapperStage
from nemo_curator.stages.audio.io.convert import AudioToDocumentStage
from nemo_curator.stages.text.io.writer import JsonlWriter

pipeline = Pipeline(name="speaker_diarization")

# 1. Normalize and segment
pipeline.add_stage(MonoConversionStage(output_sample_rate=48000))
pipeline.add_stage(VADSegmentationStage(min_duration_sec=2.0))

# 2. Concatenate surviving segments per file
pipeline.add_stage(SegmentConcatenationStage(silence_duration_sec=0.5))

# 3. Diarize and fan out per speaker
pipeline.add_stage(SpeakerSeparationStage(exclude_overlaps=True))

# 4. Per-speaker quality filter
pipeline.add_stage(UTMOSFilterStage(mos_threshold=3.5))

# 5. Resolve final timestamps
pipeline.add_stage(
    TimestampMapperStage(
        passthrough_keys=["speaker_id", "num_speakers", "utmos_mos"]
    )
)

# 6. Export
pipeline.add_stage(AudioToDocumentStage())
pipeline.add_stage(JsonlWriter(path="./per_speaker_audio"))

executor = XennaExecutor()
pipeline.run(executor)
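
Once the pipeline has run, the exported records can be inspected with a few lines of standard-library Python. The sketch below rests on several assumptions: that the writer produces `*.jsonl` files with one JSON object per line, and that each record carries the passthrough keys shown above plus an `audio_filepath` key. It uses synthetic records in a temporary directory so it runs without the pipeline; `load_records` and `surviving_speakers` are hypothetical helpers.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def load_records(output_dir: str) -> list[dict]:
    """Read every JSON Lines file in `output_dir` into a list of dicts."""
    records: list[dict] = []
    for path in sorted(Path(output_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    records.append(json.loads(line))
    return records


def surviving_speakers(records: list[dict]) -> dict[str, list[int]]:
    """Map each source file to the speaker IDs that passed the quality filter."""
    by_file: dict[str, list[int]] = defaultdict(list)
    for rec in records:
        by_file[rec["audio_filepath"]].append(rec["speaker_id"])
    return {path: sorted(ids) for path, ids in by_file.items()}


# Synthetic stand-in for ./per_speaker_audio; the record layout is assumed.
sample = [
    {"audio_filepath": "call_001.wav", "speaker_id": 0, "num_speakers": 2, "utmos_mos": 3.9},
    {"audio_filepath": "call_001.wav", "speaker_id": 1, "num_speakers": 2, "utmos_mos": 3.6},
    {"audio_filepath": "call_002.wav", "speaker_id": 2, "num_speakers": 3, "utmos_mos": 4.1},
]
with tempfile.TemporaryDirectory() as tmp:
    text = "\n".join(json.dumps(r) for r in sample) + "\n"
    (Path(tmp) / "part-0.jsonl").write_text(text, encoding="utf-8")
    records = load_records(tmp)

survivors = surviving_speakers(records)
print(survivors)
```

Comparing the number of surviving speakers with `num_speakers` shows how many speakers per file were dropped by the per-speaker filter.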

Best Practices

Use offline mode unless you specifically need streaming: SpeakerSeparationStage is faster and more accurate than InferenceSortformerStage for batch curation.
Run VAD + concat before diarization: feeding diarization a clean concatenated speech-only waveform (no long silences) is cheaper and more reliable than feeding raw audio.
Pair with per-speaker quality filters: place the filtering chain (VAD → Band → UTMOS → SIGMOS) after speaker separation so each speaker’s audio is scored independently. Bad speakers get dropped; good speakers from the same file are kept.
Mind the 4-speaker model limit: both stages target up to 4 speakers per file. Files with more speakers will likely produce degraded diarization.
Keep exclude_overlaps=True for training data: overlapping speech is hard for downstream models, so set exclude_overlaps=False only when you explicitly want to preserve natural conversation.

Related Topics

Preprocessing Stages — SegmentConcatenationStage and TimestampMapperStage are typically paired with speaker separation.
VAD Segmentation — typical upstream stage producing the segments fed into diarization.
AudioDataFilterStage Composite — bundles offline speaker separation with per-speaker filters into the standard pipeline.
