ALM Data Curation

Curate training data for audio language models by extracting fixed-duration windows from diarized audio segments. The ALM stages read JSONL manifests, build candidate windows that meet quality constraints, remove overlapping windows, and write the filtered results.

How it Works

The ALM pipeline processes audio manifests through a four-stage chain:

1. ALMManifestReader reads JSONL manifests line by line, producing one AudioTask per entry.
2. ALMDataBuilderStage constructs candidate windows from consecutive segments, applying sample rate, bandwidth, speaker count, and duration constraints.
3. ALMDataOverlapStage removes windows that share too much audio content, keeping the windows closest to the target duration.
4. ALMManifestWriterStage writes the filtered results as JSONL.

All stages run on CPU and support both Xenna and Ray Data backends.
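
To make the first stage concrete, the following is a minimal sketch of line-by-line JSONL reading using only the Python standard library. It is not the ALMManifestReader implementation: it yields plain dictionaries rather than AudioTask objects, and the field names (`audio_filepath`, `segments`, `start`, `end`, `speaker`) are hypothetical placeholders chosen for illustration.

```python
import io
import json
from typing import Iterable, Iterator


def read_manifest(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dictionary per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines between records
            yield json.loads(line)


# Hypothetical manifest content; real manifests may use different field names.
sample = io.StringIO(
    '{"audio_filepath": "a.wav", "segments": '
    '[{"start": 0.0, "end": 4.2, "speaker": "spk0"}]}\n'
    "\n"
    '{"audio_filepath": "b.wav", "segments": []}\n'
)

entries = list(read_manifest(sample))
for entry in entries:
    print(entry["audio_filepath"], len(entry["segments"]))
```

Reading one line at a time keeps memory use proportional to a single entry instead of the whole manifest, which matters when manifests are large.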

ALM Stages

ALM Data Builder

Construct candidate training windows from diarized audio segments with quality filtering.
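
The windowing idea can be sketched in plain Python as follows. This is an illustrative approximation, not the ALMDataBuilderStage code: it assumes `tolerance` is a fraction of the target duration, that each segment is a dictionary with hypothetical `start`, `end`, and `speaker` keys, and that the first qualifying window from each starting segment is kept. Sample rate and bandwidth checks are omitted.

```python
def build_windows(segments, target=120.0, tolerance=0.1,
                  min_speakers=2, max_speakers=5):
    """Grow a window from each segment until its duration is near the target."""
    low = target * (1 - tolerance)   # shortest acceptable window
    high = target * (1 + tolerance)  # longest acceptable window
    windows = []
    for i in range(len(segments)):
        speakers = set()
        for j in range(i, len(segments)):
            speakers.add(segments[j]["speaker"])
            duration = segments[j]["end"] - segments[i]["start"]
            if duration > high:
                break  # adding more segments only makes the window longer
            if duration >= low and min_speakers <= len(speakers) <= max_speakers:
                windows.append({
                    "start": segments[i]["start"],
                    "end": segments[j]["end"],
                    "speakers": sorted(speakers),
                })
                break  # assumption: keep one window per starting segment
    return windows


# Five consecutive 30-second segments alternating between two speakers.
segments = [
    {"start": 30.0 * k, "end": 30.0 * (k + 1), "speaker": f"spk{k % 2}"}
    for k in range(5)
]
windows = build_windows(segments)
for w in windows:
    print(w["start"], w["end"], w["speakers"])
```

With a 120-second target and a tolerance of 0.1, this sketch accepts windows between 108 and 132 seconds, so the sample input produces two windows: one starting at 0 seconds and one starting at 30 seconds.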

ALM Overlap Filtering

Remove redundant overlapping windows based on configurable thresholds.
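
One way to picture overlap filtering is a greedy pass that ranks windows by how close they are to the target duration and keeps a window only if it does not overlap too much with any window already kept. The sketch below is an assumption-laden illustration, not the ALMDataOverlapStage algorithm: in particular, it measures overlap as shared time divided by the shorter window's duration, and the `start` and `end` keys are hypothetical.

```python
def filter_overlaps(windows, overlap_percentage=50, target_duration=120.0):
    """Greedily keep windows closest to the target that do not overlap too much."""

    def duration(w):
        return w["end"] - w["start"]

    def overlap_pct(a, b):
        shared = min(a["end"], b["end"]) - max(a["start"], b["start"])
        if shared <= 0:
            return 0.0  # windows do not intersect
        # Assumption: overlap is relative to the shorter of the two windows.
        return 100.0 * shared / min(duration(a), duration(b))

    # Windows closest to the target duration get first claim on the audio.
    ranked = sorted(windows, key=lambda w: abs(duration(w) - target_duration))
    kept = []
    for w in ranked:
        if all(overlap_pct(w, k) <= overlap_percentage for k in kept):
            kept.append(w)
    return sorted(kept, key=lambda w: w["start"])


candidates = [
    {"start": 0.0, "end": 120.0},
    {"start": 30.0, "end": 145.0},   # overlaps heavily with the first window
    {"start": 100.0, "end": 220.0},  # overlaps the first window by 20 seconds
    {"start": 300.0, "end": 420.0},
]
kept = filter_overlaps(candidates)
print([(w["start"], w["end"]) for w in kept])
```

In this example the window from 30 to 145 seconds is dropped because it shares 90 of its 115 seconds with the window from 0 to 120 seconds, which exceeds the 50 percent threshold; the other three windows are kept.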

Quick Example

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.audio.alm import (
    ALMManifestReader,
    ALMDataBuilderStage,
    ALMDataOverlapStage,
    ALMManifestWriterStage,
)

pipeline = Pipeline(name="alm_curation")

# Read input manifests
pipeline.add_stage(ALMManifestReader(manifest_path="/data/manifests/"))

# Build 120-second training windows
pipeline.add_stage(
    ALMDataBuilderStage(
        target_window_duration=120.0,
        tolerance=0.1,
        min_speakers=2,
        max_speakers=5,
    )
)

# Remove windows with more than 50% overlap
pipeline.add_stage(
    ALMDataOverlapStage(
        overlap_percentage=50,
        target_duration=120.0,
    )
)

# Write results
pipeline.add_stage(ALMManifestWriterStage(output_path="/data/output/alm.jsonl"))

Related Topics

ALM Pipeline Concepts: Architectural overview of the ALM pipeline
ALM Tutorial: Step-by-step guide with sample data
Manifests and Ingest: General manifest format concepts
