ALM Data Curation
Curate training data for audio language models by extracting fixed-duration windows from diarized audio segments. The ALM stages read JSONL manifests, build candidate windows that meet quality constraints, remove overlapping windows, and write the filtered results.
How it Works
The ALM pipeline processes audio manifests through a four-stage chain:
- ALMManifestReader reads JSONL manifests line by line, producing one AudioTask per entry
- ALMDataBuilderStage constructs candidate windows from consecutive segments, applying sample rate, bandwidth, speaker count, and duration constraints
- ALMDataOverlapStage removes windows that share too much audio content, keeping windows closest to the target duration
- ALMManifestWriterStage writes filtered results as JSONL
All stages run on CPU and support both Xenna and Ray Data backends.
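The window-building step can be illustrated with a small standalone sketch. This is not the ALMDataBuilderStage implementation. The Segment fields (start, end, speaker), the build_windows helper, and the reading of tolerance as a fraction of the target duration are all assumptions made for illustration, and the sample rate and bandwidth checks are left out.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    # Hypothetical segment layout; the real manifest fields may differ.
    start: float  # seconds
    end: float    # seconds
    speaker: str


def build_windows(segments, target=120.0, tolerance=0.1,
                  min_speakers=2, max_speakers=5):
    """Return (start, end, speakers) for runs of consecutive segments whose
    span lies within target * (1 - tolerance) .. target * (1 + tolerance).

    Assumes segments are sorted by start time with non-decreasing end times.
    """
    lo = target * (1 - tolerance)
    hi = target * (1 + tolerance)
    windows = []
    for i in range(len(segments)):
        speakers = set()
        for j in range(i, len(segments)):
            speakers.add(segments[j].speaker)
            duration = segments[j].end - segments[i].start
            if duration > hi:
                break  # extending the run can only make the window longer
            if duration >= lo and min_speakers <= len(speakers) <= max_speakers:
                windows.append((segments[i].start, segments[j].end, sorted(speakers)))
    return windows


# Five 30-second segments alternating between two speakers.
segments = [
    Segment(i * 30.0, (i + 1) * 30.0, "A" if i % 2 == 0 else "B")
    for i in range(5)
]
for window in build_windows(segments):
    print(window)
```

With this input the sketch yields two candidate windows, 0 to 120 s and 30 to 150 s, each containing both speakers. The two windows share most of their audio, which is the kind of redundancy the overlap stage is meant to remove.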
ALM Stages
- ALM Data Builder: Construct candidate training windows from diarized audio segments with quality filtering. Tags: windowing, speaker-count, bandwidth
- ALM Overlap Filtering: Remove redundant overlapping windows based on configurable thresholds. Tags: deduplication, overlap-ratio, target-duration
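As a rough illustration of overlap filtering, the sketch below ranks windows by how close they are to the target duration and greedily keeps a window only if it does not overlap an already kept window by more than the threshold. It is a sketch under stated assumptions, not the ALMDataOverlapStage implementation: the overlap_ratio and filter_overlaps helpers are hypothetical, windows are plain (start, end) tuples, and overlap is measured against the shorter of the two windows, which may differ from the library's definition.

```python
def overlap_ratio(a, b):
    """Shared duration of two (start, end) windows, divided by the
    duration of the shorter window. Returns 0.0 when they do not overlap."""
    shared = min(a[1], b[1]) - max(a[0], b[0])
    if shared <= 0:
        return 0.0
    return shared / min(a[1] - a[0], b[1] - b[0])


def filter_overlaps(windows, overlap_percentage=50, target_duration=120.0):
    """Greedy filter: visit windows closest to target_duration first and
    keep one only if its overlap with every kept window is within the limit."""
    ranked = sorted(windows, key=lambda w: abs((w[1] - w[0]) - target_duration))
    kept = []
    for w in ranked:
        if all(overlap_ratio(w, k) * 100 <= overlap_percentage for k in kept):
            kept.append(w)
    return sorted(kept)  # restore chronological order


candidates = [(0.0, 120.0), (30.0, 150.0), (115.0, 240.0), (200.0, 330.0)]
print(filter_overlaps(candidates))
```

Here the window from 30 to 150 s is dropped because it shares 90 of its 120 seconds with the window from 0 to 120 s. The other three windows overlap by less than 50% and are kept.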
Quick Example
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.audio.alm import (
    ALMManifestReader,
    ALMDataBuilderStage,
    ALMDataOverlapStage,
    ALMManifestWriterStage,
)

pipeline = Pipeline(name="alm_curation")

# Read input manifests
pipeline.add_stage(ALMManifestReader(manifest_path="/data/manifests/"))

# Build 120-second training windows
pipeline.add_stage(
    ALMDataBuilderStage(
        target_window_duration=120.0,
        tolerance=0.1,
        min_speakers=2,
        max_speakers=5,
    )
)

# Remove windows with more than 50% overlap
pipeline.add_stage(
    ALMDataOverlapStage(
        overlap_percentage=50,
        target_duration=120.0,
    )
)

# Write results
pipeline.add_stage(ALMManifestWriterStage(output_path="/data/output/alm.jsonl"))
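Once the pipeline has run, the output can be checked with a few lines of standard-library Python. The read_jsonl and summarize_manifest helpers below are hypothetical and only assume that each non-empty line of the file is one JSON object. The output field names are not listed here, so the sketch reports whatever top-level keys it finds rather than assuming any.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize_manifest(path):
    """Return (entry_count, sorted top-level keys seen across all entries).

    Assumes every entry is a JSON object.
    """
    count = 0
    keys = set()
    for entry in read_jsonl(path):
        count += 1
        keys.update(entry)
    return count, sorted(keys)
```

For example, summarize_manifest("/data/output/alm.jsonl") returns the number of entries written together with the field names present, which is a quick way to confirm that the filter stages left some windows in place.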
Related Topics
- ALM Pipeline Concepts: Architectural overview of the ALM pipeline
- ALM Tutorial: Step-by-step guide with sample data
- Manifests and Ingest: General manifest format concepts