ALM Pipeline Tutorial
Learn how to curate training data for audio language models using NVIDIA NeMo Curator’s ALM pipeline. This tutorial walks you through reading diarized audio manifests, constructing fixed-duration training windows, filtering overlapping windows, and writing the results.
Overview
This tutorial demonstrates the ALM data curation workflow:
- Read Manifests: Stream JSONL manifests with diarized audio metadata
- Build Windows: Construct candidate training windows from consecutive segments
- Filter Overlaps: Remove redundant windows that share too much audio content
- Write Results: Export filtered windows as JSONL for downstream training
What you will learn:
- How to configure and run the four-stage ALM pipeline
- How to tune window duration, speaker count, and quality thresholds
- How to select between the Xenna and Ray Data backends
- How to interpret pipeline output and loss statistics
Working Example Location
The complete working code for this tutorial is located at:
<nemo_curator_repository>/tutorials/audio/alm/
├── README.md # Tutorial documentation
├── main.py # Hydra-based pipeline runner
└── pipeline.yaml # Pipeline configuration
Accessing the code:
git clone https://github.com/NVIDIA/NeMo-Curator.git
cd NeMo-Curator/tutorials/audio/alm/
Prerequisites
- NeMo Curator installed with audio extras (refer to the Installation Guide)
- Python 3.10 or later
- Input data in JSONL format with diarized segments (refer to the input format section)
- Basic familiarity with Hydra configuration
Input Format
Each line of the input JSONL manifest must contain the following fields:
{
  "audio_filepath": "/path/to/audio.wav",
  "audio_sample_rate": 16000,
  "segments": [
    {
      "start": 0.0,
      "end": 5.2,
      "speaker": "speaker_0",
      "text": "transcript text",
      "metrics": {"bandwidth": 8000}
    }
  ]
}
Required fields:
- audio_filepath: Path to the source audio file
- audio_sample_rate: Sample rate in Hz (entries below min_sample_rate are skipped)
- segments: Array of diarized speech segments, each with start, end, speaker, and metrics.bandwidth
Sample input data is available at tests/fixtures/audio/alm/sample_input.jsonl in the repository.
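Before running the pipeline on your own data, it can help to check each manifest line against the required fields listed above. The following is a minimal sketch using only the Python standard library; the function name find_manifest_problems and the wording of its messages are illustrative assumptions and are not part of NeMo Curator.

```python
import json

# Field names taken from the "Required fields" list in this tutorial.
REQUIRED_TOP_LEVEL = ("audio_filepath", "audio_sample_rate", "segments")
REQUIRED_SEGMENT = ("start", "end", "speaker")


def find_manifest_problems(line: str) -> list[str]:
    """Return human-readable problems found in one JSONL manifest line.

    Hypothetical helper for illustration; not a NeMo Curator API.
    """
    entry = json.loads(line)
    problems = [
        f"missing top-level field: {key}"
        for key in REQUIRED_TOP_LEVEL
        if key not in entry
    ]
    for i, seg in enumerate(entry.get("segments", [])):
        for key in REQUIRED_SEGMENT:
            if key not in seg:
                problems.append(f"segment {i}: missing field: {key}")
        if "bandwidth" not in seg.get("metrics", {}):
            problems.append(f"segment {i}: missing metrics.bandwidth")
    return problems


good_line = json.dumps({
    "audio_filepath": "/path/to/audio.wav",
    "audio_sample_rate": 16000,
    "segments": [
        {"start": 0.0, "end": 5.2, "speaker": "speaker_0",
         "text": "transcript text", "metrics": {"bandwidth": 8000}},
    ],
})
bad_line = json.dumps({
    "audio_filepath": "/path/to/audio.wav",
    "segments": [{"start": 0.0, "end": 5.2}],
})

print(find_manifest_problems(good_line))  # []
print(find_manifest_problems(bad_line))
```

An empty list means the line has every required field; the second call reports the missing audio_sample_rate, speaker, and metrics.bandwidth fields.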
Step-by-Step Walkthrough
Step 1: Review the Pipeline Configuration
The ALM pipeline is defined in pipeline.yaml with four stages:
stages:
  # Stage 0: Read JSONL manifests
  - _target_: nemo_curator.stages.audio.alm.ALMManifestReader
    manifest_path: ${manifest_path}
    files_per_partition: 1
  # Stage 1: Build candidate windows
  - _target_: nemo_curator.stages.audio.alm.ALMDataBuilderStage
    target_window_duration: 120.0
    tolerance: 0.1
    min_sample_rate: 16000
    min_bandwidth: 8000
    min_speakers: 2
    max_speakers: 5
    truncation: true
    drop_fields: "words"
    drop_fields_top_level: "words,segments"
  # Stage 2: Filter overlapping windows
  - _target_: nemo_curator.stages.audio.alm.ALMDataOverlapStage
    overlap_percentage: 50
    target_duration: 120.0
  # Stage 3: Write filtered output
  - _target_: nemo_curator.stages.audio.alm.ALMManifestWriterStage
    output_path: ${output_dir}/alm_output.jsonl
Step 2: Understand the Configuration Parameters
The following tables describe the key parameters for the builder and overlap stages:
ALMDataBuilderStage parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| target_window_duration | float | 120.0 | Target window length in seconds |
| tolerance | float | 0.1 | Acceptable deviation from target (10% means 108 to 132 seconds) |
| min_sample_rate | int | 16,000 | Minimum sample rate in Hz |
| min_bandwidth | int | 8,000 | Minimum bandwidth per segment in Hz |
| min_speakers | int | 2 | Minimum distinct speakers per window |
| max_speakers | int | 5 | Maximum distinct speakers per window |
| truncation | bool | True | Truncate segments exceeding maximum duration |
| drop_fields | str | "words" | Comma-separated segment-level fields to remove |
| drop_fields_top_level | str | "words,segments" | Comma-separated entry-level fields to remove |
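The tolerance parameter is a fraction of target_window_duration, so the accepted duration range follows from simple arithmetic. The sketch below reproduces the 108 to 132 second range from the table; the helper names are hypothetical, and treating both bounds as inclusive is an assumption, since the tutorial does not specify how the stage handles exact boundary values.

```python
def window_bounds(target: float, tolerance: float) -> tuple[float, float]:
    """Return the (shortest, longest) accepted window duration in seconds."""
    return target * (1.0 - tolerance), target * (1.0 + tolerance)


def within_tolerance(duration: float, target: float, tolerance: float) -> bool:
    """Check a duration against the bounds (assumption: bounds are inclusive)."""
    low, high = window_bounds(target, tolerance)
    return low <= duration <= high


for target, tolerance in [(120.0, 0.1), (30.0, 0.2)]:
    low, high = window_bounds(target, tolerance)
    print(f"target={target:.0f}s tolerance={tolerance:.0%} -> {low:.1f}s to {high:.1f}s")
# target=120s tolerance=10% -> 108.0s to 132.0s
# target=30s tolerance=20% -> 24.0s to 36.0s
```

The second line of output corresponds to a 30-second target with a tolerance of 0.2, which accepts windows between 24 and 36 seconds.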
ALMDataOverlapStage parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| overlap_percentage | int | 0 | Overlap threshold (0 = aggressive, 100 = keep all) |
| target_duration | float | 120.0 | Preferred window duration for tie-breaking |
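To build intuition for the overlap threshold, the following sketch computes how much audio two windows share. It rests on assumptions that this tutorial does not confirm: overlap is measured here as shared time divided by the shorter window's duration, and a pair counts as redundant only when that percentage exceeds overlap_percentage. The actual ALMDataOverlapStage may define overlap and tie-breaking differently, so treat this as an illustration of the idea only.

```python
def overlap_percent(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Shared duration of two (start, end) windows as a percentage of the shorter one.

    Assumed definition for illustration; not taken from the library.
    """
    shared = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    shorter = min(a[1] - a[0], b[1] - b[0])
    return 100.0 * shared / shorter if shorter > 0 else 0.0


def exceeds_threshold(a, b, overlap_percentage: float) -> bool:
    """Assumption: a pair is redundant when its overlap is above the threshold."""
    return overlap_percent(a, b) > overlap_percentage


first = (0.0, 120.0)
second = (90.0, 210.0)  # shares 30 of 120 seconds with the first window

print(overlap_percent(first, second))        # 25.0
print(exceeds_threshold(first, second, 50))  # False
print(exceeds_threshold(first, second, 0))   # True
```

Under this reading, a threshold of 0 flags any pair that shares audio, while a threshold of 100 can never be exceeded, which matches the "0 = aggressive, 100 = keep all" description in the table.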
Step 3: Run the Pipeline
Run the pipeline using the Hydra-based runner:
# Using default Xenna backend
python main.py \
--config-path . \
--config-name pipeline \
manifest_path=/path/to/manifests \
output_dir=./alm_output
# Using Ray Data backend
python main.py \
--config-path . \
--config-name pipeline \
manifest_path=/path/to/manifests \
output_dir=./alm_output \
backend=ray_data
Override individual stage parameters from the command line:
# Shorter windows with stricter overlap filtering
python main.py \
--config-path . \
--config-name pipeline \
manifest_path=/path/to/manifests \
output_dir=./alm_output \
stages.1.target_window_duration=60 \
stages.2.overlap_percentage=30
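In these overrides, the number after stages is the zero-based position of the stage in the stages list of pipeline.yaml, so stages.1 is the builder stage and stages.2 is the overlap stage. The plain-Python sketch below illustrates that mapping on a dictionary that mirrors part of the configuration. The apply_override helper is hypothetical and is not how Hydra is implemented; it only shows which value each dotted key addresses.

```python
def apply_override(config, dotted_key, value):
    """Set a nested value in place; numeric path parts index into lists.

    Illustrative helper only; Hydra performs its own override parsing.
    """
    *parents, last = dotted_key.split(".")
    node = config
    for part in parents:
        node = node[int(part)] if isinstance(node, list) else node[part]
    if isinstance(node, list):
        node[int(last)] = value
    else:
        node[last] = value
    return config


# A trimmed mirror of pipeline.yaml, keeping only the keys needed here.
config = {
    "stages": [
        {"_target_": "nemo_curator.stages.audio.alm.ALMManifestReader"},
        {"_target_": "nemo_curator.stages.audio.alm.ALMDataBuilderStage",
         "target_window_duration": 120.0},
        {"_target_": "nemo_curator.stages.audio.alm.ALMDataOverlapStage",
         "overlap_percentage": 50},
        {"_target_": "nemo_curator.stages.audio.alm.ALMManifestWriterStage"},
    ]
}

apply_override(config, "stages.1.target_window_duration", 60)
apply_override(config, "stages.2.overlap_percentage", 30)

print(config["stages"][1]["target_window_duration"])  # 60
print(config["stages"][2]["overlap_percentage"])      # 30
```

If you reorder or insert stages in pipeline.yaml, the indices in your command-line overrides need to change accordingly.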
Step 4: Run with the Sample Data
Test the pipeline with the included sample data. Run the command from the repository root so that the fixture path matches the one used in the in-repo tutorials/audio/alm/README.md:
# From the NeMo-Curator repo root
python tutorials/audio/alm/main.py \
--config-path . \
--config-name pipeline \
manifest_path=tests/fixtures/audio/alm/sample_input.jsonl \
output_dir=./sample_output
Expected output with sample data (five input entries):
- 181 candidate windows from the builder stage
- 25 filtered windows after overlap filtering at 50% threshold
- Approximately 3,035 seconds of total filtered audio duration
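One way to compare your run against these numbers is to total the windows and durations in the output file. The sketch below assumes the output fields windows, filtered_windows, and filtered_dur documented in this tutorial, and assumes the writer produced sample_output/alm_output.jsonl relative to the directory you ran the command from; adjust the path if your output landed elsewhere. The summarize_output helper is hypothetical.

```python
import json
from pathlib import Path


def summarize_output(lines):
    """Total candidate windows, filtered windows, and filtered duration.

    `lines` is any iterable of JSONL strings, such as an open file handle.
    """
    totals = {
        "entries": 0,
        "candidate_windows": 0,
        "filtered_windows": 0,
        "filtered_dur": 0.0,
    }
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        entry = json.loads(line)
        totals["entries"] += 1
        totals["candidate_windows"] += len(entry.get("windows", []))
        totals["filtered_windows"] += len(entry.get("filtered_windows", []))
        totals["filtered_dur"] += entry.get("filtered_dur", 0.0)
    return totals


# Assumed output location for the sample run above.
output_file = Path("sample_output/alm_output.jsonl")
if output_file.exists():
    with output_file.open(encoding="utf-8") as fh:
        print(summarize_output(fh))
```

For the sample data, the totals should line up with the entry count, candidate window count, filtered window count, and filtered duration listed above.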
Understanding the Results
After the pipeline completes, the output JSONL file contains one line per input entry. The example below shows the most common fields, with the pre-filter candidate windows list abbreviated to a placeholder. Real output also includes additional duration and diagnostic fields (dur_lost_bw, dur_lost_sr, audio_sample_rate, manifest_filepath) that are omitted here for brevity.
{
  "audio_filepath": "/path/to/audio.wav",
  "windows": ["<all candidate windows from the builder stage>"],
  "filtered_windows": [
    {
      "segments": [
        {"start": 0.0, "end": 5.2, "speaker": "speaker_0"}
      ],
      "speaker_durations": [45.2, 38.1, 22.5, 14.2, 0.0]
    }
  ],
  "filtered_dur": 120.5,
  "filtered_dur_list": [120.5],
  "total_dur_window": 3250.0,
  "truncation_events": 3,
  "stats": {
    "total_segments": 150,
    "total_dur": 3600.0,
    "lost_bw": 5,
    "lost_sr": 0,
    "lost_spk": 12,
    "lost_win": 8,
    "lost_no_spkr": 2,
    "lost_next_seg_bm": 1
  }
}
Key output fields:
- windows: All candidate windows produced by ALMDataBuilderStage before overlap filtering (preserved so you can compare pre- and post-filter results)
- filtered_windows: Windows that passed both quality and overlap filtering
- speaker_durations: Top five speakers by duration within each window, zero-padded to length five
- filtered_dur: Total duration of all filtered windows for this entry
- filtered_dur_list: Duration of each individual filtered window
- total_dur_window: Total duration of all input windows before filtering
- stats: Breakdown of why segments were excluded (bandwidth, sample rate, speaker count, window constraints)
- truncation_events: Number of segments that were truncated to fit within the maximum window duration
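The speaker_durations field can be easier to interpret with a concrete computation. The sketch below derives a comparable list from a window's segments by summing the time per speaker, sorting in descending order, and padding with zeros to length five. It follows the field description above, but the exact ordering and any rounding applied by the pipeline are assumptions, and the speaker_durations function is a hypothetical helper.

```python
from collections import defaultdict


def speaker_durations(segments, top_n=5):
    """Per-speaker talk time, longest first, zero-padded to `top_n` values."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    ranked = sorted(totals.values(), reverse=True)[:top_n]
    return ranked + [0.0] * (top_n - len(ranked))


window_segments = [
    {"start": 0.0, "end": 4.0, "speaker": "speaker_0"},
    {"start": 4.0, "end": 9.0, "speaker": "speaker_1"},
    {"start": 9.0, "end": 12.0, "speaker": "speaker_0"},
]

print(speaker_durations(window_segments))  # [7.0, 5.0, 0.0, 0.0, 0.0]
```

A trailing run of zeros, as in the tutorial's example output, indicates that the window contains fewer than five distinct speakers.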
Reading the Loss Statistics
The stats dictionary helps diagnose low pipeline yield:
| Statistic | Meaning | Tuning Action |
|---|---|---|
| lost_bw | Segments below minimum bandwidth | Lower min_bandwidth if audio quality is acceptable |
| lost_sr | Entries below minimum sample rate | Lower min_sample_rate or resample input audio |
| lost_spk | Windows outside speaker count range | Widen min_speakers and max_speakers range |
| lost_win | Windows outside duration tolerance | Increase tolerance or adjust target_window_duration |
| lost_no_spkr | Window growth blocked by a segment without a speaker label (sub-category of lost_win) | Improve upstream diarization or filter out unlabeled segments before curation |
| lost_next_seg_bm | Window growth blocked by a low-bandwidth segment (sub-category of lost_win) | Lower min_bandwidth if the blocked segments are otherwise acceptable |
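When a manifest has many entries, summing the loss counters across the whole output file shows which filter dominates. The following sketch assumes each output line carries a stats dictionary with the keys listed in the table; the aggregate_losses helper is hypothetical. Because lost_no_spkr and lost_next_seg_bm are sub-categories of lost_win, the counters are reported side by side and are not added into a single total.

```python
import json
from collections import Counter

# Counter names taken from the loss statistics table in this tutorial.
LOSS_KEYS = (
    "lost_bw", "lost_sr", "lost_spk",
    "lost_win", "lost_no_spkr", "lost_next_seg_bm",
)


def aggregate_losses(lines):
    """Sum each loss counter over an iterable of output JSONL lines."""
    totals = Counter()
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        stats = json.loads(line).get("stats", {})
        for key in LOSS_KEYS:
            totals[key] += stats.get(key, 0)
    return totals


# One output line carrying the stats values from the example above.
example_line = json.dumps({"stats": {
    "lost_bw": 5, "lost_sr": 0, "lost_spk": 12,
    "lost_win": 8, "lost_no_spkr": 2, "lost_next_seg_bm": 1,
}})

losses = aggregate_losses([example_line])
print(losses.most_common(1))  # [('lost_spk', 12)]
```

In this example the speaker count range is the largest source of loss, so min_speakers and max_speakers would be the first parameters to revisit.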
Customization Examples
Shorter Windows for Fine-Tuning
stages:
  - _target_: nemo_curator.stages.audio.alm.ALMDataBuilderStage
    target_window_duration: 30.0
    tolerance: 0.2
    min_speakers: 2
    max_speakers: 3
Permissive Filtering for Maximum Yield
stages:
  - _target_: nemo_curator.stages.audio.alm.ALMDataBuilderStage
    min_bandwidth: 4000
    min_sample_rate: 8000
    min_speakers: 1
    max_speakers: 10
  - _target_: nemo_curator.stages.audio.alm.ALMDataOverlapStage
    overlap_percentage: 80
Processing Multiple Manifest Files
Pass a list of paths or a directory:
python main.py \
--config-path . \
--config-name pipeline \
manifest_path=/data/manifests/ \
output_dir=./alm_output
The ALMManifestReader discovers all .jsonl and .json files in the directory and its subdirectories.
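To preview which files a directory argument would pick up, you can approximate the discovery rule with pathlib. This sketch is not the reader's implementation: the discover_manifests helper is hypothetical, and details such as sort order and symlink handling are assumptions. It only mirrors the stated rule of collecting .jsonl and .json files from a directory and its subdirectories.

```python
import tempfile
from pathlib import Path

MANIFEST_SUFFIXES = {".jsonl", ".json"}


def discover_manifests(path):
    """Return manifest files under `path` (or `path` itself if it is a file)."""
    root = Path(path)
    if root.is_file():
        return [root]
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix in MANIFEST_SUFFIXES
    )


# Demonstrate on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "nested").mkdir()
    for name in ("a.jsonl", "nested/b.json", "notes.txt"):
        (base / name).write_text("{}\n", encoding="utf-8")
    found = [p.relative_to(base).as_posix() for p in discover_manifests(base)]

print(found)  # ['a.jsonl', 'nested/b.json']
```

Running a check like this before a large job can catch stray .json files in the manifest directory that are not manifests.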
Next Steps
After completing this tutorial, explore:
- ALM Data Builder: Detailed reference for window construction
- ALM Overlap Filtering: Detailed reference for overlap filtering
- ALM Pipeline Concepts: Architectural overview
- Beginner Tutorial: FLEURS-based ASR pipeline for comparison
Related Topics
- Audio Curation Pipeline: Broader audio curation workflow
- Manifests and Ingest: Manifest format concepts
- Execution Backends: Xenna and Ray Data backend details