ALM Overlap Filtering
ALMDataOverlapStage removes redundant training windows that share too much audio content. When two windows overlap beyond a configurable threshold, the stage keeps the window whose duration is closest to the target and discards the other.
How it Works
The stage processes each AudioTask independently:
- Extracts the windows list produced by ALMDataBuilderStage
- Sorts windows by start time
- For each window, compares it against every later window whose start falls before its end, so all pairs that overlap in time are checked, not only adjacent ones, and calculates the overlap ratio (overlap duration divided by the shorter window duration)
- When the overlap ratio meets the threshold, greedily removes the window whose duration is further from target_duration
- Writes filtered results back to the task
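The steps above can be sketched in plain Python. This is an illustrative approximation, not the stage's actual implementation: the window shape (a dict with start and end in seconds), the helper names overlap_ratio and filter_overlaps, and the tie rule (keep the earlier window when both are equally far from the target) are all assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical window shape for this sketch: {"start": seconds, "end": seconds}
Window = Dict[str, float]


def overlap_ratio(a: Window, b: Window) -> float:
    """Overlap duration divided by the shorter window's duration."""
    overlap = min(a["end"], b["end"]) - max(a["start"], b["start"])
    shorter = min(a["end"] - a["start"], b["end"] - b["start"])
    if overlap <= 0 or shorter <= 0:
        return 0.0
    return overlap / shorter


def filter_overlaps(
    windows: List[Window],
    overlap_percentage: int = 0,
    target_duration: float = 120.0,
) -> List[Window]:
    """Greedy overlap filter following the steps described above."""
    threshold = overlap_percentage / 100.0
    ordered = sorted(windows, key=lambda w: w["start"])
    removed = set()  # indices into `ordered` that have been discarded

    for i, current in enumerate(ordered):
        if i in removed:
            continue
        for j in range(i + 1, len(ordered)):
            other = ordered[j]
            if other["start"] >= current["end"]:
                break  # sorted by start, so no later window can overlap
            if j in removed:
                continue
            if overlap_ratio(current, other) >= threshold:
                dist_current = abs((current["end"] - current["start"]) - target_duration)
                dist_other = abs((other["end"] - other["start"]) - target_duration)
                if dist_other < dist_current:
                    removed.add(i)  # the later window is closer to the target
                    break
                removed.add(j)  # assumption: ties keep the earlier window

    return [w for k, w in enumerate(ordered) if k not in removed]


windows = [
    {"start": 0.0, "end": 120.0},
    {"start": 60.0, "end": 150.0},   # overlaps the first window by 60 s
    {"start": 200.0, "end": 320.0},
]
kept = filter_overlaps(windows, overlap_percentage=50, target_duration=120.0)
print(kept)
```

In this sample the first two windows share 60 seconds and the shorter one lasts 90 seconds, so the ratio is about 0.67. That meets a 50% threshold, and the 90-second window is dropped because it is further from the 120-second target.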
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| overlap_percentage | int | 0 | Overlap threshold from 0 to 100. Lower values remove more windows. |
| target_duration | float | 120.0 | Preferred window duration in seconds. When two windows overlap beyond the threshold, the one closer to this duration is kept. |
Overlap Percentage Behavior
| Value | Behavior | Typical Use Case |
|---|---|---|
| 0 | Remove any overlapping windows | Maximum deduplication, smallest output |
| 50 | Remove windows with 50% or more overlap | Balanced yield and diversity |
| 100 | Keep all windows except fully-contained duplicates (ratio = 1.0) | Minimum filtering, largest output |
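To make the table concrete, the short sketch below checks one overlapping pair against each threshold. It assumes the comparison is ratio >= overlap_percentage / 100, which matches the "50% or more" wording above; the helper name pair_is_flagged is hypothetical and not part of the library.

```python
def pair_is_flagged(
    overlap_s: float, dur_a: float, dur_b: float, overlap_percentage: int
) -> bool:
    """True when the overlap ratio meets the threshold (assumed: ratio >= pct / 100)."""
    ratio = overlap_s / min(dur_a, dur_b)
    return ratio >= overlap_percentage / 100.0


# A 100 s window and an 80 s window sharing 40 s of audio: ratio = 40 / 80 = 0.5
partial = {pct: pair_is_flagged(40.0, 100.0, 80.0, pct) for pct in (0, 50, 100)}
print(partial)  # {0: True, 50: True, 100: False}

# An 80 s window fully inside a 100 s window: ratio = 80 / 80 = 1.0
contained = pair_is_flagged(80.0, 100.0, 80.0, 100)
print(contained)  # True
```

A flagged pair loses one of its two windows; which one depends on target_duration.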
Basic Usage
from nemo_curator.stages.audio.alm import ALMDataOverlapStage

# Remove windows with any overlap
overlap_filter = ALMDataOverlapStage(
    overlap_percentage=0,
    target_duration=120.0,
)
Advanced Configuration
Moderate Filtering
from nemo_curator.stages.audio.alm import ALMDataOverlapStage

# Remove a window when it overlaps another by 50% or more
overlap_filter = ALMDataOverlapStage(
    overlap_percentage=50,
    target_duration=120.0,
)
Short-Window Pipeline
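When the builder stage targets shorter windows, set target_duration to the same value used in ALMDataBuilderStage:
from nemo_curator.stages.audio.alm import ALMDataOverlapStage

overlap_filter = ALMDataOverlapStage(
    overlap_percentage=30,
    target_duration=30.0,  # Match ALMDataBuilderStage target
)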
Output Fields
The stage adds the following user-facing fields to each AudioTask:
| Field | Type | Description |
|---|---|---|
| filtered_windows | list | Windows that passed overlap filtering |
| filtered_dur | float | Total duration of filtered windows in seconds |
| filtered_dur_list | list | Duration of each individual filtered window |
| total_dur_window | float | Total duration of all input windows before filtering |
| manifest_filepath | str | Source manifest path carried through from the builder stage |
The stage also writes several intermediate fields (total_dur_list_window, total_dur_list_window_timestamps, filtered, swift_filepath) that are primarily used for internal bookkeeping. The original windows list produced by ALMDataBuilderStage is preserved so downstream consumers can compare pre- and post-filter results.
Tuning the Overlap Threshold
The right threshold depends on your training requirements:
- For diverse training data, use a low overlap_percentage (0 to 30) to maximize the variety of audio content in the training set
- For maximum training volume, use a higher overlap_percentage (70 to 100) to retain more windows at the cost of some redundancy
- For balanced results, use overlap_percentage=50 as a starting point and adjust based on the ratio of filtered_windows to input windows
Monitor the yield by comparing filtered_dur to total_dur_window in the output.
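One way to track this is a small helper that turns the two fields into a single ratio. The sketch below assumes the output fields are available as a plain dict (how you read them from an AudioTask may differ), and the function name overlap_yield is hypothetical.

```python
def overlap_yield(task_data: dict) -> float:
    """Fraction of input window duration that survived overlap filtering."""
    total = task_data.get("total_dur_window", 0.0)
    if total <= 0:
        return 0.0  # no input windows, so nothing to compare against
    return task_data.get("filtered_dur", 0.0) / total


# Hypothetical values for illustration only
sample = {"filtered_dur": 240.0, "total_dur_window": 330.0}
ratio = overlap_yield(sample)
print(f"yield: {ratio:.1%}")
```

A ratio near 1.0 means the stage removed little; a low ratio suggests the threshold is discarding much of the builder's output and may be worth raising.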
Related Topics
- ALM Data Builder: Previous stage in the ALM pipeline
- ALM Pipeline Concepts: Architectural overview
- ALM Tutorial: End-to-end walkthrough with sample data