ASR Pipeline Architecture
This guide describes the architecture of NeMo Curator’s Automatic Speech Recognition (ASR) pipeline, from audio input processing through transcription generation and quality assessment.
Pipeline Overview
The ASR pipeline in NeMo Curator moves audio through three stages (input, processing, and assessment):
graph TD
    A[Audio Files] --> B[AudioTask Creation]
    B --> C[ASR Model Loading]
    C --> D[Batch Inference]
    D --> E[Transcription Output]
    E --> F[Quality Assessment]
    F --> G[Filtering & Export]
    subgraph "Input Stage"
        A
        B
    end
    subgraph "Processing Stage"
        C
        D
        E
    end
    subgraph "Assessment Stage"
        F
        G
    end
Core Components
1. Audio Input Management
AudioTask Structure: The foundation for audio processing
- Contains audio file paths and associated metadata
- Validates file existence and accessibility automatically
- Supports efficient batch processing for scalability
Input Validation: Ensures data integrity before processing
- File path existence checks using AudioTask.validate() and validate_item()
- Optional metadata validation added by downstream stages (such as duration and format checks)
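The file-existence check described above can be pictured with a small standalone sketch. It does not call NeMo Curator. The helper names `validate_entry` and `filter_valid`, and the default `audio_filepath` key, are illustrative assumptions, not the library's implementation.

```python
import os


def validate_entry(entry: dict, filepath_key: str = "audio_filepath") -> bool:
    """Return True when the entry names an audio file that exists on disk."""
    path = entry.get(filepath_key)
    return isinstance(path, str) and os.path.isfile(path)


def filter_valid(entries: list[dict], filepath_key: str = "audio_filepath") -> list[dict]:
    """Keep only the entries whose audio file passes the existence check."""
    return [entry for entry in entries if validate_entry(entry, filepath_key)]
```

Entries with a missing key or a path that does not resolve to a file are dropped, so later stages only see files they can open.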
2. ASR Model Integration
NeMo Framework Integration: Leverages state-of-the-art ASR models
- Automatic model downloading and caching for convenience
- GPU-accelerated inference when hardware is available
- Support for multilingual and domain-specific model variants
Model Management: Efficient resource usage
- Lazy loading of models to conserve system memory
- Automatic GPU or CPU device selection based on available resources
- Model-level batching handled within NeMo framework
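Lazy loading and device selection can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: `LazyModel`, `pick_device`, and the `loader` callable are hypothetical names, and the rule for choosing a device is a guess at the general idea, not the stage's actual logic.

```python
from typing import Any, Callable, Optional


def pick_device(gpus_requested: float, cuda_available: bool) -> str:
    """Choose "cuda" only when a GPU was requested and one is present."""
    return "cuda" if gpus_requested > 0 and cuda_available else "cpu"


class LazyModel:
    """Defer an expensive model load until the first time it is needed."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self._loader = loader  # hypothetical callable that builds the model
        self._model: Optional[Any] = None

    def get(self) -> Any:
        if self._model is None:  # load once, then reuse the cached instance
            self._model = self._loader()
        return self._model
```

Constructing a `LazyModel` costs nothing. Memory is only consumed on the first `get()` call, and later calls reuse the same instance.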
3. Inference Processing
Batch Processing: Supports processing audio files together
- Audio files are processed together in a single call to the NeMo ASR model
- Batch size configuration controls task grouping for processing using .with_(batch_size=..., resources=Resources(...))
- Internal batching and optimization handled by the NeMo framework
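To make the effect of a batch size concrete, the sketch below groups file paths into fixed-size batches. It is a generic illustration, not the executor's grouping code, and `iter_batches` is a hypothetical helper.

```python
from typing import Iterator, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive groups of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])
```

With five files and a batch size of two, this yields two full batches and a final batch holding the remaining file.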
Output Generation: Structured transcription results
- Clean predicted text extraction from NeMo model outputs
- Complete metadata preservation throughout the processing pipeline
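Metadata preservation can be illustrated by merging predictions into copies of the input entries. The sketch assumes each entry is a dictionary and that predictions arrive in input order. `attach_predictions` is a hypothetical helper, and `pred_text` follows the field name used elsewhere in this guide.

```python
def attach_predictions(
    entries: list[dict], texts: list[str], pred_text_key: str = "pred_text"
) -> list[dict]:
    """Return copies of entries with the predicted text added under pred_text_key."""
    if len(entries) != len(texts):
        raise ValueError("expected exactly one prediction per entry")
    # Copying each entry keeps every original field and leaves the inputs untouched.
    return [{**entry, pred_text_key: text} for entry, text in zip(entries, texts)]
```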
Processing Stages
Stage 1: Data Loading
from nemo_curator.stages.audio.datasets.fleurs.create_initial_manifest import CreateInitialManifestFleursStage
from nemo_curator.stages.text.io.reader import JsonlReader
# Data loading from datasets (e.g., FLEURS)
fleurs_stage = CreateInitialManifestFleursStage(
    lang="en_us",  # Language code
    split="dev",  # Data split
    raw_data_dir="/path/to/data",
)
# Or load from custom manifest files
manifest_reader = JsonlReader(
    input_file_path="/path/to/manifest.jsonl"
)
# Stages automatically create AudioTask objects from loaded data
Stage 2: ASR Model Setup
from nemo_curator.stages.audio.inference.asr_nemo import InferenceAsrNemoStage
# Model initialization
asr_stage = InferenceAsrNemoStage(
    model_name="nvidia/stt_en_fastconformer_hybrid_large_pc"
)
# GPU/CPU device selection (based on configured resources)
device = asr_stage.check_cuda()
# Model loading
asr_stage.setup()  # Downloads and loads model
Stage 3: Transcription Generation
# Don't call process() directly; the Pipeline/Executor handles dispatch.
# `pipeline` and `executor` are assumed to have been created beforehand.
pipeline.add_stage(asr_stage)
results = pipeline.run(executor)
# Output: AudioTask objects with added "pred_text" field
# Each task now contains both original data and predictions
Stage 4: Quality Assessment Integration
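from nemo_curator.stages.audio.common import GetAudioDurationStage
from nemo_curator.stages.audio.metrics.get_wer import GetPairwiseWerStage
# WER calculation
wer_stage = GetPairwiseWerStage(
    text_key="text",
    pred_text_key="pred_text",
    wer_key="wer",
)
# Duration analysis
duration_stage = GetAudioDurationStage(
    audio_filepath_key="audio_filepath",
    duration_key="duration",
)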
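For readers unfamiliar with the metric, word error rate is the word-level edit distance between the reference and the prediction, divided by the number of reference words. The sketch below is a standalone illustration, not the code behind GetPairwiseWerStage. Reporting the result as a percentage is an assumption made here, and `edit_distance` and `wer` are hypothetical helper names.

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_token in enumerate(hyp, start=1):
            cost = 0 if ref_token == hyp_token else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate as a percentage of the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Passing character lists instead of word lists to `edit_distance` gives the character-level counterpart (CER) in the same way.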
Data Flow Architecture
Input Data Flow
- Audio Files → File system
- Manifest Files → JSONL format with metadata
- AudioTask Objects → Validated, structured data containers
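A JSONL manifest holds one JSON object per line. The following sketch parses such text into dictionaries using only the standard library. `parse_manifest` is a hypothetical helper, and any field names in a real manifest depend on the dataset.

```python
import json


def parse_manifest(text: str) -> list[dict]:
    """Parse JSONL manifest text into one dict per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

Each resulting dictionary carries a file path plus whatever metadata the line contained, which is the information an audio task needs.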
Processing Data Flow
- Model Loading → NeMo ASR model initialization
- Batch Creation → Group audio files for efficient processing
- GPU Processing → Transcription generation
- Result Aggregation → Combine transcriptions with metadata
Output Data Flow
- Transcription Results → Predicted text for each audio file
- Quality Metrics → WER, CER, duration, and custom scores
- Filtered Datasets → High-quality audio-text pairs
- Export Formats → JSONL manifests for training workflows
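The last two steps of the output flow, filtering on a quality metric and exporting JSONL, might look like this in plain Python. The threshold value, the `wer` key, and both helper names are assumptions for illustration. Entries without a score are treated as failing the filter.

```python
import json


def filter_by_wer(entries: list[dict], max_wer: float, wer_key: str = "wer") -> list[dict]:
    """Keep entries whose WER is at or below max_wer; unscored entries are dropped."""
    return [e for e in entries if e.get(wer_key, float("inf")) <= max_wer]


def to_jsonl(entries: list[dict]) -> str:
    """Serialize entries as JSONL text, one JSON object per line."""
    return "".join(json.dumps(e, ensure_ascii=False) + "\n" for e in entries)
```

Writing the returned string to a file produces a manifest in the same one-object-per-line layout that the input side reads.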
Performance Characteristics
Scalability Factors
Model Selection Impact:
- Larger models provide better accuracy but require more processing time
- Some NeMo models support streaming, but this stage performs offline transcription
- Language-specific models improve accuracy for target languages
Hardware Usage:
- GPU acceleration typically outperforms CPU processing for larger workloads
- Memory requirements scale proportionally with model size and audio input lengths
Optimization Strategies
Memory Management:
from nemo_curator.stages.resources import Resources
# Optimize for memory-constrained environments
asr_stage = InferenceAsrNemoStage(
    model_name="nvidia/stt_en_fastconformer_hybrid_small"  # Smaller model
).with_(
    resources=Resources(gpus=0.5)  # Request a fractional GPU; allocation is handled by the executor/backend
)
Resource Configuration:
from nemo_curator.stages.resources import Resources
# Configure resources for processing
asr_stage = InferenceAsrNemoStage(
    model_name="nvidia/stt_en_fastconformer_hybrid_large_pc"
).with_(
    resources=Resources(gpus=1.0)  # Dedicated GPU
)
Error Handling and Recovery
Audio Processing Errors
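# Wrap manifest data in an AudioTask so its file path can be validated
# (`audio_data` is assumed to be loaded earlier and to include "audio_filepath")
audio_task = AudioTask(data=audio_data, filepath_key="audio_filepath")
# Validate the audio file exists on disk; skip or filter the task when this is False
is_valid = audio_task.validate()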
Pipeline Recovery
For guidance on resumable processing and recovery at the executor and backend level, refer to Resumable Processing.
Integration Points
Text Processing Integration
The ASR pipeline feeds directly into text processing workflows:
# Audio → Text pipeline
audio_to_text = [
    InferenceAsrNemoStage(model_name="nvidia/stt_en_fastconformer_hybrid_large_pc"),  # Audio → Transcriptions
    AudioToDocumentStage(),  # AudioTask → DocumentBatch
    # Continue with text processing stages...
]