Audio Curation Concepts
This guide covers the essential concepts for audio data curation in NVIDIA NeMo Curator. It assumes basic familiarity with speech processing and machine learning principles.
Core Concept Areas
Audio curation in NVIDIA NeMo Curator focuses on these key areas:
Modality-level overview of ingest, validation, optional ASR, metrics, filtering, and export. Tags: overview, map.
AudioTask Structure: Understanding the AudioTask data structure and audio file management. Tags: data-structures, validation.
ASR Pipeline: Comprehensive overview of the automatic speech recognition pipeline and workflow. Tags: overview, architecture.
Quality Metrics: Core concepts for evaluating speech transcription quality and audio characteristics. Tags: wer, cer, metrics.
Dataset Manifests and Ingest: Concepts for constructing manifests and ingesting audio datasets. Tags: manifests, ingest.
ALM Pipeline: Audio Language Model data curation pipeline for extracting training windows from diarized segments. Tags: alm, windowing, speaker-diarization.
Text Integration: Concepts for integrating audio processing with text curation workflows. Tags: multimodal, integration.
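To make the quality-metric and manifest ideas above more concrete, here is a minimal sketch in plain Python. It does not use NeMo Curator's API. The manifest field names (`audio_filepath`, `duration`, `text`, `pred_text`), the sample entries, the percentage scale, and the 50% threshold are all assumptions chosen for illustration. The sketch computes word error rate (WER) and character error rate (CER) as an edit distance divided by the reference length, then filters manifest entries by WER.

```python
import json


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate as a percentage of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate as a percentage of reference characters."""
    if not reference:
        raise ValueError("reference must contain at least one character")
    return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)


# Hypothetical JSONL manifest: one JSON object per line (field names assumed).
manifest_lines = [
    '{"audio_filepath": "a.wav", "duration": 2.0, "text": "the cat sat", "pred_text": "the cat sat"}',
    '{"audio_filepath": "b.wav", "duration": 3.1, "text": "hello big world", "pred_text": "hello world"}',
    '{"audio_filepath": "c.wav", "duration": 1.4, "text": "one two", "pred_text": "three four five"}',
]

entries = [json.loads(line) for line in manifest_lines]
for entry in entries:
    entry["wer"] = wer(entry["text"], entry["pred_text"])

# Keep only entries whose WER is at or below an illustrative threshold.
MAX_WER = 50.0
kept = [e for e in entries if e["wer"] <= MAX_WER]
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is longer than the reference, which is why the third sample entry is dropped by the filter.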
Infrastructure Components
The audio curation concepts build on NVIDIA NeMo Curator’s core infrastructure components, which are shared across all modalities. These components include:
Optimize memory usage when processing large audio datasets. Tags: partitioning, batching, monitoring.
GPU Acceleration: Leverage NVIDIA GPUs for faster ASR inference and audio processing. Tags: cuda, nemo-toolkit, performance.
Resumable Processing: Continue interrupted operations across large audio datasets. Tags: checkpoints, recovery, batching.
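The batching and resumable-processing ideas listed above can be illustrated with a small, generic sketch. This is not how NeMo Curator implements checkpoints. The JSONL checkpoint layout, the helper names (`batched`, `load_done`, `process_resumable`), and the placeholder `fake_transcribe` function are hypothetical. The sketch processes files in fixed-size batches, appends each finished batch to a checkpoint file, and skips already-recorded files on the next run.

```python
import json
import os
import tempfile


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def load_done(checkpoint_path):
    """Return the set of file paths already recorded in the checkpoint."""
    if not os.path.exists(checkpoint_path):
        return set()
    with open(checkpoint_path, encoding="utf-8") as f:
        return {json.loads(line)["audio_filepath"] for line in f if line.strip()}


def process_resumable(paths, checkpoint_path, process_fn, batch_size=2):
    """Process only files missing from the checkpoint, one batch at a time."""
    done = load_done(checkpoint_path)
    pending = [p for p in paths if p not in done]
    processed = []
    for batch in batched(pending, batch_size):
        results = [process_fn(p) for p in batch]
        # Append after each batch so an interruption loses at most one batch.
        with open(checkpoint_path, "a", encoding="utf-8") as f:
            for path, result in zip(batch, results):
                f.write(json.dumps({"audio_filepath": path, "result": result}) + "\n")
        processed.extend(batch)
    return processed


def fake_transcribe(path):
    """Placeholder for real work such as ASR inference (hypothetical)."""
    return path.upper()


workdir = tempfile.mkdtemp()
checkpoint = os.path.join(workdir, "done.jsonl")
files = ["a.wav", "b.wav", "c.wav", "d.wav", "e.wav"]

# Simulate an interrupted run that only reached the first three files.
first_run = process_resumable(files[:3], checkpoint, fake_transcribe)
# The second run sees the full list but skips what is already recorded.
second_run = process_resumable(files, checkpoint, fake_transcribe)
```

Writing the checkpoint once per batch rather than once per file trades a small amount of repeated work after a crash for fewer file writes. The batch size also bounds how many items are held in memory at a time.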