Process Data
Use NeMo Curator stages to split videos into clips, encode them, generate embeddings or captions, and remove duplicates.
How it Works
Create a Pipeline and add stages for clip extraction, optional re-encoding and filtering, embeddings or captions, previews, and writing outputs. Each stage is modular and configurable to match your quality and performance needs.
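The stage-by-stage composition described above can be pictured with a small, self-contained sketch. It does not use the NeMo Curator API: `SketchStage`, `SketchPipeline`, `split_fixed_stride`, and `drop_short_clips` are hypothetical stand-ins, and the 10-second stride and 2-second minimum length are assumed values chosen only to show how modular stages chain together.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-ins for illustration only; these are NOT NeMo Curator classes.
@dataclass
class SketchStage:
    name: str
    fn: Callable[[List[Dict]], List[Dict]]


@dataclass
class SketchPipeline:
    stages: List[SketchStage] = field(default_factory=list)

    def add_stage(self, stage: SketchStage) -> "SketchPipeline":
        # Stages run in the order they are added.
        self.stages.append(stage)
        return self

    def run(self, items: List[Dict]) -> List[Dict]:
        # Each stage consumes the previous stage's output.
        for stage in self.stages:
            items = stage.fn(items)
        return items


def split_fixed_stride(videos: List[Dict], stride_s: float = 10.0) -> List[Dict]:
    # Cut each video into stride-length clips; the final clip may be shorter.
    clips = []
    for video in videos:
        start = 0.0
        while start < video["duration"]:
            end = min(start + stride_s, video["duration"])
            clips.append({"video": video["name"], "start": start, "end": end})
            start = end
    return clips


def drop_short_clips(clips: List[Dict], min_len_s: float = 2.0) -> List[Dict]:
    # Keep only clips that are at least min_len_s seconds long.
    return [c for c in clips if c["end"] - c["start"] >= min_len_s]


sketch = SketchPipeline()
sketch.add_stage(SketchStage("split", split_fixed_stride))
sketch.add_stage(SketchStage("filter", drop_short_clips))
result = sketch.run([{"name": "a.mp4", "duration": 21.0}])
```

For a 21-second input, the split stage yields three clips and the filter stage drops the trailing 1-second clip, so two clips remain. Swapping or reordering stages changes the output without touching the other stages, which is the property the real pipeline stages are meant to provide.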
Processing Options
Choose from the following stages to split, encode, filter, embed, caption, and preview your videos, and to remove duplicate clips:
Split long videos into shorter clips using fixed stride or scene-change detection. (Tags: clips, fixed-stride, transnetv2)
Encode Clips: Encode clips to H.264 using CPU or GPU encoders and tune performance. (Tags: clips, h264_nvenc)
Filter Clips and Frames: Apply motion-based filtering and aesthetic filtering to improve dataset quality. (Tags: clips, frames, motion, aesthetic)
Extract Frames: Extract frames from clips or full videos for embeddings, filtering, and analysis. (Tags: frames, fps)
Create Embeddings: Generate clip-level embeddings with Cosmos-Embed1 for search and duplicate removal. (Tags: clips, cosmos-embed1)
Create Captions & Preview: Produce clip captions and optional preview images for review workflows. (Tags: clips, frames, captions, preview)
Remove Duplicate Embeddings: Remove near-duplicates using semantic clustering and similarity with generated embeddings. (Tags: clips, semantic, pairwise)
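To make the pairwise-similarity idea behind duplicate removal concrete, here is a minimal sketch in plain Python. It is a simplified illustration, not the NeMo Curator implementation: it skips the semantic clustering step entirely, the function names are hypothetical, and the 0.95 cosine-similarity threshold is an assumed value.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    # Cosine of the angle between two vectors; 0.0 if either has zero length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def keep_indices(embeddings: List[Sequence[float]], threshold: float = 0.95) -> List[int]:
    # Greedy filter: keep an embedding only if it is below the similarity
    # threshold against every embedding that has already been kept.
    kept: List[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine_similarity(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy 2-D embeddings: the second is nearly identical to the first.
embeddings = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]]
kept = keep_indices(embeddings)
```

Here the second embedding is dropped as a near-duplicate of the first, and the third is kept. A greedy all-pairs comparison like this scales quadratically with the number of clips, which is one reason to cluster embeddings first and compare only within clusters.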
Write Outputs
Persist clips, embeddings, previews, and metadata at the end of the pipeline using ClipWriterStage. Refer to Save & Export for directory layout and examples.
Example (place as the final stage):
from nemo_curator.stages.video.io.clip_writer import ClipWriterStage

# Assumes `pipeline`, OUT_DIR, and VIDEO_DIR are already defined earlier in your script.
pipeline.add_stage(
    ClipWriterStage(
        output_path=OUT_DIR,
        input_path=VIDEO_DIR,
        upload_clips=True,
        dry_run=False,
        generate_embeddings=True,
        generate_previews=False,
        generate_captions=False,
        embedding_algorithm="cosmos-embed1-224p",
        caption_models=[],
        enhanced_caption_models=[],
        verbose=True,
    )
)
Path helpers are available to resolve common locations (such as clips/, filtered_clips/, previews/, metas/v0/, and ce1_embd_parquet/).
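As a rough picture of how those locations relate to the output root, the sketch below joins the directory names listed above onto a base path. The `SUBDIRS` mapping, its keys, and `resolve_output_dir` are hypothetical names used for illustration; they are not the library's path helpers, so check the ClipWriterStage reference for the actual helper names and signatures.

```python
from pathlib import PurePosixPath

# Directory names taken from the list above; the dictionary keys are
# hypothetical labels chosen for this sketch.
SUBDIRS = {
    "clips": "clips",
    "filtered_clips": "filtered_clips",
    "previews": "previews",
    "metas": "metas/v0",
    "embeddings": "ce1_embd_parquet",
}


def resolve_output_dir(output_root: str, kind: str) -> str:
    # Hypothetical helper: join the output root with a known subdirectory.
    if kind not in SUBDIRS:
        raise KeyError(f"unknown output kind: {kind!r}")
    return str(PurePosixPath(output_root) / SUBDIRS[kind])


metas_dir = resolve_output_dir("/data/out", "metas")
embeddings_dir = resolve_output_dir("/data/out", "embeddings")
```

With an output root of `/data/out`, this resolves metadata to `/data/out/metas/v0` and embeddings to `/data/out/ce1_embd_parquet`.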