Save and Export
NeMo Curator writes clips, metadata, previews, and embeddings to a structured output directory. Use this guide to add the writer to your pipeline, understand the directories it creates, and prepare artifacts for training.
Writer Stage
Use ClipWriterStage as the final stage in your pipeline.
from nemo_curator.stages.video.io.clip_writer import ClipWriterStage

pipeline.add_stage(
    ClipWriterStage(
        output_path=OUT_DIR,
        input_path=VIDEO_DIR,
        upload_clips=True,
        dry_run=False,
        generate_embeddings=True,
        generate_previews=False,
        generate_captions=False,
        embedding_algorithm="cosmos-embed1-224p",
        caption_models=["qwen"],
        enhanced_caption_models=["qwen_lm"],
        verbose=True,
    )
)
Parameters
| Parameter | Type | Description |
|---|---|---|
| output_path | str | Base directory or URI for outputs. |
| input_path | str | Root of input videos; used to derive processed metadata paths. Must be a prefix of input video paths. |
| upload_clips | bool | Write .mp4 clips to clips/ and filtered clips to filtered_clips/. |
| dry_run | bool | Skip writing clip bytes, preview images, embeddings, and per-clip metadata. The stage still writes video-level and chunk-level metadata. |
| generate_embeddings | bool | When true, the stage logs errors if embeddings for the selected algorithm are missing. When embeddings exist, the stage writes per-clip pickles and per-chunk Parquet files. |
| generate_previews | bool | When true, the stage logs errors for missing preview bytes and writes .webp images when present. |
| generate_captions | bool | The stage includes captions in metadata when upstream stages provide them. |
| embedding_algorithm | str | Accepted: cosmos-embed1-224p, cosmos-embed1-336p, or cosmos-embed1-448p. Default: cosmos-embed1-224p. |
| caption_models | list[str] \| None | Ordered caption models to emit. Use [] when not using captions. |
| enhanced_caption_models | list[str] \| None | Ordered enhancement models to emit. Use [] when not using enhanced captions. |
| verbose | bool | Emit detailed logs. |
| max_workers | int | Thread pool size for writing. |
| log_stats | bool | Reserved for future detailed stats logging. |
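
The prefix requirement on input_path can be checked before the pipeline runs. The following is a minimal sketch, not part of NeMo Curator: relative_video_path is a hypothetical helper, and it assumes POSIX-style local paths rather than URIs.

```python
from pathlib import PurePosixPath


def relative_video_path(video_path: str, input_path: str) -> str:
    """Return video_path relative to input_path, or raise ValueError.

    Hypothetical helper for illustration; not a NeMo Curator API.
    """
    try:
        # relative_to compares whole path components, so "/data/videos2"
        # is correctly rejected as a child of "/data/videos".
        return str(PurePosixPath(video_path).relative_to(PurePosixPath(input_path)))
    except ValueError as exc:
        raise ValueError(
            f"input_path {input_path!r} is not a prefix of {video_path!r}"
        ) from exc


if __name__ == "__main__":
    print(relative_video_path("/data/videos/set_a/vid.mp4", "/data/videos"))
```

Comparing path components instead of raw strings avoids accepting a sibling directory whose name merely starts with the same characters.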
Output Directories
The writer produces these directories under output_path:
- clips/: Encoded clip media (.mp4).
- filtered_clips/: Media for filtered-out clips.
- previews/: Preview images (.webp).
- metas/v0/: Per-clip metadata (.json).
- ce1_embd/: Per-clip embeddings (.pickle).
- ce1_embd_parquet/: Parquet batches with columns id and embedding.
- processed_videos/, processed_clip_chunks/: Video-level metadata and per-chunk statistics.
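
After a run, it can help to confirm which of these directories were actually produced. The sketch below is an illustration under stated assumptions: summarize_output is a hypothetical helper (not a NeMo Curator function), the directory names are copied from the list above, and output_path is assumed to be a local filesystem path rather than a URI.

```python
from __future__ import annotations

from pathlib import Path

# Directory names as listed in this guide; metas/v0 holds per-clip metadata.
EXPECTED_DIRS = [
    "clips",
    "filtered_clips",
    "previews",
    "metas/v0",
    "ce1_embd",
    "ce1_embd_parquet",
    "processed_videos",
    "processed_clip_chunks",
]


def summarize_output(output_path: str) -> dict[str, int | None]:
    """Map each expected directory to its file count, or None if it is absent."""
    root = Path(output_path)
    summary: dict[str, int | None] = {}
    for name in EXPECTED_DIRS:
        directory = root / name
        if directory.is_dir():
            # Count files recursively in case outputs are nested in subfolders.
            summary[name] = sum(1 for p in directory.rglob("*") if p.is_file())
        else:
            summary[name] = None
    return summary


if __name__ == "__main__":
    for name, count in summarize_output("/outputs").items():
        print(f"{name}: {'missing' if count is None else count}")
```

A missing directory is not necessarily an error. For example, with dry_run=True the per-clip outputs are skipped.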
Per-Clip Metadata
Each clip writes a JSON file under metas/v0/ with clip- and window-level fields:
{
  "span_uuid": "d2d0b3d1-...",
  "source_video": "/data/videos/vid.mp4",
  "duration_span": [0.0, 5.0],
  "width_source": 1920,
  "height_source": 1080,
  "framerate_source": 30.0,
  "clip_location": "/outputs/clips/d2/d2d0b3d1-....mp4",
  "motion_score": { "global_mean": 0.51, "per_patch_min_256": 0.29 },
  "aesthetic_score": 0.72,
  "windows": [
    {
      "start_frame": 0,
      "end_frame": 30,
      "qwen_caption": "A person walks across a room",
      "qwen_lm_enhanced_caption": "A person briskly crosses a bright modern room"
    }
  ],
  "valid": true
}
- Caption keys follow <model>_caption and <model>_enhanced_caption, based on caption_models and enhanced_caption_models.
- With dry_run=True, per-clip metadata is not written. Video- and chunk-level metadata are still written.
- The stage writes video-level metadata and per-chunk stats to processed_videos/ and processed_clip_chunks/.
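
When preparing training data, the per-clip JSON files can be read back to pair each window with its caption. The following is a minimal sketch using only the fields shown in the example above. iter_captions is a hypothetical helper, and skipping clips whose valid field is false is an assumption about how that flag should be used.

```python
import json
from pathlib import Path


def iter_captions(metas_dir: str, model: str = "qwen"):
    """Yield (span_uuid, start_frame, end_frame, caption) per captioned window.

    Hypothetical helper for illustration; not a NeMo Curator API.
    """
    key = f"{model}_caption"  # key pattern: <model>_caption
    for meta_file in sorted(Path(metas_dir).rglob("*.json")):
        with meta_file.open(encoding="utf-8") as f:
            meta = json.load(f)
        # Assumption: clips marked valid=false should be left out.
        if not meta.get("valid", False):
            continue
        for window in meta.get("windows", []):
            if key in window:
                yield (
                    meta["span_uuid"],
                    window["start_frame"],
                    window["end_frame"],
                    window[key],
                )


if __name__ == "__main__":
    for row in iter_captions("/outputs/metas/v0"):
        print(row)
```

To read enhanced captions instead, the same loop would look up the <model>_enhanced_caption key, for example qwen_lm_enhanced_caption.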
Embeddings and Parquet Outputs
- When embeddings exist, the stage writes per-clip .pickle files under ce1_embd/.
- The stage also batches embeddings per clip chunk and writes them as Parquet files under ce1_embd_parquet/ with columns id and embedding.
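
The per-clip pickles can be loaded back with the Python standard library. This sketch makes two assumptions that the guide does not confirm: each .pickle file holds the embedding for one clip, and the file stem identifies that clip. load_clip_embeddings is a hypothetical helper, and the payload type is returned as-is without inspection.

```python
import pickle
from pathlib import Path


def load_clip_embeddings(embd_dir: str) -> dict:
    """Map each pickle file's stem to its unpickled payload.

    Hypothetical helper for illustration; not a NeMo Curator API.
    Only unpickle files that come from a trusted source.
    """
    embeddings = {}
    for path in sorted(Path(embd_dir).rglob("*.pickle")):
        with path.open("rb") as f:
            # Assumption: one payload per file, keyed by the file stem.
            embeddings[path.stem] = pickle.load(f)
    return embeddings


if __name__ == "__main__":
    loaded = load_clip_embeddings("/outputs/ce1_embd")
    print(f"loaded {len(loaded)} embeddings")
```

For bulk access, the Parquet batches under ce1_embd_parquet/ are likely the more convenient source, since they already pair id and embedding in columns.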
Helpers
Resolve Paths Programmatically
Use helpers to construct paths consistently:
from nemo_curator.stages.video.io.clip_writer import ClipWriterStage

# Base output directory, the same value passed to ClipWriterStage as output_path.
OUT = "/outputs"

clips_dir = ClipWriterStage.get_output_path_clips(OUT)
filtered_clips_dir = ClipWriterStage.get_output_path_clips(OUT, filtered=True)
previews_dir = ClipWriterStage.get_output_path_previews(OUT)
metas_dir = ClipWriterStage.get_output_path_metas(OUT, "v0")
ce1_parquet_dir = ClipWriterStage.get_output_path_ce1_embd_parquet(OUT)
processed_videos_dir = ClipWriterStage.get_output_path_processed_videos(OUT)
processed_chunks_dir = ClipWriterStage.get_output_path_processed_clip_chunks(OUT)