Frame Extraction

Extract frames from clips or full videos at target rates and resolutions. Use frames for embeddings (such as Cosmos‑Embed1), aesthetic filtering, previews, and custom analysis.

Use Cases

Prepare inputs for embedding models that expect frame sequences.
Run aesthetic filtering that operates on sampled frames.
Generate lightweight previews or QA snapshots.
Provide frames for scene-change detection before clipping (TransNetV2).

Before You Start

If you only need saved media files (the clips themselves), frame extraction is optional. Embeddings and aesthetic filtering both require extracted frames.

Quickstart

Use the pipeline stages or the example script flags to extract frames for embeddings, filtering, and analysis.

Pipeline Stage

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.video.clipping.clip_extraction_stages import FixedStrideExtractorStage
from nemo_curator.stages.video.clipping.clip_frame_extraction import ClipFrameExtractionStage
from nemo_curator.utils.decoder_utils import FrameExtractionPolicy, FramePurpose
from nemo_curator.stages.video.embedding.cosmos_embed1 import (
    CosmosEmbed1FrameCreationStage,
    CosmosEmbed1EmbeddingStage,
)

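pipe = Pipeline(name="clip_frames_embeddings")
# Note: a stage that reads the input videos must come before the clipping stage; it is omitted here.
# Split each video into fixed 10-second clips with no overlap (stride equals clip length).
pipe.add_stage(FixedStrideExtractorStage(clip_len_s=10.0, clip_stride_s=10.0))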
pipe.add_stage(
    ClipFrameExtractionStage(
        extraction_policies=(FrameExtractionPolicy.sequence,),
        extract_purposes=(FramePurpose.EMBEDDINGS,),
        target_res=(-1, -1),
        verbose=True,
    )
)
pipe.add_stage(CosmosEmbed1FrameCreationStage(model_dir="/models", variant="224p", target_fps=2.0, verbose=True))
pipe.add_stage(CosmosEmbed1EmbeddingStage(model_dir="/models", variant="224p", gpu_memory_gb=20.0, verbose=True))
pipe.run()

Script Flags

# Clip frames are extracted implicitly when you generate embeddings or aesthetic scores
python tutorials/video/getting-started/video_split_clip_example.py \
  ... \
  --generate-embeddings \
  --clip-extraction-target-res -1

# Full-video frame extraction for TransNetV2 scene-change detection
python tutorials/video/getting-started/video_split_clip_example.py \
  ... \
  --splitting-algorithm transnetv2 \
  --transnetv2-frame-decoder-mode pynvc

Options in NeMo Curator

NeMo Curator provides two complementary stages:

ClipFrameExtractionStage: Extracts frames from already‑split clips. Supports several target FPS values and computes an LCM rate to reduce decode work.
VideoFrameExtractionStage: Extracts frames from full videos (for example, before scene‑change detection). Supports PyNvCodec (NVDEC) or ffmpeg CPU/GPU decode.

Extract Frames

From Clips

from nemo_curator.stages.video.clipping.clip_frame_extraction import (
    ClipFrameExtractionStage,
)
from nemo_curator.utils.decoder_utils import FrameExtractionPolicy, FramePurpose

extract_frames = ClipFrameExtractionStage(
    extraction_policies=(FrameExtractionPolicy.sequence,),
    extract_purposes=(FramePurpose.EMBEDDINGS,),  # sets default FPS if target_fps not provided
    target_res=(-1, -1),  # keep original resolution
    # target_fps=[1, 2],  # optional: override with explicit FPS values
    verbose=True,
)

From Full Videos (Scene Change)

from nemo_curator.stages.video.clipping.video_frame_extraction import VideoFrameExtractionStage

frame_extractor = VideoFrameExtractionStage(
    decoder_mode="pynvc",  # or "ffmpeg_gpu", "ffmpeg_cpu"
    output_hw=(27, 48),    # (height, width) for frame extraction
    pyncv_batch_size=64,   # batch size for PyNvCodec
    verbose=True,
)

Parameters

Parameter	Description
`extraction_policies`	Frame selection strategy. Use `sequence` for uniform sampling. `middle` selects a single middle frame.
`target_fps`	For clips: sampling rate in frames per second. If you provide several integer values, the stage uses LCM sampling.
`extract_purposes`	Shortcut that sets default FPS for specific purposes (such as embeddings). You can still pass `target_fps` to override.
`target_res`	Output frame resolution `(height, width)`. Use `(-1, -1)` to keep original.
`num_cpus`	Number of CPU cores for frame extraction. Default: `3`.
`decoder_mode`	For full‑video extraction: `pynvc` (NVDEC), `ffmpeg_gpu`, or `ffmpeg_cpu`.
`output_hw`	For full‑video extraction: `(height, width)` tuple for frame dimensions. Default: `(27, 48)`.
`pyncv_batch_size`	For full‑video extraction: batch size for PyNvCodec processing. Default: `64`.
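
To make the difference between the `sequence` and `middle` policies concrete, here is a minimal sketch of which decoded frame indices each policy might keep. It does not call NeMo Curator: `Policy` and `select_frame_indices` are hypothetical names introduced for illustration, and the stage's real index selection may differ at clip boundaries.

```python
from enum import Enum


class Policy(Enum):
    """Hypothetical stand-in for FrameExtractionPolicy (illustration only)."""

    SEQUENCE = "sequence"
    MIDDLE = "middle"


def select_frame_indices(num_frames, source_fps, policy, target_fps=1.0):
    """Return the indices of the decoded frames that a policy would keep."""
    if num_frames <= 0:
        return []
    if policy is Policy.MIDDLE:
        # A single frame from the middle of the clip.
        return [num_frames // 2]
    # Uniform sampling: keep one frame every source_fps / target_fps frames.
    # The step is clamped to 1 so a target above the source rate keeps every frame.
    step = max(1.0, source_fps / target_fps)
    indices = []
    i = 0
    while int(i * step) < num_frames:
        indices.append(int(i * step))
        i += 1
    return indices


# A 10-second clip recorded at 30 FPS contains 300 frames.
sequence_indices = select_frame_indices(300, 30.0, Policy.SEQUENCE, target_fps=2.0)
middle_indices = select_frame_indices(300, 30.0, Policy.MIDDLE)
```

In this sketch, `sequence` at 2 FPS keeps every 15th frame of the 30 FPS clip, while `middle` keeps one frame regardless of `target_fps`.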

LCM Sampling for Several FPS Values

If you provide several integer target_fps values (such as 1 and 2), the clip stage decodes once at the least common multiple (LCM) of those rates and then keeps every k‑th frame, where k is the LCM divided by the target rate, to produce each output. This avoids decoding the clip once per rate.

ClipFrameExtractionStage(
    extraction_policies=(FrameExtractionPolicy.sequence,),
    target_fps=[1, 2],  # LCM = 2; decode once at 2 FPS, then subsample
)
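
The arithmetic behind LCM sampling can be sketched in plain Python. This is an illustration of the idea, not the stage's implementation: `lcm_sampling_plan` and `subsample` are hypothetical helpers, and integers stand in for decoded frames.

```python
import math


def lcm_sampling_plan(target_fps):
    """Return the shared decode rate and the subsampling stride for each target rate."""
    if not target_fps or any(fps <= 0 for fps in target_fps):
        raise ValueError("target_fps must contain positive integers")
    decode_fps = math.lcm(*target_fps)  # requires Python 3.9+
    strides = {fps: decode_fps // fps for fps in target_fps}
    return decode_fps, strides


def subsample(frames, stride):
    """Keep every stride-th frame, starting with the first."""
    return frames[::stride]


decode_fps, strides = lcm_sampling_plan([1, 2])

# Stand-in for a 10-second clip decoded once at the shared rate.
decoded = list(range(10 * decode_fps))
per_rate = {fps: subsample(decoded, k) for fps, k in strides.items()}
```

For `[1, 2]` the shared rate is 2 FPS with strides of 2 and 1. Co-prime rates raise the shared rate: `[2, 3]` decodes at 6 FPS, which is still a single decode pass but at a higher rate than either target.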

Hardware and Performance

Prefer pynvc (NVDEC) or ffmpeg_gpu for high throughput when GPU hardware is available; otherwise use ffmpeg_cpu.
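Use batching where applicable (for example, pyncv_batch_size for PyNvCodec) and monitor CPU, GPU, and memory use per worker.
If memory is constrained, extract frames at a reduced resolution by setting target_res (clips) or output_hw (full videos) instead of keeping the original resolution.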
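
The decoder preference above can be expressed as a small selection helper. This is a sketch under stated assumptions: `choose_decoder_mode`, `detect_gpu`, and `detect_pynvc` are hypothetical helpers, not part of NeMo Curator; the presence of `nvidia-smi` is only a rough proxy for a usable GPU; and `PyNvCodec` is assumed to be the importable module name for the PyNvCodec bindings.

```python
import importlib.util
import shutil


def choose_decoder_mode(gpu_available, pynvc_available):
    """Pick a decoder_mode string following the preference order described above."""
    if gpu_available and pynvc_available:
        return "pynvc"
    if gpu_available:
        return "ffmpeg_gpu"
    return "ffmpeg_cpu"


def detect_gpu():
    """Rough heuristic: treat a visible nvidia-smi binary as 'GPU present'."""
    return shutil.which("nvidia-smi") is not None


def detect_pynvc():
    """Assumption: the PyNvCodec bindings are importable as 'PyNvCodec'."""
    return importlib.util.find_spec("PyNvCodec") is not None


decoder_mode = choose_decoder_mode(detect_gpu(), detect_pynvc())
```

The resulting string could then be passed as `decoder_mode` to `VideoFrameExtractionStage`. Keeping detection separate from selection makes the preference order easy to check without GPU hardware.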

Downstream Dependencies

Embeddings: Cosmos‑Embed1 expects frames at specific rates. Refer to Embeddings.
Aesthetic Filtering: Requires frames extracted earlier. Refer to Filtering.
Clipping with TransNetV2: Uses full‑video frame extraction before scene‑change detection. Refer to Clipping.

Troubleshooting

“Frame extraction failed”: Check that the selected decoder mode is available on the worker. Confirm that ffmpeg is installed and, for GPU modes, that the GPU drivers are installed and working.
Not enough frames for embeddings: Increase target_fps or adjust clip length; certain embedding stages can re‑extract at a higher rate when needed.
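
When an embedding stage reports too few frames, a quick estimate shows whether the clip length and sampling rate can supply enough. The helpers below are hypothetical and use simple arithmetic (frames are roughly clip length times rate); the stage's exact frame count may differ by a frame at clip boundaries, and the required frame count in the example is a placeholder, not a documented model requirement.

```python
import math


def frames_per_clip(clip_len_s, target_fps):
    """Approximate number of frames sampled from a clip at the given rate."""
    return math.floor(clip_len_s * target_fps)


def min_fps_for(required_frames, clip_len_s):
    """Lowest sampling rate that yields at least required_frames from the clip."""
    if clip_len_s <= 0:
        raise ValueError("clip_len_s must be positive")
    return required_frames / clip_len_s


# A 10-second clip sampled at 2 FPS.
available = frames_per_clip(10.0, 2.0)

# Placeholder requirement: 8 frames from a 2-second clip.
needed_fps = min_fps_for(8, 2.0)
```

If `available` falls below what the model needs, either raise `target_fps` to at least the computed minimum or lengthen the clips.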