Captions and Preview

Prepare inputs, generate captions, optionally enhance them, and produce preview images.

Choosing a Captioning Model

The video captioning pipeline supports two model families. Pick a variant based on quality, GPU memory, and throughput:

| Variant | Model | Default Use Case |
| --- | --- | --- |
| qwen | Qwen/Qwen2.5-VL-7B-Instruct | Default; good balance of quality and throughput |
| nemotron / nemotron-bf16 | Nemotron Nano 12B v2 VL (BF16) | High-quality captions; auto-downloaded from Hugging Face |
| nemotron-fp8 | Nemotron Nano 12B v2 VL (FP8) | Same model, FP8-quantized for lower memory |
| nemotron-nvfp4 | Nemotron Nano 12B v2 VL (NVFP4-QAD) | NVFP4 quantization-aware-distilled checkpoint |

Caption enhancement (the optional second-pass LLM rewrite) uses Qwen-LM (--enhance-captions-algorithm qwen_lm).
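
When a variant name comes from user input or a config file, it can help to validate it before building the pipeline. The following is a minimal sketch that copies the variant names and model labels from the table above; `CAPTION_VARIANTS` and `check_variant` are hypothetical helpers for illustration and are not part of NeMo Curator.

```python
# Variant names and model labels copied from the table above.
CAPTION_VARIANTS = {
    "qwen": "Qwen/Qwen2.5-VL-7B-Instruct",
    "nemotron": "Nemotron Nano 12B v2 VL (BF16)",
    "nemotron-bf16": "Nemotron Nano 12B v2 VL (BF16)",
    "nemotron-fp8": "Nemotron Nano 12B v2 VL (FP8)",
    "nemotron-nvfp4": "Nemotron Nano 12B v2 VL (NVFP4-QAD)",
}


def check_variant(name: str) -> str:
    """Return the model label for a variant, or raise on an unknown name.

    Hypothetical helper for illustration; not part of NeMo Curator.
    """
    try:
        return CAPTION_VARIANTS[name]
    except KeyError:
        valid = ", ".join(sorted(CAPTION_VARIANTS))
        raise ValueError(f"unknown variant {name!r}; expected one of: {valid}") from None


print(check_variant("nemotron-fp8"))
```

Failing early on a misspelled variant is cheaper than discovering it after model weights have started downloading.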


Quickstart

Use the pipeline stages or the example script flags to prepare captions and preview images.

Pipeline Stage

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.video.caption.caption_preparation import CaptionPreparationStage
from nemo_curator.stages.video.caption.caption_generation import CaptionGenerationStage
from nemo_curator.stages.video.caption.caption_enhancement import CaptionEnhancementStage
from nemo_curator.stages.video.preview.preview import PreviewStage

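# The caption stages operate on clips, so stages that read videos and extract
# clips must be added to the pipeline before the ones shown here (omitted).
pipe = Pipeline(name="captions_preview")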
pipe.add_stage(
    CaptionPreparationStage(
        model_variant="qwen",
        prompt_variant="default",
        prompt_text=None,
        sampling_fps=2.0,
        window_size=256,
        remainder_threshold=128,
        preprocess_dtype="float16",
        model_does_preprocess=False,
        generate_previews=True,
        verbose=True,
    )
)
pipe.add_stage(PreviewStage(target_fps=1.0, target_height=240, verbose=True))
pipe.add_stage(
    CaptionGenerationStage(
        model_dir="/models",
        model_variant="qwen",
        caption_batch_size=8,
        fp8=False,
        max_output_tokens=512,
        model_does_preprocess=False,
        generate_stage2_caption=False,
        stage2_prompt_text=None,
        disable_mmcache=True,
    )
)
pipe.run()

To use Nemotron instead, set model_variant="nemotron" (or one of nemotron-bf16, nemotron-fp8, nemotron-nvfp4) on both CaptionPreparationStage and CaptionGenerationStage — Nemotron weights are auto-downloaded from Hugging Face on first use.

Script Flags

python tutorials/video/getting-started/video_split_clip_example.py \
  ... \
  --generate-captions \
  --captioning-algorithm qwen \
  --captioning-window-size 256 \
  --captioning-remainder-threshold 128 \
  --captioning-sampling-fps 2.0 \
  --captioning-preprocess-dtype float16 \
  --captioning-batch-size 8 \
  --captioning-max-output-tokens 512 \
  --generate-previews \
  --preview-target-fps 1.0 \
  --preview-target-height 240

--captioning-algorithm accepts: qwen (default, Qwen2.5-VL-7B-Instruct), nemotron, nemotron-bf16, nemotron-fp8, nemotron-nvfp4. To enable caption enhancement with the Qwen LM, also pass --enhance-captions --enhance-captions-algorithm qwen_lm.

Preparation and previews

  1. Prepare caption inputs from each clip window. This step splits clips into fixed windows, formats model‑ready inputs for the chosen VLM (Qwen‑VL or Nemotron), and optionally stores per‑window mp4 bytes for previews.

    from nemo_curator.stages.video.caption.caption_preparation import CaptionPreparationStage
    from nemo_curator.stages.video.preview.preview import PreviewStage
    
    prep = CaptionPreparationStage(
        model_variant="qwen",  # or "nemotron" / "nemotron-fp8" / ...
        prompt_variant="default",
        prompt_text=None,
        sampling_fps=2.0,
        window_size=256,
        remainder_threshold=128,
        preprocess_dtype="float16",
        model_does_preprocess=False,
        generate_previews=True,
        verbose=True,
    )
  2. Optionally generate .webp previews from each window’s mp4 bytes for quick QA and review.

    preview = PreviewStage(
        target_fps=1.0,
        target_height=240,
        verbose=True,
    )

Parameters

CaptionPreparationStage

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_variant | str | "qwen" | Vision‑language model used to format inputs. One of qwen, nemotron, nemotron-bf16, nemotron-fp8, nemotron-nvfp4. |
| prompt_variant | str (e.g. "default", "av-surveillance") | "default" | Built‑in prompt to steer caption content when prompt_text is not provided. |
| prompt_text | str \| None | None | Custom prompt text. When set, overrides prompt_variant. |
| sampling_fps | float | 2.0 | Source sampling rate for creating per‑window inputs. |
| window_size | int | 256 | Number of frames per window before captioning. |
| remainder_threshold | int | 128 | Minimum leftover frames required to create a final shorter window. |
| model_does_preprocess | bool | False | Whether the downstream model performs its own preprocessing. |
| preprocess_dtype | str | "float32" | Data type for any preprocessing performed here. |
| generate_previews | bool | True | When True, return per‑window mp4 bytes to enable preview generation. |
| verbose | bool | False | Log additional setup and per‑clip details. |
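
To make the interaction between window_size and remainder_threshold concrete, here is a minimal sketch of the splitting rule as the parameter descriptions state it. `split_into_windows` is a hypothetical function, not the stage's actual code. It assumes that leftover frames below the threshold are dropped and that a clip shorter than the threshold yields no window; the real stage may handle those cases differently.

```python
def split_into_windows(
    total_frames: int,
    window_size: int = 256,
    remainder_threshold: int = 128,
) -> list[tuple[int, int]]:
    """Return (start, end) frame ranges with an exclusive end.

    Hypothetical illustration of the windowing rule; not NeMo Curator code.
    """
    if total_frames <= 0:
        return []
    full_windows, leftover = divmod(total_frames, window_size)
    windows = [(i * window_size, (i + 1) * window_size) for i in range(full_windows)]
    # Leftover frames form a final, shorter window only when there are at
    # least remainder_threshold of them. Dropping them otherwise is an assumption.
    if leftover >= remainder_threshold:
        windows.append((full_windows * window_size, total_frames))
    return windows


print(split_into_windows(700))
print(split_into_windows(600))
```

With the defaults, a 700-frame clip leaves 188 frames after two full windows, which is enough for a third window; a 600-frame clip leaves only 88, which falls below the threshold.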

PreviewStage

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| target_fps | float | 1.0 | Frames per second for preview encoding. |
| target_height | int | 240 | Output height in pixels; width auto‑scales to preserve aspect ratio. |
| compression_level | int (0–6) | 6 | WebP compression level (0 = lossless, higher = smaller files). |
| quality | int (0–100) | 50 | WebP quality factor (100 = best quality, larger files). |
| num_cpus_per_worker | float | 4.0 | CPU threads mapped to ffmpeg -threads for encoding. |
| verbose | bool | False | Log warnings and per‑window encoding details. |
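
For orientation, the sketch below shows how these parameters could map onto an ffmpeg command line for animated WebP output. It is an assumption about the general shape of such a command, not the command PreviewStage actually runs: the stage works on in-memory mp4 bytes, while this example uses file paths, and `build_webp_preview_cmd` is a hypothetical helper.

```python
def build_webp_preview_cmd(
    src: str,
    dst: str,
    target_fps: float = 1.0,
    target_height: int = 240,
    compression_level: int = 6,
    quality: int = 50,
    num_cpus_per_worker: float = 4.0,
) -> list[str]:
    """Build an ffmpeg argument list for a WebP preview.

    Hypothetical helper; the flags are standard ffmpeg/libwebp options, but
    the exact command used by PreviewStage may differ.
    """
    threads = max(1, int(num_cpus_per_worker))
    return [
        "ffmpeg", "-y",
        "-i", src,
        # Resample to target_fps and scale to target_height; -2 keeps the
        # aspect ratio and rounds the width to an even number.
        "-vf", f"fps={target_fps},scale=-2:{target_height}",
        "-c:v", "libwebp",
        "-compression_level", str(compression_level),
        "-quality", str(quality),
        "-loop", "0",   # loop the animation indefinitely
        "-an",          # previews carry no audio
        "-threads", str(threads),
        dst,
    ]


print(" ".join(build_webp_preview_cmd("window.mp4", "window.webp")))
```

Returning an argument list rather than a shell string lets the command be passed directly to `subprocess.run` without quoting concerns.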

Caption generation and enhancement

  1. Generate window‑level captions with the chosen VLM (Qwen‑VL or Nemotron). This stage reads clip.windows[*].qwen_llm_input (created earlier) and writes window.caption["qwen"] (or window.caption["nemotron"], depending on the variant).

    from nemo_curator.stages.video.caption.caption_generation import CaptionGenerationStage
    from nemo_curator.stages.video.caption.caption_enhancement import CaptionEnhancementStage
    
    gen = CaptionGenerationStage(
        model_dir="/models",
        model_variant="qwen",  # or "nemotron" / "nemotron-fp8" / ...
        caption_batch_size=8,
        fp8=False,
        max_output_tokens=512,
        model_does_preprocess=False,
        generate_stage2_caption=False,
        stage2_prompt_text=None,
        disable_mmcache=True,
    )
    
  2. Optionally enhance captions with a text‑based LLM (Qwen‑LM) to expand and refine descriptions. This stage reads window.caption["qwen"] and writes window.enhanced_caption["qwen_lm"].

    enh = CaptionEnhancementStage(
        model_dir="/models",
        model_variant="qwen",
        prompt_variant="default",
        prompt_text=None,
        model_batch_size=128,
        fp8=False,
        max_output_tokens=512,
        verbose=True,
    )

Parameters

CaptionGenerationStage

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_dir | str | "models/qwen" | Directory for model weights; downloaded on each node if missing. |
| model_variant | str | "qwen" | Vision‑language model variant. One of qwen, nemotron, nemotron-bf16, nemotron-fp8, nemotron-nvfp4. |
| caption_batch_size | int | 16 | Batch size for caption generation. |
| fp8 | bool | False | Use FP8 weights when available. |
| max_output_tokens | int | 512 | Maximum number of tokens to generate per caption. |
| model_does_preprocess | bool | False | Whether the model performs its own preprocessing. |
| disable_mmcache | bool | False | Disable multimodal cache for generation backends that support it. |
| generate_stage2_caption | bool | False | Enable a second‑pass caption for refinement. |
| stage2_prompt_text | str \| None | None | Custom prompt for stage‑2 caption refinement. |
| verbose | bool | False | Emit additional logs during generation. |
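
As a rough picture of what caption_batch_size controls, the sketch below groups window inputs into fixed-size batches and calls a model function once per batch. This is an illustration under assumptions, not the stage's implementation: `batched`, `caption_windows`, and the fake model function are all hypothetical, and the real stage may schedule work differently.

```python
from typing import Callable, Iterator


def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def caption_windows(
    inputs: list[str],
    caption_fn: Callable[[list[str]], list[str]],
    caption_batch_size: int = 16,
) -> list[str]:
    """Caption all inputs, calling caption_fn once per batch and keeping order."""
    captions: list[str] = []
    for batch in batched(inputs, caption_batch_size):
        captions.extend(caption_fn(batch))
    return captions


def fake_model(batch: list[str]) -> list[str]:
    # Stand-in for a VLM call; a real model would return generated text.
    return [f"caption for {item}" for item in batch]


print(caption_windows([f"window-{i}" for i in range(5)], fake_model, caption_batch_size=2))
```

Larger batches generally improve GPU utilization at the cost of memory, which is why the example configurations in this page lower the value to 8.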

CaptionEnhancementStage

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_dir | str | "models/qwen" | Directory for language‑model weights; downloaded per node if missing. |
| model_variant | str ("qwen") | "qwen" | Language‑model variant. |
| prompt_variant | str (e.g. "default", "av-surveillance") | "default" | Built‑in enhancement prompt when prompt_text is not provided. |
| prompt_text | str \| None | None | Custom enhancement prompt. When set, overrides prompt_variant. |
| model_batch_size | int | 128 | Batch size for enhancement generation. |
| fp8 | bool | False | Use FP8 weights when available. |
| max_output_tokens | int | 512 | Maximum number of tokens to generate per enhanced caption. |
| verbose | bool | False | Emit additional logs during enhancement. |

Preview Generation

Generate lightweight .webp previews for each caption window to support review and QA workflows. A dedicated PreviewStage reads per-window mp4 bytes and encodes WebP using ffmpeg.

Preview Parameters

  • target_fps (default 1.0): Target frames per second for preview generation.
  • target_height (default 240): Output height. Width auto-scales to preserve aspect ratio.
  • compression_level (range 0–6, default 6): WebP compression level. 0 is lossless; higher values reduce size with lower quality.
  • quality (range 0–100, default 50): WebP quality. Higher values increase quality and size.
  • num_cpus_per_worker (default 4.0): Number of CPU threads mapped to ffmpeg -threads.
  • verbose (default False): Emit more logs.

Behavior notes:

  • If the input frame rate is lower than target_fps or the input height is lower than target_height, the stage logs a warning and preview quality can degrade.
  • If ffmpeg fails, the stage logs the error and skips assigning preview bytes for that window.
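
The first behavior note can be checked ahead of time if the input frame rate and height are known. The following is a minimal sketch of that comparison; `preview_input_warnings` is a hypothetical helper, and the message wording is invented for illustration rather than taken from the stage's logs.

```python
def preview_input_warnings(
    input_fps: float,
    input_height: int,
    target_fps: float = 1.0,
    target_height: int = 240,
) -> list[str]:
    """Return human-readable warnings for inputs below the preview targets.

    Hypothetical pre-check; message text is not the stage's actual log output.
    """
    warnings: list[str] = []
    if input_fps < target_fps:
        warnings.append(f"input fps {input_fps} is below target_fps {target_fps}")
    if input_height < target_height:
        warnings.append(f"input height {input_height} is below target_height {target_height}")
    return warnings


print(preview_input_warnings(input_fps=0.5, input_height=180))
```

An empty list means the inputs meet both targets and no quality warning would be expected for that window.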

Example: Configure PreviewStage

from nemo_curator.stages.video.preview.preview import PreviewStage

preview = PreviewStage(
    target_fps=1.0,
    target_height=240,
    compression_level=6,
    quality=50,
    num_cpus_per_worker=4.0,
    verbose=False,
)

Outputs

The stage writes .webp files under the previews/ directory that ClipWriterStage manages. Use the helper to resolve the path:

from nemo_curator.stages.video.io.clip_writer import ClipWriterStage
previews_dir = ClipWriterStage.get_output_path_previews("/outputs")
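
Once the previews directory is known, the generated files can be enumerated with the standard library. This sketch assumes only that previews are .webp files somewhere under that directory; the nesting and file names in the demo are made up, and `list_previews` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def list_previews(previews_dir: str) -> list[Path]:
    """Return all .webp files under previews_dir, sorted by path."""
    root = Path(previews_dir)
    if not root.is_dir():
        return []
    # rglob searches recursively because the exact layout is an assumption here.
    return sorted(p for p in root.rglob("*.webp") if p.is_file())


# Demo against a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "clip-a").mkdir()
    (Path(tmp) / "clip-a" / "0_256.webp").write_bytes(b"")
    (Path(tmp) / "notes.txt").write_text("not a preview")
    print([p.name for p in list_previews(tmp)])
```

In practice, pass the value of `previews_dir` resolved by the helper above instead of a temporary directory.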

Refer to Save & Export for directory structure and file locations.

Requirements and Troubleshooting

  • ffmpeg with WebP (libwebp) support must be available in the environment.
  • If you observe warnings about low frame rate or height, consider lowering target_fps or target_height to better match inputs.
  • On encoding errors, check logs for the ffmpeg command and output to diagnose missing encoders.
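
One way to verify the first requirement is to inspect the output of `ffmpeg -encoders` for a libwebp entry. The sketch below separates the parsing from the subprocess call so the parsing can be tested without ffmpeg installed; both function names are hypothetical, and the check assumes the usual column layout of ffmpeg's encoder listing.

```python
import shutil
import subprocess


def has_libwebp_encoder(encoders_output: str) -> bool:
    """Return True if an `ffmpeg -encoders` listing includes a libwebp encoder."""
    for line in encoders_output.splitlines():
        parts = line.split()
        # Listing rows look like: " V....D libwebp   libwebp WebP image (codec webp)"
        if len(parts) >= 2 and parts[1] in {"libwebp", "libwebp_anim"}:
            return True
    return False


def ffmpeg_supports_webp() -> bool:
    """Run ffmpeg (if it is on PATH) and check for WebP encoding support."""
    ffmpeg = shutil.which("ffmpeg")
    if ffmpeg is None:
        return False
    result = subprocess.run(
        [ffmpeg, "-hide_banner", "-encoders"],
        capture_output=True,
        text=True,
        check=False,
    )
    return has_libwebp_encoder(result.stdout)


print("WebP encoding available:", ffmpeg_supports_webp())
```

Running this once at environment setup surfaces a missing encoder before any previews are attempted.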