Captions and Preview
Prepare inputs, generate captions, optionally enhance them, and produce preview images.
Choosing a Captioning Model
The video captioning pipeline supports two model families. Pick a variant based on quality, GPU memory, and throughput:
| Variant | Model | Default Use Case |
|---|---|---|
| qwen | Qwen/Qwen2.5-VL-7B-Instruct | Default; good quality/throughput balance |
| nemotron / nemotron-bf16 | Nemotron Nano 12B v2 VL (BF16) | High-quality captions; auto-downloaded from Hugging Face |
| nemotron-fp8 | Nemotron Nano 12B v2 VL (FP8) | Same model, FP8-quantized for lower memory |
| nemotron-nvfp4 | Nemotron Nano 12B v2 VL (NVFP4-QAD) | NVFP4 quantization-aware-distilled checkpoint |
Caption enhancement (the optional second-pass LLM rewrite) uses Qwen-LM (--enhance-captions-algorithm qwen_lm).
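To catch a mistyped variant name before a long pipeline run starts, a small lookup can help. The sketch below is illustrative only: the names and descriptions are copied from the table above, while `CAPTIONING_VARIANTS` and `describe_variant` are hypothetical helpers, not part of the library.

```python
# Variant names and model descriptions copied from the table above.
# CAPTIONING_VARIANTS and describe_variant are hypothetical helpers.
CAPTIONING_VARIANTS = {
    "qwen": "Qwen/Qwen2.5-VL-7B-Instruct",
    "nemotron": "Nemotron Nano 12B v2 VL (BF16)",
    "nemotron-bf16": "Nemotron Nano 12B v2 VL (BF16)",
    "nemotron-fp8": "Nemotron Nano 12B v2 VL (FP8)",
    "nemotron-nvfp4": "Nemotron Nano 12B v2 VL (NVFP4-QAD)",
}


def describe_variant(name: str) -> str:
    """Return the model behind a variant name, or raise on an unknown name."""
    try:
        return CAPTIONING_VARIANTS[name]
    except KeyError:
        valid = ", ".join(sorted(CAPTIONING_VARIANTS))
        raise ValueError(f"unknown variant {name!r}; expected one of: {valid}") from None


print(describe_variant("nemotron-fp8"))
```

Calling `describe_variant` on user input before constructing any stage surfaces a typo immediately instead of at model load time.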
Quickstart
Use the pipeline stages or the example script flags to prepare captions and preview images.
Pipeline Stage
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.video.caption.caption_preparation import CaptionPreparationStage
from nemo_curator.stages.video.caption.caption_generation import CaptionGenerationStage
from nemo_curator.stages.video.caption.caption_enhancement import CaptionEnhancementStage
from nemo_curator.stages.video.preview.preview import PreviewStage

pipe = Pipeline(name="captions_preview")
pipe.add_stage(
    CaptionPreparationStage(
        model_variant="qwen",
        prompt_variant="default",
        prompt_text=None,
        sampling_fps=2.0,
        window_size=256,
        remainder_threshold=128,
        preprocess_dtype="float16",
        model_does_preprocess=False,
        generate_previews=True,
        verbose=True,
    )
)
pipe.add_stage(PreviewStage(target_fps=1.0, target_height=240, verbose=True))
pipe.add_stage(
    CaptionGenerationStage(
        model_dir="/models",
        model_variant="qwen",
        caption_batch_size=8,
        fp8=False,
        max_output_tokens=512,
        model_does_preprocess=False,
        generate_stage2_caption=False,
        stage2_prompt_text=None,
        disable_mmcache=True,
    )
)
pipe.run()

To use Nemotron instead, set model_variant="nemotron" (or one of nemotron-bf16, nemotron-fp8, nemotron-nvfp4) on both CaptionPreparationStage and CaptionGenerationStage. Nemotron weights are auto-downloaded from Hugging Face on first use.
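Because the variant must match on both stages, it can be convenient to derive both sets of keyword arguments from one value. The following is a minimal sketch under that assumption; `stage_kwargs` is a hypothetical helper and only uses parameter names that appear in the example above.

```python
def stage_kwargs(model_variant: str = "qwen", model_dir: str = "/models") -> tuple[dict, dict]:
    """Build keyword arguments for both stages from a single variant choice.

    Hypothetical helper: it only guarantees that the preparation and
    generation stages receive the same model_variant.
    """
    prep_kwargs = {
        "model_variant": model_variant,
        "sampling_fps": 2.0,
        "window_size": 256,
        "remainder_threshold": 128,
    }
    gen_kwargs = {
        "model_dir": model_dir,
        "model_variant": model_variant,
        "caption_batch_size": 8,
    }
    return prep_kwargs, gen_kwargs


prep_kwargs, gen_kwargs = stage_kwargs("nemotron-fp8")
# These would then be passed as CaptionPreparationStage(**prep_kwargs)
# and CaptionGenerationStage(**gen_kwargs).
print(prep_kwargs["model_variant"], gen_kwargs["model_variant"])
```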
Script Flags
python tutorials/video/getting-started/video_split_clip_example.py \
  ... \
  --generate-captions \
  --captioning-algorithm qwen \
  --captioning-window-size 256 \
  --captioning-remainder-threshold 128 \
  --captioning-sampling-fps 2.0 \
  --captioning-preprocess-dtype float16 \
  --captioning-batch-size 8 \
  --captioning-max-output-tokens 512 \
  --generate-previews \
  --preview-target-fps 1.0 \
  --preview-target-height 240

--captioning-algorithm accepts: qwen (default, Qwen2.5-VL-7B-Instruct), nemotron, nemotron-bf16, nemotron-fp8, nemotron-nvfp4. To enable caption enhancement with the Qwen LM, also pass --enhance-captions --enhance-captions-algorithm qwen_lm.
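When the script is launched from another Python program, the flags can be assembled as a list instead of a shell string. This sketch only covers the captioning and preview flags listed above; the arguments elided as `...` in the command (inputs, outputs, and so on) are not reproduced, and `captioning_args` is a hypothetical helper.

```python
import shlex


def captioning_args(algorithm: str = "qwen", enhance: bool = False) -> list[str]:
    """Assemble the captioning and preview flags shown above (hypothetical helper)."""
    args = [
        "--generate-captions",
        "--captioning-algorithm", algorithm,
        "--captioning-window-size", "256",
        "--captioning-remainder-threshold", "128",
        "--captioning-sampling-fps", "2.0",
        "--captioning-preprocess-dtype", "float16",
        "--captioning-batch-size", "8",
        "--captioning-max-output-tokens", "512",
        "--generate-previews",
        "--preview-target-fps", "1.0",
        "--preview-target-height", "240",
    ]
    if enhance:
        # Optional second-pass rewrite with the Qwen LM.
        args += ["--enhance-captions", "--enhance-captions-algorithm", "qwen_lm"]
    return args


# Print the flags as a shell-safe string for inspection.
print(shlex.join(captioning_args("nemotron-fp8", enhance=True)))
```

The returned list can be appended to the script path and the remaining required arguments before being handed to `subprocess.run`.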
Preparation and previews
- Prepare caption inputs from each clip window. This step splits clips into fixed windows, formats model‑ready inputs for the chosen VLM (Qwen‑VL or Nemotron), and optionally stores per‑window mp4 bytes for previews.

  from nemo_curator.stages.video.caption.caption_preparation import CaptionPreparationStage
  from nemo_curator.stages.video.preview.preview import PreviewStage

  prep = CaptionPreparationStage(
      model_variant="qwen",  # or "nemotron" / "nemotron-fp8" / ...
      prompt_variant="default",
      prompt_text=None,
      sampling_fps=2.0,
      window_size=256,
      remainder_threshold=128,
      preprocess_dtype="float16",
      model_does_preprocess=False,
      generate_previews=True,
      verbose=True,
  )

- Optionally generate .webp previews from each window’s mp4 bytes for quick QA and review.

  preview = PreviewStage(
      target_fps=1.0,
      target_height=240,
      verbose=True,
  )
Parameters
CaptionPreparationStage
| Parameter | Type | Default | Description |
|---|---|---|---|
| model_variant | str | "qwen" | Vision‑language model used to format inputs. One of qwen, nemotron, nemotron-bf16, nemotron-fp8, nemotron-nvfp4. |
| prompt_variant | str | "default" | Built‑in prompt to steer caption content when prompt_text is not provided. Known values include default and av-surveillance. |
| prompt_text | str \| None | None | Custom prompt text. When set, overrides prompt_variant. |
| sampling_fps | float | 2.0 | Source sampling rate for creating per‑window inputs. |
| window_size | int | 256 | Number of frames per window before captioning. |
| remainder_threshold | int | 128 | Minimum leftover frames required to create a final shorter window. |
| model_does_preprocess | bool | False | Whether the downstream model performs its own preprocessing. |
| preprocess_dtype | str | "float32" | Data type for any preprocessing performed here. |
| generate_previews | bool | True | When True, return per‑window mp4 bytes to enable preview generation. |
| verbose | bool | False | Log additional setup and per‑clip details. |
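The interaction between window_size and remainder_threshold is easier to see with numbers. The sketch below is one plausible reading of the two descriptions above, not the library's implementation: it assumes full windows are cut back to back and that a leftover shorter than remainder_threshold is dropped. How the real stage treats clips shorter than one window is not stated here, so that case follows the same assumption.

```python
def split_into_windows(
    num_frames: int,
    window_size: int = 256,
    remainder_threshold: int = 128,
) -> list[tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive.

    Assumption: a leftover shorter than remainder_threshold is dropped.
    """
    windows = []
    start = 0
    # Cut as many full windows as fit.
    while num_frames - start >= window_size:
        windows.append((start, start + window_size))
        start += window_size
    # Keep the leftover only if it is long enough.
    remainder = num_frames - start
    if remainder >= remainder_threshold:
        windows.append((start, num_frames))
    return windows


print(split_into_windows(600))  # leftover of 88 frames is dropped
print(split_into_windows(700))  # leftover of 188 frames becomes a third window
```

With the defaults, a 600-frame clip yields two windows and a 700-frame clip yields three, because only the second leftover reaches the 128-frame threshold.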
PreviewStage
| Parameter | Type | Default | Description |
|---|---|---|---|
| target_fps | float | 1.0 | Frames per second for preview encoding. |
| target_height | int | 240 | Output height in pixels; width auto‑scales to preserve aspect ratio. |
| compression_level | int (0–6) | 6 | WebP compression level (0 = lossless, higher = smaller files). |
| quality | int (0–100) | 50 | WebP quality factor (100 = best quality, larger files). |
| num_cpus_per_worker | float | 4.0 | CPU threads mapped to ffmpeg -threads for encoding. |
| verbose | bool | False | Log warnings and per‑window encoding details. |
Caption generation and enhancement
- Generate window‑level captions with the chosen VLM (Qwen‑VL or Nemotron). This stage reads clip.windows[*].qwen_llm_input (created earlier) and writes window.caption["qwen"] (or window.caption["nemotron"], depending on the variant).

  from nemo_curator.stages.video.caption.caption_generation import CaptionGenerationStage
  from nemo_curator.stages.video.caption.caption_enhancement import CaptionEnhancementStage

  gen = CaptionGenerationStage(
      model_dir="/models",
      model_variant="qwen",  # or "nemotron" / "nemotron-fp8" / ...
      caption_batch_size=8,
      fp8=False,
      max_output_tokens=512,
      model_does_preprocess=False,
      generate_stage2_caption=False,
      stage2_prompt_text=None,
      disable_mmcache=True,
  )

- Optionally enhance captions with a text‑based LLM (Qwen‑LM) to expand and refine descriptions. This stage reads window.caption["qwen"] and writes window.enhanced_caption["qwen_lm"].

  enh = CaptionEnhancementStage(
      model_dir="/models",
      model_variant="qwen",
      prompt_variant="default",
      prompt_text=None,
      model_batch_size=128,
      fp8=False,
      max_output_tokens=512,
      verbose=True,
  )
Parameters
CaptionGenerationStage
| Parameter | Type | Default | Description |
|---|---|---|---|
| model_dir | str | "models/qwen" | Directory for model weights; downloaded on each node if missing. |
| model_variant | str | "qwen" | Vision‑language model variant. One of qwen, nemotron, nemotron-bf16, nemotron-fp8, nemotron-nvfp4. |
| caption_batch_size | int | 16 | Batch size for caption generation. |
| fp8 | bool | False | Use FP8 weights when available. |
| max_output_tokens | int | 512 | Maximum number of tokens to generate per caption. |
| model_does_preprocess | bool | False | Whether the model performs its own preprocessing. |
| disable_mmcache | bool | False | Disable multimodal cache for generation backends that support it. |
| generate_stage2_caption | bool | False | Enable a second‑pass caption for refinement. |
| stage2_prompt_text | str \| None | None | Custom prompt for stage‑2 caption refinement. |
| verbose | bool | False | Emit additional logs during generation. |
CaptionEnhancementStage
| Parameter | Type | Default | Description |
|---|---|---|---|
| model_dir | str | "models/qwen" | Directory for language‑model weights; downloaded per node if missing. |
| model_variant | str | "qwen" | Language‑model variant. qwen is the only value documented here. |
| prompt_variant | str | "default" | Built‑in enhancement prompt when prompt_text is not provided. Known values include default and av-surveillance. |
| prompt_text | str \| None | None | Custom enhancement prompt. When set, overrides prompt_variant. |
| model_batch_size | int | 128 | Batch size for enhancement generation. |
| fp8 | bool | False | Use FP8 weights when available. |
| max_output_tokens | int | 512 | Maximum number of tokens to generate per enhanced caption. |
| verbose | bool | False | Emit additional logs during enhancement. |
Preview Generation
Generate lightweight .webp previews for each caption window to support review and QA workflows. A dedicated PreviewStage reads per-window mp4 bytes and encodes WebP using ffmpeg.
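To reproduce a preview by hand when debugging, an equivalent ffmpeg invocation can be built from the same settings. This is a sketch of a plausible command, not necessarily the one PreviewStage runs: it works on file paths rather than in-memory mp4 bytes, and `build_preview_cmd` is a hypothetical helper. The ffmpeg options used (`-vf fps/scale`, `-c:v libwebp`, `-compression_level`, `-q:v`, `-loop`, `-threads`) are standard ffmpeg options.

```python
def build_preview_cmd(
    src: str,
    dst: str,
    target_fps: float = 1.0,
    target_height: int = 240,
    compression_level: int = 6,
    quality: int = 50,
    num_threads: int = 4,
) -> list[str]:
    """Build an ffmpeg argument list that encodes an animated WebP preview."""
    return [
        "ffmpeg", "-y",
        "-i", src,
        # Resample to the target frame rate; scale=-1 keeps the aspect ratio.
        "-vf", f"fps={target_fps},scale=-1:{target_height}",
        "-c:v", "libwebp",
        "-compression_level", str(compression_level),
        "-q:v", str(quality),
        "-loop", "0",  # loop the animation indefinitely
        "-threads", str(num_threads),
        dst,
    ]


print(" ".join(build_preview_cmd("window.mp4", "window.webp")))
```

Passing the list to `subprocess.run(cmd, check=True)` runs the encode and raises if ffmpeg exits with an error.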
Preview Parameters
- target_fps (default 1.0): Target frames per second for preview generation.
- target_height (default 240): Output height. Width auto-scales to preserve aspect ratio.
- compression_level (range 0–6, default 6): WebP compression level. 0 is lossless; higher values reduce size with lower quality.
- quality (range 0–100, default 50): WebP quality. Higher values increase quality and size.
- num_cpus_per_worker (default 4.0): Number of CPU threads mapped to ffmpeg -threads.
- verbose (default False): Emit more logs.
Behavior notes:
- If the input frame rate is lower than target_fps or the input height is lower than target_height, the stage logs a warning and preview quality can degrade.
- If ffmpeg fails, the stage logs the error and skips assigning preview bytes for that window.
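The first condition can be checked ahead of time if the input frame rate and height are known. The sketch below mirrors the comparison described in the note; `preview_warnings` is a hypothetical helper and the message wording is illustrative, not the stage's actual log output.

```python
def preview_warnings(
    input_fps: float,
    input_height: int,
    target_fps: float = 1.0,
    target_height: int = 240,
) -> list[str]:
    """Return a message for each input property that is below its preview target."""
    messages = []
    if input_fps < target_fps:
        messages.append(f"input fps {input_fps} is below target_fps {target_fps}")
    if input_height < target_height:
        messages.append(f"input height {input_height} is below target_height {target_height}")
    return messages


for message in preview_warnings(input_fps=0.5, input_height=120):
    print("warning:", message)
```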
Example: Configure PreviewStage
from nemo_curator.stages.video.preview.preview import PreviewStage

preview = PreviewStage(
    target_fps=1.0,
    target_height=240,
    compression_level=6,
    quality=50,
    num_cpus_per_worker=4.0,
    verbose=False,
)
Outputs
The stage writes .webp files under the previews/ directory that ClipWriterStage manages. Use the helper to resolve the path:
from nemo_curator.stages.video.io.clip_writer import ClipWriterStage
previews_dir = ClipWriterStage.get_output_path_previews("/outputs")
Refer to Save & Export for directory structure and file locations.
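Once the previews directory is known, collecting the generated files for review is a short directory walk. This sketch assumes only that previews end in `.webp` somewhere under that directory; the exact layout beneath previews/ is not described here, so the search is recursive. `list_previews` is a hypothetical helper.

```python
from pathlib import Path


def list_previews(previews_dir: str) -> list[Path]:
    """Return all .webp files under previews_dir, sorted by path."""
    root = Path(previews_dir)
    if not root.is_dir():
        # Nothing has been written yet, or previews were not enabled.
        return []
    return sorted(root.rglob("*.webp"))


for path in list_previews("/outputs/previews"):
    print(path)
```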
Requirements and Troubleshooting
- ffmpeg with WebP (libwebp) support must be available in the environment.
- If you observe warnings about low frame rate or height, consider lowering target_fps or target_height to better match inputs.
- On encoding errors, check logs for the ffmpeg command and output to diagnose missing encoders.
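The first requirement can be verified before running the pipeline by inspecting the encoder list that `ffmpeg -encoders` prints. The sketch below assumes the usual output layout, where the encoder name is the second whitespace-separated column; both function names are hypothetical.

```python
import shutil
import subprocess


def has_libwebp(encoders_text: str) -> bool:
    """Return True if the `ffmpeg -encoders` output lists a libwebp encoder."""
    for line in encoders_text.splitlines():
        columns = line.split()
        # Encoder rows look like: " V....D libwebp  libwebp WebP image (codec webp)"
        if len(columns) >= 2 and columns[1] in {"libwebp", "libwebp_anim"}:
            return True
    return False


def ffmpeg_supports_webp() -> bool:
    """Return True if ffmpeg is on PATH and was built with libwebp."""
    if shutil.which("ffmpeg") is None:
        return False
    result = subprocess.run(
        ["ffmpeg", "-hide_banner", "-encoders"],
        capture_output=True,
        text=True,
        check=False,
    )
    return has_libwebp(result.stdout)


print("WebP encoding available:", ffmpeg_supports_webp())
```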