Data Export Concepts (Image)
This page covers the core concepts for saving and exporting curated image datasets in NeMo Curator.
Key Topics
- Saving curated images and metadata
- Understanding output format structure
- Configuring output sharding
- Preparing data for downstream training or analysis
Saving Results
After processing through the pipeline, you can save the curated images and metadata using the ImageWriterStage.
Example:
from nemo_curator.stages.image.io.image_writer import ImageWriterStage

# Add writer stage to an existing pipeline object
pipeline.add_stage(ImageWriterStage(
    output_dir="/output/curated_dataset",
    images_per_tar=1000,      # Images per tar file
    remove_image_data=True,   # Remove image data from memory after writing
    verbose=True,
    deterministic_name=True,  # Use deterministic naming for reproducible output
))
Key Parameters:
- output_dir: Directory where tar archives and metadata files are written
- images_per_tar: Number of images per tar file for optimal sharding
- remove_image_data: Whether to remove image data from memory after writing
- deterministic_name: Ensures reproducible file naming based on input content
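To illustrate what "reproducible file naming based on input content" can mean, here is a minimal sketch that derives a shard name from the image IDs it contains. The helper deterministic_shard_name, its parameters, and the choice of SHA-256 are assumptions made for this example; they are not NeMo Curator's actual naming scheme.

```python
import hashlib


def deterministic_shard_name(image_ids, prefix="images", index=0):
    """Build a shard base name that depends only on the shard's image IDs.

    Hypothetical helper for illustration only; the real writer stage may
    derive its hash differently.
    """
    # Sort the IDs so the name does not depend on processing order.
    joined = "\n".join(sorted(image_ids))
    digest = hashlib.sha256(joined.encode("utf-8")).hexdigest()[:12]
    return f"{prefix}-{digest}-{index:06d}"


name = deterministic_shard_name(["img_002", "img_001"])
print(name + ".tar")
print(name + ".parquet")
```

Because the name is a pure function of the inputs, rerunning the same data yields the same file names, which makes it easier to compare or resume runs.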
Behavior:
- The writer stage creates tar files with curated images
- Metadata for each image (including paths, IDs, scores, and processing metadata) is always stored in separate Parquet files alongside tar archives
- Adjust images_per_tar to balance I/O, parallelism, and storage efficiency
  - Smaller values create more files but enable better parallelism
  - Larger values reduce file count but may impact loading performance
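The trade-off between file count and shard size can be estimated before running the pipeline. The sketch below assumes images are packed sequentially into full shards; estimate_shards is a hypothetical helper, and actual shard counts may differ when several workers write independently.

```python
def estimate_shards(total_images, images_per_tar):
    """Return (number_of_tar_files, images_in_last_tar).

    Hypothetical helper: assumes sequential packing into full shards.
    """
    if images_per_tar <= 0:
        raise ValueError("images_per_tar must be positive")
    if total_images < 0:
        raise ValueError("total_images must be non-negative")
    # Ceiling division without floating point.
    num_tars = (total_images + images_per_tar - 1) // images_per_tar
    if num_tars == 0:
        return 0, 0
    last = total_images - (num_tars - 1) * images_per_tar
    return num_tars, last


for per_tar in (250, 1000, 4000):
    tars, last = estimate_shards(1_000_000, per_tar)
    print(f"images_per_tar={per_tar}: {tars} tar files, last holds {last}")
```

Each tar file has a matching Parquet file, so the total number of output files is roughly twice the tar count.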
Output Format
The ImageWriterStage creates tar archives containing curated images with accompanying metadata files:
Output Structure:
output/
├── images-{hash}-000000.tar # Contains JPEG images
├── images-{hash}-000000.parquet # Metadata for corresponding tar
├── images-{hash}-000001.tar
├── images-{hash}-000001.parquet
Format Details:
- Tar contents: JPEG images with sequential or ID-based filenames
- Metadata storage: Separate Parquet files containing image paths, IDs, and processing metadata
- Naming: Deterministic or random file names, controlled by deterministic_name
- Sharding: Configurable number of images per tar file for optimal performance
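Since each tar archive has a Parquet file with the same base name, a consumer can pair them by file stem. The following is a minimal sketch using only the standard library; pair_shards is a hypothetical helper, and it assumes all shards sit directly in one output directory as in the structure above.

```python
from pathlib import Path


def pair_shards(output_dir):
    """Match each .tar with the .parquet that shares its base name.

    Returns (pairs, unmatched): a list of (tar_path, parquet_path) tuples
    and a sorted list of base names that lack one of the two files.
    """
    root = Path(output_dir)
    tars = {p.stem: p for p in root.glob("*.tar")}
    parquets = {p.stem: p for p in root.glob("*.parquet")}
    common = sorted(tars.keys() & parquets.keys())
    pairs = [(tars[stem], parquets[stem]) for stem in common]
    # Symmetric difference: stems present on only one side.
    unmatched = sorted(tars.keys() ^ parquets.keys())
    return pairs, unmatched
```

A non-empty unmatched list usually points to an interrupted write or a partially copied dataset.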
Preparing for Downstream Use
- Ensure your exported dataset matches the requirements of your training or analysis pipeline.
- Use consistent naming and metadata fields for compatibility.
- Document any filtering or processing steps for reproducibility.
- Test loading the exported dataset before large-scale training.
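One inexpensive load test is to open each tar archive and count the images it contains. The sketch below uses Python's standard tarfile module; count_images_in_tar is a hypothetical helper, and the .jpg/.jpeg extensions are an assumption based on the JPEG contents described above.

```python
import tarfile


def count_images_in_tar(tar_path, extensions=(".jpg", ".jpeg")):
    """Count regular files in a tar archive whose names end with an image extension."""
    count = 0
    with tarfile.open(tar_path, "r") as archive:
        for member in archive:
            # Skip directories and other non-file entries.
            if member.isfile() and member.name.lower().endswith(extensions):
                count += 1
    return count
```

Comparing the count per archive against images_per_tar, or against the row count of the matching Parquet file, can catch truncated or corrupted shards before a long training run starts.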