Streaming vs. Batch Processing
Different inference stages have different compute requirements. NeMo Curator uses Ray streaming to raise GPU utilization and processing speed compared with a traditional approach that runs each stage over the entire dataset before the next one starts.
Batch Mode vs. Streaming Mode
Batch Mode
In batch mode, each stage processes the entire dataset before the next stage begins. Stages with different compute requirements (CPU-only tokenization, single-GPU classifiers, multi-GPU encoders) all run sequentially:
| Aspect | Batch Mode |
|---|---|
| Execution | One stage at a time across the full dataset |
| Memory usage | Proportional to dataset size |
| GPU utilization | Low — GPUs idle while CPU stages run, and vice versa |
| Time to first output | After the entire pipeline finishes |
Streaming Mode
In streaming mode, data flows through the pipeline as discrete batches. Each stage processes its current batch and immediately passes it downstream, so all stages run concurrently on different batches:
| Aspect | Streaming Mode |
|---|---|
| Execution | All stages active simultaneously on different batches |
| Memory usage | Bounded (proportional to batch size and the number of batches in flight, not dataset size) |
| GPU utilization | High — stages with different hardware needs overlap |
| Time to first output | After the first batch completes the pipeline |
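The difference in execution order between the two modes can be illustrated in plain Python, without any Curator or Ray APIs. In this minimal sketch, `tokenize` and `classify` are hypothetical stand-ins for pipeline stages. The batch version materializes each stage's full output in a list, while the streaming version chains generators so that a batch passes through every stage before the next one is read. The sketch is single-threaded, so it shows time to first output and bounded memory, not the concurrency a real executor provides.

```python
from typing import Iterator, List, Tuple

Batch = Tuple[int, list]  # (batch id, payload)


def tokenize(batch: Batch, log: List[str]) -> Batch:
    """Hypothetical CPU stage: split each document on whitespace."""
    batch_id, docs = batch
    log.append(f"tokenize {batch_id}")
    return batch_id, [doc.split() for doc in docs]


def classify(batch: Batch, log: List[str]) -> Batch:
    """Hypothetical GPU stage: here it only counts tokens per document."""
    batch_id, token_lists = batch
    log.append(f"classify {batch_id}")
    return batch_id, [len(tokens) for tokens in token_lists]


def run_batch_mode(batches: List[Batch], log: List[str]) -> List[Batch]:
    # Each stage finishes the whole dataset before the next stage starts.
    tokenized = [tokenize(b, log) for b in batches]
    return [classify(b, log) for b in tokenized]


def run_streaming_mode(batches: List[Batch], log: List[str]) -> Iterator[Batch]:
    # Generators are lazy: a batch is tokenized only when classify asks for it.
    tokenized = (tokenize(b, log) for b in batches)
    return (classify(b, log) for b in tokenized)


batches = [(0, ["a b", "c"]), (1, ["d e f"]), (2, ["g"])]

batch_log: List[str] = []
batch_results = run_batch_mode(batches, batch_log)

stream_log: List[str] = []
stream = run_streaming_mode(batches, stream_log)
first_result = next(stream)              # first output is available here
log_at_first_output = list(stream_log)   # only batch 0 has been touched so far
stream_results = [first_result] + list(stream)

print(batch_log)
print(log_at_first_output)
```

In batch mode the log lists all three `tokenize` calls before any `classify` call. In streaming mode the first result is ready after only batch 0 has been tokenized and classified, and both modes produce the same final results.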
Why Streaming Is Faster
Streaming with heterogeneous compute allows NeMo Curator to overlap stages that use different resources. For example, while a GPU inference stage processes batch N, a CPU tokenization stage can process batch N+1 simultaneously — neither blocks the other.
This overlap improves throughput in pipelines that mix CPU and GPU work, because both happen in parallel rather than taking turns.
Combined with auto-balancing, streaming enables Curator to rearrange resources so that GPU stage workers are kept busy over 99% of the time after an initial warm-up period.
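To get a feel for the size of the effect, the overlap can be modeled with a little arithmetic. The sketch below is an idealized model, not Curator's scheduler: it assumes one worker per stage, a fixed per-batch time for each stage, and no transfer overhead. The function names and the example timings (2 s of CPU work and 3 s of GPU work per batch) are made up for illustration.

```python
from typing import Sequence


def sequential_makespan(stage_seconds: Sequence[float], num_batches: int) -> float:
    """Total time when stages take turns: every batch pays for every stage."""
    return num_batches * sum(stage_seconds)


def pipelined_makespan(stage_seconds: Sequence[float], num_batches: int) -> float:
    """Total time when stages overlap on different batches.

    The first batch must traverse every stage; after that, one batch
    completes each time the slowest stage finishes a batch.
    """
    if num_batches == 0:
        return 0.0
    return sum(stage_seconds) + (num_batches - 1) * max(stage_seconds)


cpu_seconds, gpu_seconds = 2.0, 3.0   # hypothetical per-batch stage times
num_batches = 100

sequential = sequential_makespan([cpu_seconds, gpu_seconds], num_batches)
pipelined = pipelined_makespan([cpu_seconds, gpu_seconds], num_batches)

# Fraction of wall-clock time the GPU stage spends doing useful work.
gpu_busy_sequential = num_batches * gpu_seconds / sequential
gpu_busy_pipelined = num_batches * gpu_seconds / pipelined

print(sequential, pipelined)  # 500.0 302.0
print(round(gpu_busy_sequential, 3), round(gpu_busy_pipelined, 3))
```

In this model the GPU stage is the bottleneck, so it stays busy for almost the entire pipelined run, while in the sequential run it is idle whenever the CPU stage is working. If the CPU stage were the slower one, the GPU would wait on it instead, which is the situation that rebalancing workers across stages is meant to avoid.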
Heterogeneous Executors
NeMo Curator supports streaming with multiple executors — Cosmos Xenna, Ray Data, and others — each optimized for different workload patterns. The executor handles scheduling, backpressure, and resource allocation so that streaming “just works” regardless of how many stages your pipeline has.
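Backpressure is the part of this that is easiest to overlook, so a generic illustration may help. The sketch below uses only the Python standard library and is not how Cosmos Xenna or Ray Data implement it; `run_pipeline` and `max_in_flight` are hypothetical names. A bounded queue sits between a fast producer stage and a consumer stage, and the producer blocks whenever the queue is full, so the number of batches waiting in memory never exceeds the bound.

```python
import queue
import threading
from typing import List, Tuple

_SENTINEL = object()  # marks the end of the stream


def run_pipeline(num_batches: int, max_in_flight: int) -> Tuple[List[int], int]:
    """Run a two-stage pipeline and return (results, peak queue depth)."""
    handoff: queue.Queue = queue.Queue(maxsize=max_in_flight)
    results: List[int] = []
    peak = 0

    def producer() -> None:
        for batch_id in range(num_batches):
            # put() blocks while the queue is full: this is the backpressure.
            handoff.put(batch_id)
        handoff.put(_SENTINEL)

    def consumer() -> None:
        nonlocal peak
        while True:
            peak = max(peak, handoff.qsize())
            item = handoff.get()
            if item is _SENTINEL:
                break
            results.append(item * item)  # stand-in for real work

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, peak


results, peak_depth = run_pipeline(num_batches=50, max_in_flight=2)
print(len(results), peak_depth)
```

The observed peak depth varies from run to run with thread scheduling, but it cannot exceed `max_in_flight`, which is what keeps memory bounded when an upstream stage is faster than a downstream one.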
Configuring Batch Size
Batch size controls the trade-off between memory usage and throughput:
```python
from nemo_curator.stages.resources import Resources

# WordCountStage is assumed to be a stage class defined or imported elsewhere.
# Configure batch size on a stage
word_count_stage = WordCountStage().with_(batch_size=128)
```
- Smaller batches: Lower memory usage per batch. Ray may handle smaller batches more efficiently in some workloads.
- Larger batches: More in-memory data per batch, which can reduce I/O overhead but uses more memory.
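The trade-off can be made concrete with a small standalone sketch. It does not use Curator's batching logic; `iter_batches` and `summarize` are hypothetical helpers that split a list of documents into fixed-size chunks and report how many batches result and how many items the largest batch holds.

```python
from typing import Dict, Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def summarize(items: Sequence[str], batch_size: int) -> Dict[str, int]:
    """Count batches and track the largest number of items held at once."""
    num_batches = 0
    largest_batch = 0
    for batch in iter_batches(items, batch_size):
        num_batches += 1
        largest_batch = max(largest_batch, len(batch))
    return {"num_batches": num_batches, "largest_batch": largest_batch}


docs = [f"doc-{i}" for i in range(1000)]

small = summarize(docs, batch_size=16)
large = summarize(docs, batch_size=128)

print(small)  # {'num_batches': 63, 'largest_batch': 16}
print(large)  # {'num_batches': 8, 'largest_batch': 128}
```

With 1,000 documents, a batch size of 16 gives 63 small handoffs between stages, while a batch size of 128 gives 8 handoffs that each hold eight times as much data. Which setting performs better depends on the per-batch overhead and the memory available to each worker.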
When Streaming Matters Most
- Datasets exceed memory. A 10 TB Common Crawl snapshot cannot fit in RAM, but it can be streamed in manageable chunks.
- Pipeline stages have different hardware needs. CPU-only and GPU stages overlap instead of taking turns.
- You need early results. Inspect output from the first batch while the rest of the dataset is still processing.