Text Embedding
Generate text embeddings for large-scale datasets using NeMo Curator’s built-in embedding stages. Text embeddings enable downstream tasks such as semantic deduplication, similarity search, and clustering.
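To make the similarity-search use case concrete, here is a minimal sketch in plain Python that ranks documents by cosine similarity to a query vector. The three-dimensional vectors, the document keys, and the helper names `cosine_similarity` and `top_k` are made up for illustration; they are not part of NeMo Curator, and real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, corpus, k=2):
    """Return the k corpus keys whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in corpus.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [key for _, key in scored[:k]]


# Toy vectors standing in for model output (hypothetical values).
corpus = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}
query = [1.0, 0.05, 0.0]

nearest = top_k(query, corpus, k=2)
print(nearest)
```

At dataset scale this brute-force scan would normally be replaced by a vector index, but the ranking criterion is the same.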
How It Works
NeMo Curator provides three embedding backends for text data, each suited to different model sizes and throughput requirements:
- EmbeddingCreatorStage: A composite stage that handles tokenization and embedding in sequence. Supports both Sentence Transformers’ SentenceTransformer and Hugging Face’s AutoModel classes via the use_sentence_transformer flag.
- VLLMEmbeddingModelStage: A standalone stage that uses vLLM for GPU-accelerated embedding generation with optional pretokenization. Best for large embedding models where vLLM’s batching and GPU utilization provide significant throughput gains.
- SentenceTransformerEmbeddingModelStage: A model stage that uses the sentence-transformers library directly. Used internally by EmbeddingCreatorStage when use_sentence_transformer=True.
Choosing an Embedding Backend
| Backend | Best For | GPU Utilization | Setup |
|---|---|---|---|
| EmbeddingCreatorStage (Sentence Transformers) | Small to medium models (e.g., all-MiniLM-L6-v2) | Good | Included in text_cuda12 extra |
| VLLMEmbeddingModelStage | Large models (e.g., google/embeddinggemma-300m) and semantic deduplication | Excellent | Included in text_cuda12 extra |
| EmbeddingCreatorStage (AutoModel) | Custom pooling strategies | Good | Set use_sentence_transformer=False |
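A "pooling strategy" decides how per-token vectors from a model are collapsed into one embedding per document. The sketch below shows two common choices, masked mean pooling and first-token (CLS) pooling, in plain Python. It is a conceptual illustration only: the function names and toy values are hypothetical, and which pooling options EmbeddingCreatorStage actually exposes is not stated here, so check the API reference before relying on a particular setting.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                totals[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [total / count for total in totals]


def cls_pool(token_embeddings):
    """Use the first token's vector as the document embedding."""
    return list(token_embeddings[0])


# Three token vectors; the last one is padding (mask = 0). Values are made up.
tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 0]

pooled_mean = mean_pool(tokens, mask)
pooled_cls = cls_pool(tokens)
print(pooled_mean, pooled_cls)
```

Masking matters because padded positions would otherwise pull the mean toward zero for short documents in a batch.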
Quick Start
EmbeddingCreatorStage
from nemo_curator.backends.xenna import XennaExecutor
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.embedders import EmbeddingCreatorStage
from nemo_curator.stages.text.io.reader import ParquetReader
from nemo_curator.stages.text.io.writer import ParquetWriter

pipeline = Pipeline(
    name="text_embeddings",
    stages=[
        # Read only the "text" column, one input file per partition.
        ParquetReader(file_paths="input_data/", files_per_partition=1, fields=["text"]),
        # Tokenize and embed each document; results go in the "embeddings" column.
        EmbeddingCreatorStage(
            model_identifier="sentence-transformers/all-MiniLM-L6-v2",
            text_field="text",
            embedding_field="embeddings",
            model_inference_batch_size=256,
        ),
        # Write the text alongside its embedding.
        ParquetWriter(path="output/", fields=["text", "embeddings"]),
    ],
)

executor = XennaExecutor()
pipeline.run(executor)
VLLMEmbeddingModelStage (Recommended for Semantic Deduplication)
VLLMEmbeddingModelStage is the default embedding backend for semantic deduplication, using google/embeddinggemma-300m. It provides better GPU utilization and throughput for large embedding models. See the vLLM Embedder guide for setup, configuration, and code examples.
Integration with Semantic Deduplication
Text embeddings are a key input for semantic deduplication. The TextSemanticDeduplicationWorkflow uses VLLMEmbeddingModelStage internally, but you can also generate embeddings separately and feed them into the deduplication workflow for more control over the embedding process.
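To illustrate why embeddings are the key input, the following sketch removes near-duplicates from a set of precomputed embeddings using a cosine-similarity threshold. It is a deliberately simplified, brute-force illustration and not the algorithm used by TextSemanticDeduplicationWorkflow; the function name `greedy_semantic_dedup`, the 0.95 threshold, and the toy vectors are all assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def greedy_semantic_dedup(embeddings, threshold=0.95):
    """Keep a document only if it is below the threshold against every kept one.

    Documents are visited in insertion order, so the first member of each
    near-duplicate group is the one that survives.
    """
    kept = []
    for doc_id, vector in embeddings.items():
        is_duplicate = any(
            cosine_similarity(vector, embeddings[kept_id]) >= threshold
            for kept_id in kept
        )
        if not is_duplicate:
            kept.append(doc_id)
    return kept


# Toy 2-dimensional embeddings (hypothetical values).
embeddings = {
    "doc_1": [1.0, 0.0],
    "doc_2": [0.99, 0.05],  # nearly parallel to doc_1
    "doc_3": [0.0, 1.0],
}

kept_ids = greedy_semantic_dedup(embeddings, threshold=0.95)
print(kept_ids)
```

The pairwise comparison here is quadratic in the number of documents, which is why this approach only suits small samples, for example when sanity-checking a threshold before running deduplication on the full dataset.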