LLM Client Configuration
NeMo Curator’s synthetic data generation uses OpenAI-compatible clients to communicate with LLM inference servers. This guide covers client configuration, performance tuning, and integration with various endpoints.
Overview
Two client types are available:
- AsyncOpenAIClient: Recommended for high-throughput batch processing with concurrent requests
- OpenAIClient: Synchronous client for simpler use cases or debugging
For most SDG workloads, use AsyncOpenAIClient to maximize throughput.
Basic Configuration
NVIDIA API Endpoints
from nemo_curator.models.client.openai_client import AsyncOpenAIClient
client = AsyncOpenAIClient(
    api_key="your-nvidia-api-key",  # Placeholder; load the key from NVIDIA_API_KEY in real code
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=5,
)
Environment Variables
Set your API key as an environment variable to avoid hardcoding credentials:
export NVIDIA_API_KEY="nvapi-..."
The underlying OpenAI client falls back to the OPENAI_API_KEY environment variable when no api_key is provided. NVIDIA_API_KEY is not picked up automatically, so for NVIDIA APIs, pass the key explicitly:
import os
client = AsyncOpenAIClient(
    api_key=os.environ["NVIDIA_API_KEY"],
    base_url="https://integrate.api.nvidia.com/v1",
)
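A missing variable is easier to diagnose when the lookup fails with a descriptive message. The following is a minimal sketch, assuming a hypothetical helper named `require_env` (it is not part of NeMo Curator), that reads a variable and raises if it is unset or empty:

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if unset or empty.

    Hypothetical helper for illustration; not part of NeMo Curator.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; export it before running."
        )
    return value
```

The returned string can then be passed as `api_key=require_env("NVIDIA_API_KEY")` when constructing the client.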
Generation Parameters
Configure LLM generation behavior using GenerationConfig:
from nemo_curator.models.client.llm_client import GenerationConfig
config = GenerationConfig(
    max_tokens=2048,
    temperature=0.7,
    top_p=0.95,
    seed=42,  # For reproducibility
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| max_tokens | int | 2048 | Maximum tokens to generate per request |
| temperature | float | 0.0 | Sampling temperature (0.0-2.0). Higher values increase randomness |
| top_p | float | 0.95 | Nucleus sampling parameter (0.0-1.0) |
| top_k | int | None | Top-k sampling (if supported by the endpoint) |
| seed | int | 0 | Random seed for reproducibility |
| stop | str/list | None | Stop sequences to end generation |
| stream | bool | False | Enable streaming (not recommended for batch processing) |
| n | int | 1 | Number of completions to generate per request |
| extra_kwargs | dict | None | Additional keyword arguments passed through to the OpenAI create() call |
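The ranges listed for temperature and top_p can be checked before any request is sent. This is a minimal sketch using a hypothetical `check_sampling_params` function (an assumption for illustration, not a NeMo Curator API); whether `GenerationConfig` validates these values itself is not stated in this guide.

```python
def check_sampling_params(temperature: float, top_p: float) -> None:
    """Raise ValueError if a value is outside the documented range.

    Hypothetical helper; the ranges come from the parameter table:
    temperature in [0.0, 2.0], top_p in [0.0, 1.0].
    """
    if not 0.0 <= temperature <= 2.0:
        raise ValueError(f"temperature must be in [0.0, 2.0], got {temperature}")
    if not 0.0 <= top_p <= 1.0:
        raise ValueError(f"top_p must be in [0.0, 1.0], got {top_p}")
```

Calling `check_sampling_params(0.7, 0.95)` returns without error, while a temperature of 2.5 raises a `ValueError` before any tokens are billed.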
Performance Tuning
Concurrency vs. Parallelism
The max_concurrent_requests parameter controls how many API requests the client can have in-flight simultaneously. This interacts with Ray’s distributed workers:
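- Client-level concurrency: max_concurrent_requests limits concurrent API calls per worker
- Worker-level parallelism: Ray distributes tasks across multiple workers
Because each worker applies its own limit, the endpoint can receive up to (number of workers × max_concurrent_requests) requests at once.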
# For NVIDIA API endpoints with rate limits
client = AsyncOpenAIClient(
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=3,  # Conservative for cloud APIs
)
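To illustrate what a per-client concurrency cap means in practice, here is a minimal sketch that bounds in-flight coroutines with `asyncio.Semaphore`. It shows a generic pattern and is not necessarily how `AsyncOpenAIClient` is implemented; `run_bounded` and `fake_request` are hypothetical names, and the sleep stands in for a network call.

```python
import asyncio


async def run_bounded(num_requests: int, max_concurrent_requests: int) -> int:
    """Run simulated requests under a concurrency cap; return the peak in-flight count."""
    semaphore = asyncio.Semaphore(max_concurrent_requests)
    in_flight = 0
    peak = 0

    async def fake_request() -> None:
        nonlocal in_flight, peak
        async with semaphore:  # Waits here once the cap is reached
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # Stand-in for the HTTP round trip
            in_flight -= 1

    # Schedule every request at once; the semaphore decides how many proceed.
    await asyncio.gather(*(fake_request() for _ in range(num_requests)))
    return peak
```

Running `asyncio.run(run_bounded(num_requests=20, max_concurrent_requests=3))` schedules all 20 coroutines immediately, but the fourth and later ones wait until an earlier request releases the semaphore.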
Retry Configuration
The client includes automatic retry with exponential backoff for transient errors:
client = AsyncOpenAIClient(
    base_url="https://integrate.api.nvidia.com/v1",
    max_retries=3,  # Number of retry attempts
    base_delay=1.0,  # Base delay in seconds
    timeout=120,  # Request timeout in seconds
)
The retry logic handles:
- Rate limit errors (429): Automatic backoff with jitter
- Connection errors: Retry with exponential delay
- Transient failures: Configurable retry attempts
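For intuition about how max_retries and base_delay interact, the sketch below computes a common exponential backoff schedule with full jitter. The formula (a ceiling of `base_delay * 2**attempt`, with the actual wait drawn uniformly below it) is an assumption for illustration; the client's exact schedule may differ. `backoff_delays` is a hypothetical name.

```python
import random
from typing import List, Optional


def backoff_delays(
    max_retries: int,
    base_delay: float,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return one jittered wait time (in seconds) per retry attempt.

    Assumed schedule, not taken from the library: the ceiling doubles on
    each attempt, and the wait is sampled uniformly in [0, ceiling].
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        ceiling = base_delay * (2 ** attempt)  # 1x, 2x, 4x, ... of base_delay
        delays.append(rng.uniform(0.0, ceiling))  # Full jitter spreads out retries
    return delays
```

Under this assumed schedule, max_retries=3 and base_delay=1.0 give ceilings of 1, 2, and 4 seconds, so the retries add at most 7 seconds of waiting in total. Doubling base_delay doubles every ceiling.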
Using Other OpenAI-Compatible Endpoints
AsyncOpenAIClient works with any OpenAI-compatible API endpoint. Set the base_url and api_key parameters for the target service:
# OpenAI API
client = AsyncOpenAIClient(
    base_url="https://api.openai.com/v1",
    api_key="sk-...",  # Or set OPENAI_API_KEY env var
    max_concurrent_requests=5,
)
# Any OpenAI-compatible endpoint
client = AsyncOpenAIClient(
    base_url="http://your-endpoint/v1",
    api_key="your-api-key",
    max_concurrent_requests=5,
)
Local Inference with InferenceServer
To serve models locally and connect them to AsyncOpenAIClient, use NeMo Curator’s built-in Inference Server (Ray Serve + vLLM):
from nemo_curator.core.serve import InferenceModelConfig, InferenceServer
from nemo_curator.models.client.openai_client import AsyncOpenAIClient
config = InferenceModelConfig(
    model_identifier="meta-llama/Llama-3-8B-Instruct",
    engine_kwargs={"tensor_parallel_size": 2},
)
with InferenceServer(models=[config]) as server:
    client = AsyncOpenAIClient(
        base_url=server.endpoint,
        api_key="unused",
        max_concurrent_requests=10,
    )
    # Use client in pipeline stages
Complete Example
import os
from nemo_curator.models.client.openai_client import AsyncOpenAIClient
from nemo_curator.models.client.llm_client import GenerationConfig
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.synthetic.qa_multilingual_synthetic import QAMultilingualSyntheticStage
# Configure client
client = AsyncOpenAIClient(
    api_key=os.environ["NVIDIA_API_KEY"],  # Raises KeyError if the variable is unset
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=5,
    max_retries=3,
    base_delay=1.0,
)
# Configure generation
config = GenerationConfig(
    temperature=0.9,
    top_p=0.95,
    max_tokens=2048,
)
# Use in a pipeline stage
pipeline = Pipeline(name="sdg_example")
pipeline.add_stage(
    QAMultilingualSyntheticStage(
        prompt="Generate a Q&A pair about science in {language}.",
        languages=["English", "French", "German"],
        client=client,
        model_name="meta/llama-3.3-70b-instruct",
        num_samples=100,
        generation_config=config,
    )
)
Troubleshooting
Rate Limit Errors
If you encounter frequent 429 errors:
- Reduce max_concurrent_requests
- Increase base_delay for longer backoff
- Consider using a local deployment for high-volume workloads
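When several Ray workers share one endpoint, a per-worker value for max_concurrent_requests can be derived from an overall budget. The sketch below assumes you know (or have estimated) how many simultaneous requests the endpoint tolerates; `per_worker_concurrency` and its parameters are hypothetical names used only for illustration.

```python
def per_worker_concurrency(total_in_flight_budget: int, num_workers: int) -> int:
    """Split an endpoint-wide in-flight budget evenly across workers.

    Hypothetical helper. Returns at least 1, so if the budget is smaller
    than the worker count, the total can still exceed the budget.
    """
    if total_in_flight_budget < 1 or num_workers < 1:
        raise ValueError("budget and worker count must both be positive")
    return max(1, total_in_flight_budget // num_workers)
```

For example, a budget of 12 simultaneous requests across 4 workers gives 3 per worker. If the budget is smaller than the number of workers, the result is clamped to 1, and reducing the worker count is the remaining lever.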
Connection Timeouts
For slow networks or high-latency endpoints:
client = AsyncOpenAIClient(
    base_url="...",
    timeout=300,  # Increase from default 120 seconds
)
Next Steps
- Multilingual Q&A: Generate multilingual Q&A pairs
- Nemotron-CC: Advanced text transformation pipelines