🎨 NeMo Data Designer
👋 Welcome! Data Designer is an orchestration framework for generating high-quality synthetic data. You provide LLM endpoints (NVIDIA, OpenAI, vLLM, etc.), and Data Designer handles batching, parallelism, validation, and more.
Configure columns and models → Preview samples and iterate → Create your full dataset at scale.
Unlike raw LLM calls, Data Designer gives you statistical diversity, field correlations, automated validation, and reproducible workflows. For details, see Architecture & Performance.
📝 Want to hear from the team? Check out our Dev Notes for deep dives, best practices, and insights.
Install
pip install data-designer
Setup
Get an API key from one of the default providers and set it as an environment variable:
# NVIDIA (build.nvidia.com) - recommended
export NVIDIA_API_KEY="your-api-key-here"
# OpenAI (platform.openai.com)
export OPENAI_API_KEY="your-openai-api-key-here"
# OpenRouter (openrouter.ai)
export OPENROUTER_API_KEY="your-openrouter-api-key-here"
Verify your configuration is ready:
data-designer config list
This lists the pre-configured model providers and models. To customize them, see CLI Configuration.
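As a quick sanity check from Python, the sketch below reports which of the three environment variables listed above are set in the current process. It uses only the standard library and does not call Data Designer; the helper name `find_configured_keys` is hypothetical and exists only for this illustration.

```python
import os
from typing import Mapping, Optional

# Variable names taken from the setup instructions above.
PROVIDER_KEYS = ("NVIDIA_API_KEY", "OPENAI_API_KEY", "OPENROUTER_API_KEY")


def find_configured_keys(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return {variable name: True/False} for each provider key (hypothetical helper)."""
    env = os.environ if env is None else env
    # Treat empty or whitespace-only values as unset.
    return {name: bool(env.get(name, "").strip()) for name in PROVIDER_KEYS}


if __name__ == "__main__":
    for name, is_set in find_configured_keys().items():
        print(f"{name}: {'set' if is_set else 'missing'}")
```

Only one provider key needs to be set to follow the example in the next section, which uses the `nvidia-text` model alias.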
Your First Dataset
Let's generate multilingual greetings to see Data Designer in action:
import data_designer.config as dd
from data_designer.interface import DataDesigner
# Initialize with default model providers
data_designer = DataDesigner()
config_builder = dd.DataDesignerConfigBuilder()
# Add a sampler column to randomly select a language
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="language",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(
            values=["English", "Spanish", "French", "German", "Italian"],
        ),
    )
)
# Add an LLM text generation column
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="greeting",
        model_alias="nvidia-text",
        prompt="Write a casual and formal greeting in {{ language }}.",
    )
)
# Generate a preview from the configuration
results = data_designer.preview(config_builder)
# Display one sample record from the preview
results.display_sample_record()
🎉 That's it! You've just designed your first synthetic dataset.
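To make the two columns concrete, here is a rough standard-library sketch of what they do conceptually: pick a language at random, then fill the `{{ language }}` placeholder in the prompt with that value. This is not Data Designer's implementation. The helper names `sample_language` and `render_prompt` are hypothetical, uniform sampling is an assumption, and the regex substitution only stands in for whatever template engine the library actually uses. No LLM call is made.

```python
import random
import re

# Values and prompt copied from the example above.
LANGUAGES = ["English", "Spanish", "French", "German", "Italian"]
PROMPT = "Write a casual and formal greeting in {{ language }}."


def sample_language(rng: random.Random) -> str:
    """Pick one category uniformly at random (assumed sampling behavior)."""
    return rng.choice(LANGUAGES)


def render_prompt(template: str, record: dict) -> str:
    """Replace each {{ name }} placeholder with the matching record value."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda match: str(record[match.group(1)]),
        template,
    )


rng = random.Random(0)  # fixed seed so the sketch is repeatable
records = [{"language": sample_language(rng)} for _ in range(3)]
prompts = [render_prompt(PROMPT, record) for record in records]

for prompt in prompts:
    print(prompt)
```

Each rendered prompt is what would be sent to the model for that record; in the real workflow the model's response becomes the `greeting` column.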
🚀 Next Steps
Learn More
- Deployment Options – Library vs. NeMo Microservice
- Model Configuration – Configure LLM providers and models
- Architecture & Performance – Optimize for throughput and scale