🎨 NeMo Data Designer

👋 Welcome! Data Designer is an orchestration framework for generating high-quality synthetic data. You provide LLM endpoints (NVIDIA, OpenAI, vLLM, etc.), and Data Designer handles batching, parallelism, validation, and more.

Configure columns and models → Preview samples and iterate → Create your full dataset at scale.

Unlike raw LLM calls, Data Designer gives you statistical diversity, field correlations, automated validation, and reproducible workflows. For details, see Architecture & Performance.

📝 Want to hear from the team? Check out our Dev Notes for deep dives, best practices, and insights.

Install

pip install data-designer

Setup

Get an API key from one of the default providers and set it as an environment variable:

# NVIDIA (build.nvidia.com) - recommended
export NVIDIA_API_KEY="your-api-key-here"

# OpenAI (platform.openai.com)
export OPENAI_API_KEY="your-openai-api-key-here"

# OpenRouter (openrouter.ai)
export OPENROUTER_API_KEY="your-openrouter-api-key-here"

Verify your configuration is ready:

data-designer config list

This displays the pre-configured model providers and models. See CLI Configuration to customize.
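
If you prefer to check the environment from Python before running anything, a minimal sketch along these lines may help. It uses only the standard library and the three variable names from the setup step above. The helper name `configured_providers` is hypothetical and is not part of Data Designer.

```python
import os

# Environment variable names taken from the setup step above.
PROVIDER_KEYS = ["NVIDIA_API_KEY", "OPENAI_API_KEY", "OPENROUTER_API_KEY"]


def configured_providers(env=None):
    """Return the provider key names that are set to a non-empty value.

    Hypothetical helper for illustration; not a Data Designer API.
    """
    env = os.environ if env is None else env
    return [name for name in PROVIDER_KEYS if env.get(name, "").strip()]


if __name__ == "__main__":
    found = configured_providers()
    if found:
        print("API keys found:", ", ".join(found))
    else:
        print("No provider API key is set; export one before continuing.")
```

This only confirms that a variable is present. It does not validate the key against the provider, so `data-designer config list` remains the check described in this guide.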

Your First Dataset

Let's generate multilingual greetings to see Data Designer in action:

import data_designer.config as dd
from data_designer.interface import DataDesigner

# Initialize with default model providers
data_designer = DataDesigner()
# Builder that collects the column definitions for this dataset
config_builder = dd.DataDesignerConfigBuilder()

# Add a sampler column to randomly select a language
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="language",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(
            values=["English", "Spanish", "French", "German", "Italian"],
        ),
    )
)

# Add an LLM text generation column; {{ language }} in the prompt
# refers to the "language" sampler column defined above
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="greeting",
        model_alias="nvidia-text",
        prompt="Write a casual and formal greeting in {{ language }}.",
    )
)

# Generate a preview
results = data_designer.preview(config_builder)
results.display_sample_record()

🎉 That's it! You've just designed your first synthetic dataset.
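
To build intuition for how the two columns relate, here is a rough, library-free sketch of the idea: a value is sampled for `language`, then substituted into the prompt wherever `{{ language }}` appears. This is an illustration under simplifying assumptions, not Data Designer's implementation. The function names are hypothetical, the regex substitution stands in for whatever templating the library actually uses, and no LLM is called.

```python
import random
import re

# Values and prompt copied from the example above.
LANGUAGES = ["English", "Spanish", "French", "German", "Italian"]
PROMPT_TEMPLATE = "Write a casual and formal greeting in {{ language }}."


def sample_language(rng):
    """Pick one language uniformly at random (stand-in for the sampler column)."""
    return rng.choice(LANGUAGES)


def render_prompt(template, record):
    """Replace each {{ name }} placeholder with the record's value for that column."""
    def substitute(match):
        return str(record[match.group(1)])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


rng = random.Random(0)  # seeded so repeated runs pick the same language
record = {"language": sample_language(rng)}
prompt = render_prompt(PROMPT_TEMPLATE, record)
print(prompt)  # the text that would be sent to the model for this record
```

In the real workflow, the rendered prompt is sent to the model behind `model_alias`, and the response becomes the `greeting` value for that record.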

🚀 Next Steps

  • Tutorials

    Step-by-step notebooks covering core features

  • Recipes

    Ready-to-use examples for common use cases

  • Concepts

    Deep dive into columns, models, and configuration

Learn More