The Basics

This tutorial demonstrates the fundamentals of Data Designer by generating a product review dataset.

For more detail about column behavior, see the open-source library's version of this tutorial.

Prerequisites

Ensure you have completed the tutorial prerequisites. This tutorial uses an Inference Gateway provider, so both local CLI execution (run) and NeMo Services execution need access to the Inference Gateway API in a running NeMo Services cluster.

Part 1: Build the Configuration

Use the data_designer.config package to define your dataset schema. This configuration code is the same across the plugin execution modes.

Tip

Build the configuration once, then choose whether to execute with CLI run, CLI submit, or the SDK.

Define Models

Start by defining the models you want to use:

import data_designer.config as dd

MODEL_ALIAS = "text"

model_configs = [
    dd.ModelConfig(
        provider="default/nvidia-build",
        model="nvidia/nemotron-3-nano-30b-a3b",  # Use the `served_model_name` from the provider
        alias=MODEL_ALIAS,
        inference_parameters=dd.ChatCompletionInferenceParams(
            temperature=1.0,
            top_p=1.0,
        ),
    )
]

config_builder = dd.DataDesignerConfigBuilder(model_configs)

Add Columns

Define the columns for your dataset. The library documentation explains these column types in detail.

# Product category sampler
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="product_category",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(
            values=[
                "Electronics",
                "Clothing",
                "Home & Kitchen",
                "Books",
                "Home Office",
            ],
        ),
    )
)

# Product subcategory sampler (conditional on category)
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="product_subcategory",
        sampler_type=dd.SamplerType.SUBCATEGORY,
        params=dd.SubcategorySamplerParams(
            category="product_category",
            values={
                "Electronics": [
                    "Smartphones",
                    "Laptops",
                    "Headphones",
                    "Cameras",
                    "Accessories",
                ],
                "Clothing": [
                    "Men's Clothing",
                    "Women's Clothing",
                    "Winter Coats",
                    "Activewear",
                    "Accessories",
                ],
                "Home & Kitchen": [
                    "Appliances",
                    "Cookware",
                    "Furniture",
                    "Decor",
                    "Organization",
                ],
                "Books": [
                    "Fiction",
                    "Non-Fiction",
                    "Self-Help",
                    "Textbooks",
                    "Classics",
                ],
                "Home Office": [
                    "Desks",
                    "Chairs",
                    "Storage",
                    "Office Supplies",
                    "Lighting",
                ],
            },
        ),
    )
)

# Target age range
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="target_age_range",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(
            values=["18-25", "25-35", "35-50", "50-65", "65+"]
        ),
    )
)

# Customer details using Faker
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="customer",
        sampler_type=dd.SamplerType.PERSON_FROM_FAKER,
        params=dd.PersonFromFakerSamplerParams(age_range=[18, 70], locale="en_US"),
    )
)

# Star rating
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="number_of_stars",
        sampler_type=dd.SamplerType.UNIFORM,
        params=dd.UniformSamplerParams(low=1, high=5),
        convert_to="int",  # Convert the sampled float to an integer
    )
)

# Review style
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="review_style",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(
            values=["rambling", "brief", "detailed", "structured with bullet points"],
            weights=[1, 2, 2, 1],
        ),
    )
)

# LLM-generated product name
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="product_name",
        prompt=(
            "You are a helpful assistant that generates product names. DO NOT add quotes around the product name.\n\n"
            "Come up with a creative product name for a product in the '{{ product_category }}' category, focusing "
            "on products related to '{{ product_subcategory }}'. The target age range of the ideal customer is "
            "{{ target_age_range }} years old. Respond with only the product name, no other text."
        ),
        model_alias=MODEL_ALIAS,
    )
)

# LLM-generated customer review
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="customer_review",
        prompt=(
            "You are a customer named {{ customer.first_name }} from {{ customer.city }}, {{ customer.state }}. "
            "You are {{ customer.age }} years old and recently purchased a product called {{ product_name }}. "
            "Write a review of this product, which you gave a rating of {{ number_of_stars }} stars. "
            "The style of the review should be '{{ review_style }}'. "
            "Respond with only the review, no other text."
        ),
        model_alias=MODEL_ALIAS,
    )
)
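To make the sampler columns above more concrete, here is a minimal standard-library sketch of two ideas they rely on: relative weights for a category sampler and a subcategory drawn conditionally on its parent category. It is not Data Designer's implementation. It assumes that weights are relative (so `[1, 2, 2, 1]` normalizes to 1/6, 1/3, 1/3, 1/6), and the names `STYLES`, `SUBCATEGORIES`, and `sample_row` are hypothetical and exist only for this illustration.

```python
import random
from collections import Counter

# Values copied from the review_style column; weights are treated as relative.
STYLES = ["rambling", "brief", "detailed", "structured with bullet points"]
STYLE_WEIGHTS = [1, 2, 2, 1]

# Trimmed-down, illustrative copy of the category -> subcategory mapping.
SUBCATEGORIES = {
    "Electronics": ["Smartphones", "Laptops", "Headphones"],
    "Books": ["Fiction", "Non-Fiction", "Classics"],
}

# Normalizing the relative weights gives the probability of each style.
total_weight = sum(STYLE_WEIGHTS)
style_probabilities = {s: w / total_weight for s, w in zip(STYLES, STYLE_WEIGHTS)}


def sample_row(rng: random.Random) -> dict:
    """Draw a category, then a subcategory from that category's list only."""
    category = rng.choice(list(SUBCATEGORIES))
    return {
        "product_category": category,
        "product_subcategory": rng.choice(SUBCATEGORIES[category]),
        "review_style": rng.choices(STYLES, weights=STYLE_WEIGHTS, k=1)[0],
    }


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = [sample_row(rng) for _ in range(6000)]
style_counts = Counter(row["review_style"] for row in rows)

print(style_probabilities)
print(style_counts.most_common())
```

With enough rows, "brief" and "detailed" should each appear roughly twice as often as the other two styles, and every subcategory should belong to its row's category.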

Part 2: Execute

Now execute your configuration. You can run locally through the CLI, submit to NeMo Services, or call the Data Designer API from the SDK.

Local CLI Execution

Save the configuration in a Python file such as product_reviews.py and expose a load_config_builder() function that returns the config_builder.

def load_config_builder() -> dd.DataDesignerConfigBuilder:
    return config_builder

Preview locally:

nemo data-designer preview run product_reviews.py --num-records 5

Generate a larger dataset locally:

nemo data-designer create run product_reviews.py --num-records 30

This workload runs in the local CLI process, but because the configuration references default/nvidia-build, it still communicates with the Inference Gateway API.

NeMo Services CLI Execution

Submit the same configuration to NeMo Services when you want service-managed execution:

nemo data-designer preview submit product_reviews.py --workspace default --num-records 5
nemo data-designer create submit product_reviews.py --workspace default --profile default --num-records 30

SDK Data Designer API Execution

The DataDesignerResource is your SDK interface for Data Designer API execution. You can access it from an existing SDK instance:

import os
from nemo_platform import NeMoPlatform

base_url = os.environ.get("NMP_BASE_URL", "http://localhost:8080")
client = NeMoPlatform(base_url=base_url, workspace="default")

data_designer = client.data_designer

Previewing the Dataset

Use the preview method for API-backed rapid iteration. Generate a small sample, inspect the results, adjust your configuration, and repeat:

preview = data_designer.preview(config_builder)

# Display a random sample record
preview.display_sample_record()

# Access the full preview dataset as a pandas DataFrame
df = preview.dataset
print(df.head())

# View statistical analysis
preview.analysis.to_report()

More about preview results

The PreviewResults object returned by client.data_designer.preview holds all of its fields in memory; nothing is persisted to disk by default. To keep preview data beyond the current session, save it yourself with standard Python methods. For example, the dataset is a regular pandas DataFrame, so you can write it to disk with methods such as to_csv or to_parquet.
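As a small sketch of saving a preview, the helper below writes any DataFrame-like object to a CSV file. The function name `save_preview_dataset`, the output directory, and the file stem are hypothetical choices for this example; the only assumption about the dataset is that it offers the pandas-style `to_csv(path, index=False)` method.

```python
from pathlib import Path


def save_preview_dataset(df, out_dir="preview_outputs", stem="product_reviews_preview"):
    """Write a DataFrame-like object to CSV and return the resulting path.

    `df` only needs a pandas-style `to_csv(path, index=False)` method.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    csv_path = out_path / f"{stem}.csv"
    df.to_csv(csv_path, index=False)  # index=False omits the row index column
    return csv_path
```

In the tutorial's context this would be called as `save_preview_dataset(preview.dataset)`; swapping `to_csv` for `to_parquet` follows the same pattern when a Parquet engine is installed.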

Iterate: Adjust column configurations, prompts, or parameters in your config_builder, then run preview again until you're satisfied with the results.

Scaling Up with Jobs

When you're happy with the preview, create a larger service-managed generation job:

# Use 30 records to keep the demo fast. Happy with the output? Scale it up!
job = data_designer.create(config_builder, num_records=30)

# Block until the job completes
job.wait_until_done()

# Download the generated artifacts
results = job.download_artifacts()

# Load the dataset as a pandas DataFrame
dataset = results.load_dataset()
print(dataset.head())

# Load the full analysis report
analysis = results.load_analysis()
analysis.to_report()

More about job results

The Data Designer library writes several artifacts to disk when running a full generation job, including the final dataset as parquet. When a Data Designer job runs through NeMo Services, the entire working directory of artifacts produced by the library is saved as a job result. The download_artifacts method downloads this artifacts directory (stored as a .tar.gz archive), unarchives it, and returns a DataDesignerJobResults object that can be used to load results into memory as DataFrames or other objects for programmatic inspection.

By default, download_artifacts saves the artifacts to a relative local directory named after the job. An alternative path can be passed to download_artifacts.
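For intuition about the download-and-unarchive step, the sketch below unpacks a .tar.gz archive with the standard library and lists any Parquet files it contains. It is not the SDK's implementation: the function name `extract_artifacts` is hypothetical, and the internal layout of a real artifacts archive is not assumed beyond "it may contain .parquet files".

```python
import tarfile
from pathlib import Path


def extract_artifacts(archive_path, dest_dir):
    """Unpack a .tar.gz archive and return the Parquet files found inside."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions support extraction filters that block
            # unsafe members such as absolute paths.
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    # Search recursively, since the archive layout is not known in advance.
    return sorted(dest.rglob("*.parquet"))
```

In practice you would rely on download_artifacts and the returned DataDesignerJobResults object; a helper like this is only useful for inspecting an archive you already have on disk.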

What Happens Under the Hood

When you use CLI run:

  1. Local Execution: The Data Designer workload runs in the CLI process.
  2. Resource Resolution: The workload can use local resources, NeMo resources, or both.
  3. Generation: Data Designer resolves dependencies and generates records in the local environment.

When you use CLI submit or the SDK today:

  1. Configuration Validation: The service validates your configuration and resolves column dependencies.
  2. NeMo Services Execution: Preview runs through the Data Designer API; create runs as a service-managed job.
  3. Inference Routing: LLM calls are routed through Inference Gateway to your configured model providers.
  4. Artifact Storage: Job datasets and analysis reports are stored in job artifact storage.
  5. Job Completion: You can monitor job status and load results when complete.
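The "resolves column dependencies" step can be pictured with a short sketch: find which columns each prompt references through `{{ ... }}` placeholders, then order the columns so that every column is generated after the ones it depends on. This is an illustration only, not how Data Designer is implemented; the regular expression, the `referenced_columns` helper, and the shortened prompts are all assumptions made for the example.

```python
import re
from graphlib import TopologicalSorter

# Capture the first identifier inside "{{ ... }}", so "{{ customer.city }}"
# is treated as a reference to the "customer" column.
REFERENCE = re.compile(r"\{\{\s*([A-Za-z_]\w*)")


def referenced_columns(prompt: str) -> set:
    """Return the column names a prompt template refers to."""
    return set(REFERENCE.findall(prompt))


# Shortened stand-ins for the tutorial's columns; samplers have no prompt.
prompts = {
    "product_category": None,
    "customer": None,
    "number_of_stars": None,
    "product_name": "Name a product in the '{{ product_category }}' category.",
    "customer_review": (
        "You are {{ customer.first_name }}. Review {{ product_name }} "
        "with {{ number_of_stars }} stars."
    ),
}

# Map each column to the set of columns it depends on.
graph = {name: referenced_columns(text or "") for name, text in prompts.items()}

# static_order() yields dependencies before the columns that use them.
order = list(TopologicalSorter(graph).static_order())
print(order)
```

Whatever order is printed, `product_category` precedes `product_name`, which in turn precedes `customer_review`. A circular reference between prompts would raise `graphlib.CycleError`, which is the kind of problem a validation step can report before any generation starts.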

Next Steps