Person Sampling in Data Designer
Person sampling in Data Designer allows you to generate synthetic person data for your datasets. There are two distinct approaches, each with different capabilities and use cases.
Overview
Data Designer provides two ways to generate synthetic people:
- Faker-based sampling - Quick, basic PII generation for testing or when realistic demographic distributions are not relevant for your use case
- Nemotron-Personas datasets - Demographically accurate, rich persona data
Approach 1: Faker-Based Sampling
What It Does
Uses the Faker library to generate random personal information. The data is basic and not demographically accurate, but is useful for quick testing, prototyping, or when realistic demographic distributions are not relevant for your use case.
Features
- Gives you access to person attributes that Faker exposes
- Quick to set up with no additional downloads
- Generates random names, emails, addresses, phone numbers, etc.
- Supports all Faker-supported locales
- Not demographically grounded - data patterns don't reflect real-world demographics
Usage Example
from data_designer.essentials import (
SamplerColumnConfig,
SamplerType,
PersonFromFakerSamplerParams,
)
config_builder.add_column(
SamplerColumnConfig(
name="customer",
sampler_type=SamplerType.PERSON_FROM_FAKER,
params=PersonFromFakerSamplerParams(
locale="en_US",
age_range=[25, 65],
sex="Female",
),
)
)
For mor details, see the documentation for SamplerColumnConfig and PersonFromFakerSamplerParams.
Approach 2: Nemotron-Personas Datasets
What It Does
Uses curated Nemotron-Personas datasets from NVIDIA GPU Cloud (NGC) to generate demographically accurate person data with rich personality profiles and behavioral characteristics.
The NGC datasets are extended versions of the open-source Nemotron-Personas datasets on HuggingFace, with additional fields and enhanced data quality.
Supported locales:
en_US: United Statesja_JP: Japanen_IN: Indiahi_Deva_IN: India (Devanagari script)hi_Latn_IN: India (Latin script)
Features
- Demographically accurate personal details: Names, ages, sex, marital status, education, occupation based on census data
- Rich persona details: Comprehensive behavioral profiles including:
- Big Five personality traits with scores
- Cultural backgrounds and narratives
- Skills and hobbies
- Career goals and aspirations
- Context-specific personas (professional, financial, healthcare, sports, arts, travel, culinary, etc.)
- Consistent, referenceable attributes across your dataset
- Grounded in real-world demographic distributions
Prerequisites
To use the extended Nemotron-Personas datasets with Data Designer, you need to download them from NGC and move them to the Data Designer managed assets directory.
See below for step-by-step instructions.
Nemotron-Personas Datasets Setup Instructions
Step 0: Obtain an NGC API Key and install the NGC CLI
To download the Nemotron-Personas datasets from NGC, you will need to obtain an NGC API key and install the NGC CLI.
- NGC API Key: Obtain from NVIDIA GPU Cloud
- NGC CLI: NGC CLI
Step 1: Set Your NGC API Key
export NGC_API_KEY="your-ngc-api-key-here"
Step 2 (option 1): Download Nemotron-Personas Datasets via the Data Designer CLI
Once you have the NGC CLI and your NGC API key set up, you can download the datasets via the Data Designer CLI.
You can pass the locales you want to download as arguments to the CLI command:
data-designer download personas --locale en_US --locale ja_JP
Or you can use the interactive mode to select the locales you want to download:
data-designer download personas
Step 2 (option 2): Download Nemotron-Personas Datasets Directly
Use the NGC CLI to download the datasets:
# For Nemotron-Personas USA
ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-en_us"
# For Nemotron-Personas IN
ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-hi_deva_in"
ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-hi_latn_in"
ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-en_in"
# For Nemotron-Personas JP
ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-ja_jp"
Then move the downloaded dataset to the Data Designer managed assets directory:
mkdir -p ~/.data-designer/managed-assets/datasets/
mv nemotron-personas-dataset-*/*.parquet ~/.data-designer/managed-assets/datasets/
Step 3: Use PersonSampler in Your Code
from data_designer.essentials import (
SamplerColumnConfig,
SamplerType,
PersonSamplerParams,
)
config_builder.add_column(
SamplerColumnConfig(
name="customer",
sampler_type=SamplerType.PERSON,
params=PersonSamplerParams(
locale="en_US",
sex="Female",
age_range=[25, 45],
with_synthetic_personas=True,
),
)
)
For more details, see the documentation for SamplerColumnConfig and PersonSamplerParams.
Available Data Fields
Core Fields (all locales):
| Field | Type | Notes |
|---|---|---|
uuid |
UUID | Unique identifier |
first_name |
string | |
middle_name |
string | |
last_name |
string | |
sex |
enum | "Male" or "Female" |
birth_date |
date | Derived: year, month, day |
street_number |
int | |
street_name |
string | |
unit |
string | Address line 2 |
city |
string | |
region |
string | Alias: state |
district |
string | Alias: county |
postcode |
string | Alias: zipcode |
country |
string | |
phone_number |
PhoneNumber | Derived: area_code, country_code, prefix, line_number |
marital_status |
string | Values: never_married, married_present, separated, widowed, divorced |
education_level |
string or None | |
bachelors_field |
string or None | |
occupation |
string or None | |
email_address |
string | |
national_id |
string |
Japan-Specific Fields (ja_JP):
area
India-Specific Fields (en_IN, hi_IN, hi_Deva_IN, hi_Latn_IN):
religion- Census-reported religioneducation_degree- Census-reported education degreefirst_language- Native languagesecond_language- Second language (if applicable)third_language- Third language (if applicable)zone- Urban vs rural
With Synthetic Personas Enabled:
- Big Five personality traits (Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism) with t-scores and labels
- Cultural background narratives
- Skills and competencies
- Hobbies and interests
- Career goals
- Context-specific personas (professional, financial, healthcare, sports, arts & entertainment, travel, culinary, etc.)
Configuration Parameters
| Parameter | Type | Description |
|---|---|---|
locale |
str | Language/region code - must be one of: "en_US", "ja_JP", "en_IN", "hi_Deva_IN", "hi_Latn_IN" |
sex |
str (optional) | Filter by "Male" or "Female" |
city |
str or list[str] (optional) | Filter by specific city or cities within locale |
age_range |
list[int] (optional) | Two-element list [min_age, max_age] (default: [18, 114]) |
with_synthetic_personas |
bool (optional) | Include rich personality profiles (default: False) |
select_field_values |
dict (optional) | Custom field-based filtering (e.g., {"state": ["NY", "CA"], "education_level": ["bachelors"]}) |