Data Synthesis¶
The synthesizer component is the core of the NeMo Safe Synthesizer product. It uses LLM-based fine-tuning to generate realistic synthetic data that maintains the utility of your original dataset while providing privacy protection.
How It Works¶
NeMo Safe Synthesizer employs a novel approach to synthetic data generation:
- Tabular Fine-Tuning: Fine-tunes a language model on your tabular data to learn patterns, correlations, and statistical properties
- Generation: Uses the fine-tuned model to generate new synthetic records that maintain data utility
- Privacy Protection: Optionally applies differential privacy during training for mathematical privacy guarantees
Creating synthetic versions of private data allows you to unlock insights without compromising privacy, enabling downstream use cases like AI model training and analytics.
Key Features¶
LLM-Based Fine-Tuning¶
NeMo Safe Synthesizer adapts language models to understand and generate tabular data:
- Converts tabular data into text sequences suitable for LLM training
- Fine-tunes on your dataset to capture patterns and correlations
- Generates new records that maintain statistical properties with no one-to-one mapping to original records
- Supports various model sizes and architectures
Two backends are available: Unsloth (default, faster) and HuggingFace (required for differential privacy). Both perform LoRA fine-tuning; see Running -- Training for a comparison.
Three models have been extensively tested:
| Family | HuggingFace ID |
|---|---|
| SmolLM3 (default) | HuggingFaceTB/SmolLM3-3B |
| TinyLlama | TinyLlama/TinyLlama-1.1B-Chat-v1.0 |
| Mistral | mistralai/Mistral-7B-Instruct-v0.3 |
For information on the trade-offs with model selection, see Training.
Supported Data Types¶
NeMo Safe Synthesizer supports diverse tabular data:
- Numeric: Continuous and discrete numerical values
- Categorical: Text labels and categories
- Text: Free-form text fields
- Temporal: Event sequences and time series (Note: Temporal dataset support is currently experimental)
Your dataset should have at least 1,000 records (10,000 records if enabling differential privacy).
Differential Privacy¶
A high level of privacy protection is achieved simply through the process of generating synthetic data, and is often a sufficient balance between privacy and utility. For use cases that require maximum privacy, you can fine-tune with differential privacy.
Differential privacy (DP) is the gold standard for privacy protection, providing mathematical guarantees that individual records cannot be identified. When you enable DP, NeMo Safe Synthesizer introduces calibrated noise during model training to ensure that the synthetic data generation process is provably private.
Mathematical Guarantee¶
DP ensures that the output of an algorithm is nearly identical whether or not any single record is included in the training data:
Where:
- \(M\) is the mechanism (process of training the model)
- \(D_1\) and \(D_2\) are datasets differing by one record
- \(\varepsilon\) (epsilon) controls privacy loss -- lower values provide stronger privacy
- \(\delta\) (delta) is the failure probability
- \(S\) is any subset of possible outputs
DP-SGD Implementation¶
NeMo Safe Synthesizer uses Differentially Private Stochastic Gradient Descent (DP-SGD) to add privacy guarantees during model training:
- Per-sample gradient computation: Calculate gradients for each training example individually
- Gradient clipping: Clip L2 norm to
per_sample_max_grad_normto bound sensitivity - Noise injection: Add calibrated Gaussian noise to gradients based on privacy budget
- Privacy accounting: Track cumulative privacy loss using Rényi Differential Privacy (RDP)
By default, record-level differential privacy is used. When group_training_examples_by is set, group-level privacy applies -- guarantees cover entire groups of records rather than individual records.
Privacy vs Utility Trade-off¶
Enabling DP provides strong privacy guarantees but affects synthetic data quality:
- Lower epsilon: stronger privacy, but more noise and potentially lower utility
- Training speed: DP training is usually 2-4x slower due to per-sample gradient computation
- Data requirements: DP works best with larger datasets (10,000+ records recommended, compared to 1,000+ records without DP)
- Quality impact: The added noise may reduce statistical fidelity of synthetic data
Configuration Parameters¶
For the complete list of configuration parameters, see the Parameters Reference. Some commonly used parameters are listed below:
| Parameter | Type | Default | Description |
|---|---|---|---|
dp_enabled |
bool | false |
Enable differential privacy |
epsilon |
float | 8.0 |
Privacy budget (lower = more private, typical range: 4-12) |
delta |
float/auto | "auto" |
Failure probability (auto = 1/n^1.2 based on dataset size) |
per_sample_max_grad_norm |
float | 1.0 |
Gradient clipping threshold |
Recommendations¶
Starting point: Begin with \(\varepsilon \in [8, 12]\) and reduce as needed based on privacy requirements and acceptable quality trade-offs.
Delta calculation: Use "auto" (recommended), which sets \(\delta = \frac{1}{n^{1.2}}\) based on dataset size \(n\). Manual values are typically between 1e-6 and 1e-4.
Data size: DP performs best with 10,000+ training records. Smaller datasets may experience significant quality degradation due to the noise required for privacy guarantees.
For hands-on guidance, see Running Safe Synthesizer -- Differential Privacy and Configuration -- Differential Privacy. For complete parameter documentation, refer to Parameters Reference.
Configuration¶
Synthesis behavior is controlled through configuration parameters:
- Training: Model selection, training parameters, sequence configuration
- Generation: Number of records, temperature, sampling strategies
- Privacy: Differential privacy parameters (epsilon, delta, clipping)
For a complete list of all available parameters and their defaults, refer to Parameters Reference.
Related Topics¶
- Pipeline: Pipeline overview and stages
- Getting Started: Install and run your first pipeline
- Configuration: Complete parameter reference