# Getting Started
NeMo Safe Synthesizer generates synthetic tabular data by fine-tuning a pretrained LLM on your dataset and sampling from the trained model. This page covers installation, a quick-start example, and a walkthrough of what the pipeline does at each stage.
## Installation

### Prerequisites
- Python 3.11–3.13 (dev tooling currently pins 3.11 via `.python-version` in the repo root; Python 3.14+ is not supported -- see Troubleshooting)
- CUDA runtime 12.8+
- NVIDIA GPU (A100 or larger) for training and generation
Linux only -- macOS, Windows, and Apple Silicon are not supported
NeMo Safe Synthesizer requires a Linux machine with an NVIDIA GPU and CUDA 12.8+ to run the training and generation pipeline. The CPU install tab below is for development and configuration validation only -- it cannot train models or generate synthetic data.
### Install the Package
The CUDA and CPU extras depend on packages (PyTorch, FlashInfer) hosted on indexes outside PyPI. You must pass the extra index URLs shown below.
```shell
uv pip install "nemo-safe-synthesizer[cu128,engine]" \
  --index https://flashinfer.ai/whl/cu128 \
  --index https://download.pytorch.org/whl/cu128 \
  --index-strategy unsafe-best-match
```
Why `--index-strategy unsafe-best-match`

FlashInfer publishes wheels to flashinfer.ai, but `flashinfer-python` also appears on the PyTorch index at older versions. uv's default first-match strategy stops at the first index that contains a package name, so it picks up the wrong version from the PyTorch index and fails to resolve. `--index-strategy unsafe-best-match` tells uv to consider all indexes and pick the best matching version.
On macOS, PyTorch ships standard wheels on PyPI, so no extra indexes are needed.
On Linux, the CPU-only PyTorch wheels (`+cpu` local version) are hosted on a separate PyTorch index.
Development use only
The CPU install does not support training or generation. Use it to validate configuration, explore the CLI, or import config classes in code. An A100 or larger GPU is required to run the full pipeline.
```shell
make container-build-gpu
docker run --gpus all --shm-size=1g \
  -v $(pwd):/workspace \
  -v ~/.cache/huggingface:/workspace/.hf_cache \
  -e HF_HOME=/workspace/.hf_cache \
  nss-gpu:latest run --config /workspace/config.yaml --data-source /workspace/data.csv
```
No local Python install needed. See Docker for full setup, volume mounts, and offline usage.
The bare package has no PyTorch or FlashInfer dependencies.
Limited use
The bare package includes only the Pydantic configuration models -- no training, generation, or CLI engine. Use it in the NeMo Safe Synthesizer Service or any Python project that needs to construct or validate `SafeSynthesizerParameters` without pulling in the full ML stack.
### Verify
After installing, activate your Python virtual environment and confirm the CLI is available by running `safe-synthesizer --help`:
Expected output:
```text
Usage: safe-synthesizer [OPTIONS] COMMAND [ARGS]...

  NeMo Safe Synthesizer command-line interface. This application is used to
  run the Safe Synthesizer pipeline. It can be used to train a model, generate
  synthetic data, and evaluate the synthetic data. It can also be used to
  modify a config file.

Options:
  --help  Show this message and exit.

Commands:
  artifacts  Artifacts management commands.
  config     Manage Safe Synthesizer configurations.
  run        Run the Safe Synthesizer end-to-end pipeline.
```
## Quick Start
Create a synthetic version of an input dataset in one step with the `safe-synthesizer run` command.
You can use Clinc OOS as an example dataset: export the split you want (for example the training split) to a CSV file, then point `--data-source` at that file.
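Exporting Clinc OOS would typically go through the Hugging Face `datasets` library. If you just want a file to smoke-test the pipeline, any CSV with enough rows works; here is a minimal pandas sketch that fabricates a 1,000-row stand-in (the column names `text` and `intent` are illustrative assumptions, not a schema the tool requires):

```python
import pandas as pd

# Fabricate a 1,000-row stand-in dataset -- the minimum recommended size.
# The columns "text" and "intent" are illustrative, not required by the tool.
rows = [
    {"text": f"transfer {i} dollars to savings", "intent": "transfer"}
    for i in range(1000)
]
pd.DataFrame(rows).to_csv("data.csv", index=False)
```

Then pass `--data-source data.csv` when you run the pipeline.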
Replace `data.csv` (or `clinc_oos.csv`) with your actual input file. Any `.csv`, `.json`, `.jsonl`, `.parquet`, or `.txt` file works -- see Running Safe Synthesizer -- Data Input for all supported formats.
Your dataset should have at least 1,000 records (10,000 records if using differential privacy).
Use `--log-format plain` or set `NSS_LOG_FORMAT=plain` for more readable log output when using a non-interactive terminal (for example CI or captured logs). See Log format for details.
The command above fine-tunes a LoRA adapter on your data, generates 1000 synthetic records, and produces an evaluation report. The default outputs are placed in `./safe-synthesizer-artifacts/<config>---<dataset>/<run_name>/`:

- `generate/synthetic_data.csv` -- the synthetic dataset
- `generate/evaluation_report.html` -- quality and privacy scores
- `generate/evaluation_metrics.json` -- machine-readable evaluation scores and timing
- `train/adapter/` -- trained adapter (reusable for more generation)
To run the same pipeline from Python, see Running Safe Synthesizer -- SDK.
→ Next step: read Evaluation to understand your first report and how to interpret SQS and DPS scores.
## How the Pipeline Works
The pipeline has five stages. Each is independently configurable -- you can run the full pipeline in one step, or execute stages individually (train once, generate many times). You can find a brief overview of each stage below, or read Pipeline for in-depth descriptions.
```mermaid
flowchart LR
    Data["Data Input"] --> PII["PII Replacement<br/>(on by default)"]
    PII --> Train["Training"]
    Train --> Gen["Generation"]
    Gen --> Eval["Evaluation"]
```
### 1. Data Input
The pipeline loads your input data (CSV, JSON, JSONL, Parquet, or DataFrame) and prepares it for training:
- Column type inference and validation
- Grouping and ordering (if configured via `data.group_training_examples_by` and `data.order_training_examples_by`)
- Train/test split -- a holdout set (default 5%) is reserved for evaluation
- Records are serialized to JSONL and tokenized; records that exceed the model's context window raise a `GenerationError` rather than being silently truncated
Your dataset should have at least 1,000 records (10,000 records if enabling differential privacy).
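The 5% holdout behaves like an ordinary random train/test split. A rough pandas equivalent (the pipeline's actual sampling details are internal and may differ):

```python
import pandas as pd

# Toy frame standing in for your dataset.
df = pd.DataFrame({"value": range(2000)})

# Reserve a 5% holdout for evaluation, mirroring the pipeline's default split.
holdout = df.sample(frac=0.05, random_state=42)
train = df.drop(holdout.index)

print(len(train), len(holdout))  # -> 1900 100
```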
### 2. PII Replacement
PII replacement is on by default as a pre-processing step. The PII replacer detects personally identifiable information (PII) using GLiNER NER and optional LLM-based column classification, then replaces detected entities with synthetic but realistic values prior to fine-tuning. This ensures the model never has the opportunity to learn the most sensitive information (e.g. names, addresses, identifiers) from the training data. See Supported Entity Types for the full entity list.
See Configuration -- Replacing PII for entity types, LLM classification setup, and SDK customization.
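The actual detector is GLiNER NER plus optional LLM column classification; as a toy illustration of the detect-then-replace-with-synthetic-values idea only (the regexes and replacement values below are made up and far weaker than the real detector):

```python
import re

# Made-up synthetic stand-in values; the real pipeline generates realistic fakes.
FAKE = {"EMAIL": "jane.doe@example.com", "PHONE": "555-0100"}
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\d{3}-\d{3}-\d{4}"),
}

def replace_pii(text: str) -> str:
    # Substitute each detected entity with its synthetic stand-in.
    for label, pattern in PATTERNS.items():
        text = pattern.sub(FAKE[label], text)
    return text

print(replace_pii("Call 212-555-1234 or mail bob@corp.com"))
```

Because replacement happens before fine-tuning, the model never sees the original values at all.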
### 3. Training
Fine-tunes a base LLM using LoRA (Low-Rank Adaptation). Two backends are available: Unsloth (default, faster) and HuggingFace (required for differential privacy). Both perform LoRA fine-tuning; see Running -- Training for a comparison.
The default model is `HuggingFaceTB/SmolLM3-3B`. Safe Synthesizer has tested support for `HuggingFaceTB/SmolLM3-3B`, `TinyLlama/TinyLlama-1.1B-Chat-v1.0`, and `mistralai/Mistral-7B-Instruct-v0.3` (see Configuration -- Training for details on how to change the backend or model).
Training requires a single NVIDIA GPU (A100 or larger); multi-GPU training is not supported.
Differential privacy
For formal privacy guarantees, enable Differentially Private Stochastic Gradient Descent (DP-SGD) when fine-tuning via `privacy.dp_enabled: true`. See Configuration -- Differential Privacy.
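Pulling together only the dotted keys mentioned on this page (`data.group_training_examples_by`, `data.order_training_examples_by`, `privacy.dp_enabled`), a config sketch might look like the following. The nesting is inferred from the dotted names and the column values are hypothetical, so check Configuration for the real schema:

```yaml
# Sketch only -- structure inferred from the dotted key names on this page.
data:
  group_training_examples_by: customer_id   # hypothetical column name
  order_training_examples_by: event_time    # hypothetical column name
privacy:
  dp_enabled: true   # requires the HuggingFace training backend
```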
### 4. Generation
Produces synthetic records using the trained LoRA adapter via vLLM. The generation stage samples from the fine-tuned model until the requested number of valid records is reached, with configurable stopping conditions for quality control.
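The sample-until-enough-valid-records loop can be pictured with this toy sketch; `sample_batch` and `is_valid` are stand-ins, not the real vLLM-backed API:

```python
import random

def sample_batch(n):
    # Stand-in for sampling n records from the fine-tuned model via vLLM.
    return [{"value": random.random()} for _ in range(n)]

def is_valid(record):
    # Stand-in validity check; the real pipeline validates parsed records.
    return record["value"] >= 0.2

def generate(num_records, batch_size=64, max_batches=1000):
    valid = []
    # Keep sampling until the requested number of valid records is reached,
    # with a stopping condition (max_batches) as a quality-control backstop.
    for _ in range(max_batches):
        valid.extend(r for r in sample_batch(batch_size) if is_valid(r))
        if len(valid) >= num_records:
            return valid[:num_records]
    raise RuntimeError("stopping condition hit before enough valid records")

records = generate(100)
print(len(records))
```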
See Configuration -- Generation.
### 5. Evaluation
Measures quality and privacy of the synthetic data and produces an HTML report with interactive visualizations. Two composite scores are reported:
- SQS (Synthetic Quality Score) -- composite quality score with five subscores:
    - Column Correlation Stability -- measures how well correlations between every pair of numeric and categorical columns are preserved
    - Deep Structure Stability -- compares numeric and categorical columns in the training and synthetic data using Principal Component Analysis (PCA)
    - Column Distribution Stability -- measures how closely the distribution of each numeric and categorical column is preserved
    - Text Structure Similarity -- compares the sentence, word, and character counts of text columns
    - Text Semantic Similarity -- measures whether the semantic meaning in text columns held after synthesizing
- DPS (Data Privacy Score) -- composite privacy score with three subscores:
    - Membership Inference Protection -- measures whether a model trained on the data can distinguish training records from held-out records
    - Attribute Inference Protection -- measures whether an attacker can infer a sensitive attribute from quasi-identifiers in the synthetic data
    - PII Replay Detection -- checks whether PII from training appears in synthetic data
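For rough intuition on the distribution-stability idea (this is not the product's actual metric), you can compare per-column histograms of training vs. synthetic data with a total-variation distance:

```python
import numpy as np

def tv_distance(a, b, bins=10):
    # Histogram both columns on a shared range and compare normalized
    # bin masses; 0 = identical distributions, 1 = completely disjoint.
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    pa, _ = np.histogram(a, bins=bins, range=(lo, hi))
    pb, _ = np.histogram(b, bins=bins, range=(lo, hi))
    pa = pa / pa.sum()
    pb = pb / pb.sum()
    return 0.5 * np.abs(pa - pb).sum()

rng = np.random.default_rng(0)
train = rng.normal(0, 1, 5000)
close = rng.normal(0, 1, 5000)  # synthetic data matching the training distribution
far = rng.normal(3, 1, 5000)    # synthetic data that drifted

print(tv_distance(train, close), tv_distance(train, far))
```

A faithful synthesizer keeps the per-column distance small; drift shows up as a distance close to 1.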
See Evaluation for how to interpret scores.
## What to Read Next
After your first run, read Evaluation to understand your SQS and DPS scores and Synthetic Data Quality to learn how to improve generation quality. If your job failed to run, read Troubleshooting to learn how to fix common errors.
### Guides
- **Configuration** -- Synthesis parameters for training, generation, PII, DP, evaluation, and time series.
- **Running Safe Synthesizer** -- How to run the pipeline, CLI commands, individual stages, logging, and artifacts.
- **Environment Variables** -- Artifact paths, logging, model caching, NIM endpoints, and WandB.
- **Troubleshooting** -- Common errors, OOM fixes, offline setup, and configuration gotchas.
- **Synthetic Data Quality** -- Improving generation quality, the quality-privacy tradeoff, and unavailable metrics.