# Getting Started
NeMo Safe Synthesizer generates synthetic tabular data by fine-tuning a pretrained LLM on your dataset and sampling from the trained model. This page covers installation, a quick-start example, and a walkthrough of what the pipeline does at each stage.
## Installation

### Prerequisites
- Python 3.11–3.13 (dev tooling currently pins 3.11 via `.python-version` in the repo root; Python 3.14+ is not supported -- see Troubleshooting)
- CUDA runtime 12.8+
- NVIDIA GPU (A100 or larger) for training and generation
Linux only -- macOS, Windows, and Apple Silicon are not supported
NeMo Safe Synthesizer requires a Linux machine with an NVIDIA GPU and CUDA 12.8+ to run the training and generation pipeline. The CPU install tab below is for development and configuration validation only -- it cannot train models or generate synthetic data.
### Install the Package
The CUDA and CPU extras depend on packages (PyTorch, FlashInfer) hosted on indexes outside PyPI. You must pass the extra index URLs shown below.
```shell
uv pip install "nemo-safe-synthesizer[cu128,engine]" \
  --index https://flashinfer.ai/whl/cu128 \
  --index https://download.pytorch.org/whl/cu128 \
  --index-strategy unsafe-best-match
```
Why `--index-strategy unsafe-best-match`

FlashInfer publishes wheels to flashinfer.ai, but `flashinfer-python` also appears on the PyTorch index at older versions. uv's default first-match strategy stops at the first index that contains a package name, so it picks up the wrong version from the PyTorch index and fails to resolve. `--index-strategy unsafe-best-match` tells uv to consider all indexes and pick the best matching version.
On macOS, PyTorch ships standard wheels on PyPI, so no extra indexes are needed.
On Linux, the CPU-only PyTorch wheels (`+cpu` local version) are hosted on a separate PyTorch index.
Development use only
The CPU install does not support training or generation. Use it to validate configuration, explore the CLI, or import config classes in code. An A100 or larger GPU is required to run the full pipeline.
```shell
make container-build-gpu
docker run --gpus all --shm-size=1g \
  -v $(pwd):/workspace \
  -v ~/.cache/huggingface:/workspace/.hf_cache \
  -e HF_HOME=/workspace/.hf_cache \
  nss-gpu:latest run --config /workspace/config.yaml --data-source /workspace/data.csv
```
No local Python install needed. See Docker for full setup, volume mounts, and offline usage.
The bare package has no PyTorch or FlashInfer dependencies.
Limited use
The bare package includes only the Pydantic configuration models -- no training, generation, or CLI engine. Use it in the NeMo Safe Synthesizer Service or any Python project that needs to construct or validate `SafeSynthesizerParameters` without pulling in the full ML stack.
### Verify
After installing, activate your Python virtual environment and confirm the CLI is available by running `safe-synthesizer --help`:
Expected output:
```text
Usage: safe-synthesizer [OPTIONS] COMMAND [ARGS]...

  NeMo Safe Synthesizer command-line interface. This application is used to
  run the Safe Synthesizer pipeline. It can be used to train a model, generate
  synthetic data, and evaluate the synthetic data. It can also be used to
  modify a config file.

Options:
  --help  Show this message and exit.

Commands:
  artifacts  Artifacts management commands.
  config     Manage Safe Synthesizer configurations.
  run        Run the Safe Synthesizer end-to-end pipeline.
```
## Quick Start
Create a synthetic version of an input dataset in one step with the `safe-synthesizer run` command.
You can use Clinc OOS as an example dataset: export the split you want (for example the training split) to a CSV file, then point `--data-source` at that file.
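Exporting Clinc OOS would typically go through the Hugging Face `datasets` library. If you just want a file to smoke-test the pipeline, any CSV with enough rows works; here is a minimal pandas sketch that fabricates a 1,000-row stand-in (the column names `text` and `intent` are illustrative assumptions, not a schema the tool requires):

```python
import pandas as pd

# Fabricate a 1,000-row stand-in dataset -- the minimum recommended size.
# The columns "text" and "intent" are illustrative, not required by the tool.
rows = [
    {"text": f"transfer {i} dollars to savings", "intent": "transfer"}
    for i in range(1000)
]
pd.DataFrame(rows).to_csv("data.csv", index=False)
```

Then pass `--data-source data.csv` when you run the pipeline.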
Replace `data.csv` (or `clinc_oos.csv`) with your actual input file. Any `.csv`, `.json`, `.jsonl`, `.parquet`, or `.txt` file works -- see Running Safe Synthesizer -- Data Input for all supported formats.
Your dataset should have at least 1,000 records (10,000 records if using differential privacy).
Use `--log-format plain` or set `NSS_LOG_FORMAT=plain` for more readable log output when using a non-interactive terminal (for example CI or captured logs). See Log format for details.
The command above fine-tunes a LoRA adapter on your data, generates 1000 synthetic records, and produces an evaluation report. The default outputs are placed in `./safe-synthesizer-artifacts/<config>---<dataset>/<run_name>/`:

- `generate/synthetic_data.csv` -- the synthetic dataset
- `generate/evaluation_report.html` -- quality and privacy scores
- `generate/evaluation_metrics.json` -- machine-readable evaluation scores and timing
- `train/adapter/` -- trained adapter (reusable for more generation)
To run the same pipeline from Python, see Running Safe Synthesizer -- SDK.
→ Next step: read Evaluation to understand your first report and how to interpret SQS and DPS scores.
## How the Pipeline Works
The pipeline has five stages. Each is independently configurable -- you can run the full pipeline in one step, or execute stages individually (train once, generate many times). You can find a brief overview of each stage below, or read Pipeline for in-depth descriptions.
```mermaid
flowchart LR
    Data["Data Input"] --> PII["PII Replacement<br/>(on by default)"]
    PII --> Train["Training"]
    Train --> Gen["Generation"]
    Gen --> Eval["Evaluation"]
```
### 1. Data Input
The pipeline loads your input data (CSV, JSON, JSONL, Parquet, or DataFrame) and prepares it for training:
- Column type inference and validation
- Grouping and ordering (if configured via `data.group_training_examples_by` and `data.order_training_examples_by`)
- Train/test split -- a holdout set (default 5%) is reserved for evaluation
- Records are serialized to JSONL and tokenized; records that exceed the model's context window raise a `GenerationError` rather than being silently truncated
Your dataset should have at least 1,000 records (10,000 records if enabling differential privacy).
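The 5% holdout behaves like an ordinary random train/test split. A rough pandas equivalent (the pipeline's actual sampling details are internal and may differ):

```python
import pandas as pd

# Toy frame standing in for your dataset.
df = pd.DataFrame({"value": range(2000)})

# Reserve a 5% holdout for evaluation, mirroring the pipeline's default split.
holdout = df.sample(frac=0.05, random_state=42)
train = df.drop(holdout.index)

print(len(train), len(holdout))  # -> 1900 100
```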
### 2. PII Replacement
PII replacement is on by default as a pre-processing step. The PII replacer detects personally identifiable information (PII) using GLiNER NER and optional LLM-based column classification, then replaces detected entities with synthetic but realistic values prior to fine-tuning. This ensures the model never has the opportunity to learn the most sensitive information (e.g. names, addresses, identifiers) from the training data. See Supported Entity Types for the full entity list.
See Configuration -- Replacing PII for entity types, LLM classification setup, and SDK customization.
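The actual detector is GLiNER NER plus optional LLM column classification; as a toy illustration of the detect-then-replace-with-synthetic-values idea only (the regexes and replacement values below are made up and far weaker than the real detector):

```python
import re

# Made-up synthetic stand-in values; the real pipeline generates realistic fakes.
FAKE = {"EMAIL": "jane.doe@example.com", "PHONE": "555-0100"}
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\d{3}-\d{3}-\d{4}"),
}

def replace_pii(text: str) -> str:
    # Substitute each detected entity with its synthetic stand-in.
    for label, pattern in PATTERNS.items():
        text = pattern.sub(FAKE[label], text)
    return text

print(replace_pii("Call 212-555-1234 or mail bob@corp.com"))
```

Because replacement happens before fine-tuning, the model never sees the original values at all.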
### 3. Training
Fine-tunes a base LLM using LoRA (Low-Rank Adaptation). Two backends are available: Unsloth (default, faster) and HuggingFace (required for differential privacy). Both perform LoRA fine-tuning; see Running -- Training for a comparison.
The default model is `HuggingFaceTB/SmolLM3-3B`. Safe Synthesizer has tested support for `HuggingFaceTB/SmolLM3-3B`, `TinyLlama/TinyLlama-1.1B-Chat-v1.0`, and `mistralai/Mistral-7B-Instruct-v0.3` (see Configuration -- Training for details on how to change the backend or model).
Training requires a single NVIDIA GPU (A100 or larger); multi-GPU training is not supported.
Differential privacy
For formal privacy guarantees, enable Differentially Private Stochastic Gradient Descent (DP-SGD) when fine-tuning via `privacy.dp_enabled: true`. See Configuration -- Differential Privacy.
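Pulling together only the dotted keys mentioned on this page (`data.group_training_examples_by`, `data.order_training_examples_by`, `privacy.dp_enabled`), a config sketch might look like the following. The nesting is inferred from the dotted names and the column values are hypothetical, so check Configuration for the real schema:

```yaml
# Sketch only -- structure inferred from the dotted key names on this page.
data:
  group_training_examples_by: customer_id   # hypothetical column name
  order_training_examples_by: event_time    # hypothetical column name
privacy:
  dp_enabled: true   # requires the HuggingFace training backend
```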
### 4. Generation
Produces synthetic records using the trained LoRA adapter via vLLM. The generation stage samples from the fine-tuned model until the requested number of valid records is reached, with configurable stopping conditions for quality control.
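The sample-until-enough-valid-records loop can be pictured with this toy sketch; `sample_batch` and `is_valid` are stand-ins, not the real vLLM-backed API:

```python
import random

def sample_batch(n):
    # Stand-in for sampling n records from the fine-tuned model via vLLM.
    return [{"value": random.random()} for _ in range(n)]

def is_valid(record):
    # Stand-in validity check; the real pipeline validates parsed records.
    return record["value"] >= 0.2

def generate(num_records, batch_size=64, max_batches=1000):
    valid = []
    # Keep sampling until the requested number of valid records is reached,
    # with a stopping condition (max_batches) as a quality-control backstop.
    for _ in range(max_batches):
        valid.extend(r for r in sample_batch(batch_size) if is_valid(r))
        if len(valid) >= num_records:
            return valid[:num_records]
    raise RuntimeError("stopping condition hit before enough valid records")

records = generate(100)
print(len(records))
```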
See Configuration -- Generation.
### 5. Evaluation
Measures quality and privacy of the synthetic data and produces an HTML report with interactive visualizations. Two composite scores are reported:
- SQS (Synthetic Quality Score) -- composite quality score with five subscores:
    - Column Correlation Stability -- measures how well correlations between every pair of numeric and categorical columns are preserved
    - Deep Structure Stability -- compares numeric and categorical columns in the training and synthetic data using Principal Component Analysis (PCA)
    - Column Distribution Stability -- measures how closely the distribution of each numeric and categorical column is preserved
    - Text Structure Similarity -- compares the sentence, word, and character counts of text columns
    - Text Semantic Similarity -- measures whether the semantic meaning in text columns held after synthesizing
- DPS (Data Privacy Score) -- composite privacy score with three subscores:
    - Membership Inference Protection -- measures whether a model trained on the data can distinguish training records from held-out records
    - Attribute Inference Protection -- measures whether an attacker can infer a sensitive attribute from quasi-identifiers in the synthetic data
    - PII Replay Detection -- checks whether PII from training appears in synthetic data
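For rough intuition on the distribution-stability idea (this is not the product's actual metric), you can compare per-column histograms of training vs. synthetic data with a total-variation distance:

```python
import numpy as np

def tv_distance(a, b, bins=10):
    # Histogram both columns on a shared range and compare normalized
    # bin masses; 0 = identical distributions, 1 = completely disjoint.
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    pa, _ = np.histogram(a, bins=bins, range=(lo, hi))
    pb, _ = np.histogram(b, bins=bins, range=(lo, hi))
    pa = pa / pa.sum()
    pb = pb / pb.sum()
    return 0.5 * np.abs(pa - pb).sum()

rng = np.random.default_rng(0)
train = rng.normal(0, 1, 5000)
close = rng.normal(0, 1, 5000)  # synthetic data matching the training distribution
far = rng.normal(3, 1, 5000)    # synthetic data that drifted

print(tv_distance(train, close), tv_distance(train, far))
```

A faithful synthesizer keeps the per-column distance small; drift shows up as a distance close to 1.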
See Evaluation for how to interpret scores.
## What to Read Next
After your first run, read Evaluation to understand your SQS and DPS scores and Synthetic Data Quality to learn how to improve generation quality. If your job failed to run, read Troubleshooting to learn how to fix common errors.
### Guides
- **Configuration** -- Synthesis parameters for training, generation, PII, DP, evaluation, and time series.
- **Running Safe Synthesizer** -- How to run the pipeline, CLI commands, individual stages, logging, and artifacts.
- **Environment Variables** -- Artifact paths, logging, model caching, NIM endpoints, and WandB.
- **Troubleshooting** -- Common errors, OOM fixes, offline setup, and configuration gotchas.
- **Synthetic Data Quality** -- Improving generation quality, the quality-privacy tradeoff, and unavailable metrics.