Running Safe Synthesizer¶

Full reference for pipeline execution. For a quick first run, see Getting Started. For parameter tables, see Configuration Reference. For environment variables, see Environment Variables.

Configuration Interfaces¶

NeMo Safe Synthesizer has two ways to run the pipeline and four-and-a-half ways to configure it.

Two ways to run:

safe-synthesizer CLI -- the command-line application
Python SDK -- the SafeSynthesizer builder, for use in scripts, notebooks, and services

Four-and-a-half ways to configure:

YAML config file -- a portable, versionable snapshot of parameters; passed to the CLI via --config or loaded in the SDK with SafeSynthesizerParameters.from_yaml()
CLI flags -- --generation__num_records 10000, --privacy__dp_enabled true; override the YAML file when both are provided
Python SDK builder calls -- .with_generate(num_records=10000), .with_differential_privacy(dp_enabled=True); override the YAML file when both are used
Dataset registry -- a YAML file (passed via --dataset-registry) that defines named datasets and their parameter overrides so you can refer to them by name in the CLI
Environment variables (the half) -- control infrastructure only: artifact paths, logging, model cache locations, WandB mode. They do not set synthesis parameters like learning rate or record count

The asymmetry matters: YAML and environment variables are configuration only -- they don't invoke the pipeline. CLI and SDK are run and configure -- they set parameters and execute.

All configuration surfaces share the same underlying Pydantic parameter models defined in src/nemo_safe_synthesizer/config/. The __ syntax used in CLI flags (e.g. --privacy__dp_enabled true) mirrors the nested structure of those models: privacy is the config section, dp_enabled is the field. Setting a parameter via YAML, CLI flag, or SDK call resolves to the same field in the same model.

Exactly what avenues of configuration are available, and thus how precedence is resolved, depends on how you run the pipeline. Settings are resolved in this order, from highest (first) to lowest priority (last):

CLI: CLI flags > dataset registry > YAML config file > model defaults
SDK: Python SDK builder calls > YAML config file > model defaults

See Configuration Precedence for details.

The same run, three ways -- 10,000 records with DP-SGD:

CLISDKConfig reference

safe-synthesizer run \
  --data-source data.csv \
  --generation__num_records 10000 \
  --privacy__dp_enabled true \
  --privacy__epsilon 8.0

from nemo_safe_synthesizer.sdk.library_builder import SafeSynthesizer

synthesizer = (
    SafeSynthesizer()
    .with_data_source("data.csv")
    .with_generate(num_records=10000)
    .with_differential_privacy(dp_enabled=True, epsilon=8.0)
)
synthesizer.run()

# config.yaml
generation:
  num_records: 10000
privacy:
  dp_enabled: true
  epsilon: 8.0

safe-synthesizer run --config config.yaml --data-source data.csv

Running the Pipeline¶

The pipeline runs five stages in sequence. PII replacement is on by default as a pre-processing step; disable it with --no-replace-pii (CLI) or .with_replace_pii(enable=False) (SDK).

flowchart LR
    data[Data Input] --> pii["PII Replacement<br/>(on by default)"]
    pii --> train["Training<br/>LoRA fine-tune"]
    train --> gen["Generation<br/>vLLM sampling"]
    gen --> eval["Evaluation<br/>SQS + DPS report"]

Run the full end-to-end pipeline in one step:

CLISDK

safe-synthesizer run \
  --config config.yaml \
  --data-source data.csv \
  --artifact-path ./artifacts

from nemo_safe_synthesizer.sdk.library_builder import SafeSynthesizer
synthesizer = SafeSynthesizer().with_data_source("data.csv")
synthesizer.run()

results = synthesizer.results

You can also run stages individually:

safe-synthesizer run train -- train only, saves the adapter
safe-synthesizer run generate -- generate only (use --auto-discover-adapter or --run-path)
SDK stepwise: process_data() → train() → generate() → evaluate()

Using YAML Config Files¶

A config.yaml file is optional for the CLI and SDK. When omitted, model defaults apply. When provided, CLI flags and SDK builder calls override the values from the file.

CLI¶

Pass --config to load a base config, then override individual fields with --key__subkey value syntax:

# All defaults, no config file
safe-synthesizer run --data-source data.csv

# Config file as base, override two fields
safe-synthesizer run \
  --config config.yaml \
  --data-source data.csv \
  --training__learning_rate 0.001 \
  --generation__num_records 2000

SDK¶

Pass a SafeSynthesizerParameters loaded from YAML as the seed, then use with_* calls to override specific sections:

from nemo_safe_synthesizer.sdk.library_builder import SafeSynthesizer
from nemo_safe_synthesizer.config import SafeSynthesizerParameters

# Load base config from file, override generation settings
config = SafeSynthesizerParameters.from_yaml("config.yaml")
synthesizer = (
    SafeSynthesizer(config)
    .with_data_source("data.csv")
    .with_generate(num_records=2000, temperature=0.8)
)
synthesizer.run()

with_* keyword arguments take precedence over whatever is in the YAML file. Sections not mentioned in the builder call retain their values from config.

See Configuration Reference -- CLI Override Syntax for the full override precedence rules.

CLI Commands¶

safe-synthesizer --help

`run` -- Execute the Pipeline¶

Without a subcommand, run executes the full end-to-end pipeline (data processing, PII replacement, training, generation, evaluation). PII replacement is on by default.

safe-synthesizer run --config config.yaml --data-source data.csv

Common Options¶

These options apply to run and run generate. Only --data-source is required; all others have defaults or are optional.

Option	Env var	Default	Description
`--config`	`NSS_CONFIG`	(model defaults)	Path to YAML config file; omit to use all model defaults
`--data-source`	--	(required)	Dataset path, URL, or name from `--dataset-registry`
`--artifact-path`	`NSS_ARTIFACTS_PATH`	`./safe-synthesizer-artifacts`	Base directory for all runs
`--run-path`	--	--	Explicit run directory (for `run generate`, must point to an existing trained run)
`--output-file`	--	--	Path to output CSV file
`--log-format`	`NSS_LOG_FORMAT`	`plain` (TTY) / `json` (non-TTY)	Console log format -- auto-detected from TTY; accepts `plain` or `json`
`--log-file`	`NSS_LOG_FILE`	--	Log file path (defaults to run directory)
`--log-color` / `--no-log-color`	`NSS_LOG_COLOR`	auto	Colorize console output (auto-detected from TTY)
`--wandb-mode`	`NSS_WANDB_MODE`	`disabled`	WandB mode (`online`, `offline`, `disabled`)
`--wandb-project`	`NSS_WANDB_PROJECT`	--	WandB project name
`--dataset-registry`	`NSS_DATASET_REGISTRY`	--	Dataset registry YAML path/URL
`-v` / `-vv`	--	--	Verbose logging (`-v` debug, `-vv` debug + dependencies)

Synthesis Parameter Overrides¶

Any synthesis parameter can be overridden on the command line using --section__field syntax (e.g., --training__learning_rate 0.001). See Configuration Reference -- CLI Override Syntax for the full syntax, examples, and precedence rules.

`run train`¶

Train only -- saves the adapter without generating or evaluating.

safe-synthesizer run train --config config.yaml --data-source data.csv

Accepts the same common options and synthesis parameter overrides as run.

`run generate`¶

Generate only -- requires a previously trained adapter.

safe-synthesizer run generate \
  --config config.yaml \
  --data-source data.csv \
  --auto-discover-adapter

# Or specify an explicit run path
safe-synthesizer run generate \
  --config config.yaml \
  --data-source data.csv \
  --run-path ./safe-synthesizer-artifacts/myconfig---mydata/2026-01-15T12:00:00

Option	Description
`--auto-discover-adapter`	Find the latest trained adapter in the artifact directory
`--run-path`	Explicit path to a previous run's output directory
`--wandb-resume-job-id`	WandB run ID to resume (or path to file containing the ID)

Accepts the same common options and synthesis parameter overrides as run.

`artifacts clean`¶

Delete artifacts from a previous run:

safe-synthesizer artifacts clean --artifact-path ./safe-synthesizer-artifacts
safe-synthesizer artifacts clean --caches-only   # training cache only
safe-synthesizer artifacts clean --dry-run        # preview what would be deleted

Option	Description
`--artifact-path`	Path to artifact directory (defaults to `./safe-synthesizer-artifacts`)
`--dry-run`	Preview deletions without actually deleting
`--caches-only`	Only delete the training cache, keep everything else
`--force`	Skip confirmation prompt

Data Input¶

Provide your dataset as a file path, URL, DataFrame (SDK), or dataset registry name.

Data source options:

CLI / dataset registry: --data-source data.csv -- supports .csv, .json, .jsonl, .parquet, .txt
URL: --data-source https://example.com/data.csv
DataFrame (SDK): .with_data_source(df) -- supports any format you can load into pandas
CSV path (SDK): .with_data_source("data.csv") -- loaded via pd.read_csv; for non-CSV formats, load into a DataFrame first
Dataset registry name: --data-source my_dataset (with --dataset-registry registry.yaml)

Grouping and Ordering¶

Use data.group_training_examples_by to group records by a column (e.g., customer ID) so related rows are trained together. Use data.order_training_examples_by to sort within groups (requires group_by).

When to use grouped mode

Grouping is recommended when there is a natural ordering within each group -- i.e., data.order_training_examples_by points to a valid ordering field such as a date or sequence number. If your data has no meaningful intra-group order, tabular mode with shuffled records is usually sufficient.

CLISDKConfig reference

safe-synthesizer run \
  --data__group_training_examples_by customer_id \
  --data__order_training_examples_by transaction_date \
  --data-source transactions.csv

from nemo_safe_synthesizer.sdk.library_builder import SafeSynthesizer

synthesizer = (
    SafeSynthesizer()
    .with_data_source("transactions.csv")
    .with_data(
        group_training_examples_by="customer_id",
        order_training_examples_by="transaction_date",
    )
)

data:
  group_training_examples_by: "customer_id"
  order_training_examples_by: "transaction_date"

What the model sees

With grouping enabled, each training example is tokenized as:

[schema prompt] <BOS> group1-record1
group1-record2 <EOS> <BOS> group2-record1
group2-record2 <EOS>

Here <BOS> and <EOS> represent the model's begin-of-sequence and end-of-sequence tokens; the exact strings are taken from the selected model's metadata and may differ across model families.

data.max_sequences_per_example controls how many groups are packed into a single example (default: "auto", which resolves to 10 without DP). Fewer groups per example means more training examples overall. See Example Generation for a full walkthrough.

Dataset Registry¶

Define named datasets in a YAML file to reference them by name:

base_url: "/data/datasets"
datasets:
  - name: "customer_transactions"
    url: "customers/transactions.csv"
    overrides:
      data:
        group_training_examples_by: "customer_id"

safe-synthesizer run --dataset-registry registry.yaml --data-source customer_transactions

See Configuration Reference -- Data for the full parameter table.

PII Replacement¶

Enabled-by-default stage that runs before training. Detection works in two independent steps: GLiNER NER is used on columns detected as free text for named-entity patterns (names, emails, phone numbers, etc.) and replaces matches with synthetic placeholders. An optional second step uses an LLM to identify columns that are exclusively a single entity type (e.g., a column that is always SSNs), marking those columns for wholesale replacement before training. The two steps are independent -- NER runs on free-text content, LLM classification targets structured sensitive columns. PII replacement is on by default in both the CLI and SDK. PII on by default means no config flag is needed to enable it.

Skip PII replacement

If your dataset does not contain PII, you may disable this stage to reduce pipeline runtime:

CLI: --no-replace-pii
SDK: .with_replace_pii(enable=False)

CLISDKConfig reference

Default (PII on, no config needed):

safe-synthesizer run --data-source data.csv

Customize (e.g. enable LLM classification and restrict entity types): put the replace_pii block in a YAML file and pass it with --config. List-typed fields like entities cannot be set via CLI flags; use the config file (see Config reference tab) or SDK.

safe-synthesizer run --config pii_config.yaml --url data.csv

To override only non-list PII settings from the CLI, use the __ syntax, e.g. --replace_pii__globals__classify__enable_classify true.

PII replacement is on by default -- no with_replace_pii() call is needed for the standard case. Call it only to customize the config or to disable:

from nemo_safe_synthesizer.sdk.library_builder import SafeSynthesizer
from nemo_safe_synthesizer.config.replace_pii import PiiReplacerConfig

# Default: PII on, no call needed
synthesizer = SafeSynthesizer().with_data_source("data.csv").with_train()

# Customize: enable LLM classification for specific entity types
pii_config = PiiReplacerConfig.get_default_config()
pii_config.globals.classify.enable_classify = True
pii_config.globals.classify.entities = ["email", "phone_number", "ssn"]

synthesizer = (
    SafeSynthesizer()
    .with_data_source("data.csv")
    .with_replace_pii(config=pii_config)
    .with_train()
    .with_generate(num_records=5000)
)

The SDK builder merges partial overrides with PiiReplacerConfig.get_default_config(), so you don't need to provide the full steps list.

replace_pii:
  globals:
    classify:
      enable_classify: true
      entities: ["email", "phone_number", "ssn"]
  steps:
    - rows:
        update:
          - condition: column.entity == "email" and not (this | isna)
            value: column.entity | fake
          - condition: column.entity == "phone_number" and not (this | isna)
            value: column.entity | fake
          - condition: column.entity == "ssn" and not (this | isna)
            value: column.entity | fake

steps is required and has no default. The snippet above shows a minimal single-step config. For the full default ruleset (50+ entity types), use PiiReplacerConfig.get_default_config() in the SDK and export it to YAML:

from nemo_safe_synthesizer.config.replace_pii import PiiReplacerConfig
PiiReplacerConfig.get_default_config().to_yaml("pii_config.yaml")

LLM Column Classification¶

To enable LLM-based PII column classification (optional), set the API key before running the pipeline. The endpoint defaults to https://integrate.api.nvidia.com/v1; override NSS_INFERENCE_ENDPOINT for a custom OpenAI-compatible endpoint.

When using the CLI, set both for column classification:

export NSS_INFERENCE_ENDPOINT="https://integrate.api.nvidia.com/v1"  # optional; this is the default
export NSS_INFERENCE_KEY="your-api-key"  # pragma: allowlist secret  (required for column classification with the inference endpoint)

PII column classification requires NSS_INFERENCE_KEY (and optionally NSS_INFERENCE_ENDPOINT if not using the default). When NSS_INFERENCE_KEY is unset, the classification step is attempted but falls back to NER-only detection (with an error log). No environment variables are required for NER-only PII replacement.

See Configuration Reference -- Replacing PII for the full parameter reference.

Training¶

Fine-tunes a pretrained LLM on your data using LoRA (Low-Rank Adaptation). LoRA inserts a small set of trainable adapter weights into the frozen pretrained model. Only the adapter is updated during training, which keeps VRAM requirements low and produces a compact artifact that can be reused for generation without re-training.

Two backends are available:

Backend	Description	When to use
Unsloth	LoRA fine-tuning with optimized kernels for faster training and lower VRAM usage. Uses Unsloth's `FastLanguageModel` for model loading and PEFT wrapping	Default -- use unless you need DP or a custom quantization setup
HuggingFace	LoRA fine-tuning via PEFT with 4-bit/8-bit quantization support and optional differential privacy (DP-SGD) via Opacus	Required for differential privacy; also the fallback when Unsloth is unavailable

If you enable differential privacy, the pipeline automatically switches to the HuggingFace backend.

Three models have been extensively tested:

Family	HuggingFace ID
SmolLM3 (default)	`HuggingFaceTB/SmolLM3-3B`
Mistral	`mistralai/Mistral-7B-Instruct-v0.3`
TinyLlama	`TinyLlama/TinyLlama-1.1B-Chat-v1.0`

We recommend you start with the default, HuggingFaceTB/SmolLM3-3B. However, depending on your use case, you may find a different model to be a better fit.

Based on testing, some trade-offs identified compared to SmolLM3 on average:

TinyLlama runs ~17% faster, while Mistral takes ~2x as long to run.
Mistral has ~6% increase in valid record fraction, while TinyLlama has ~7% decrease.
Mistral has ~5% higher job completion rate and TinyLlama has ~3% higher.
Mistral is comparable to SmolLM3 in Data Privacy Score, while TinyLlama has ~0.1 point decrease.
All 3 have comparable Synthetic Quality Scores.

CLISDKConfig reference

safe-synthesizer run \
  --training__learning_rate 0.001 \
  --training__batch_size 4 \
  --data-source data.csv

from nemo_safe_synthesizer.sdk.library_builder import SafeSynthesizer

synthesizer = (
    SafeSynthesizer()
    .with_data_source("data.csv")
    .with_train(learning_rate=0.001, batch_size=4)
)

training:
  pretrained_model: "HuggingFaceTB/SmolLM3-3B"
  learning_rate: 0.001
  batch_size: 4

Quantization¶

Enabling quantization reduces VRAM consumption at the cost of some numerical precision. Set training.quantize_model to true and choose a bit width with training.quantization_bits.

Setting	VRAM	Precision	Speed	Notes
No quantization	Highest	Full	Baseline	Use when VRAM is not a constraint
8-bit	~50% reduction	Near-full	Slightly slower	Good balance for most cases
4-bit	~75% reduction	Reduced	Faster	Use when VRAM is tight; may affect output quality

CLISDKConfig reference

safe-synthesizer run \
  --training__quantize_model true \
  --training__quantization_bits 4 \
  --data-source data.csv

from nemo_safe_synthesizer.sdk.library_builder import SafeSynthesizer

synthesizer = (
    SafeSynthesizer()
    .with_data_source("data.csv")
    .with_train(quantize_model=True, quantization_bits=4)
)

training:
  quantize_model: true
  quantization_bits: 4

Attention Backends¶

training.attn_implementation controls which attention kernel is used when loading the model. The default uses Flash Attention 3 via the HuggingFace Kernels Hub and falls back to sdpa when the kernels package is not installed.

Common values:

kernels-community/vllm-flash-attn3: Flash Attention 3 (default, requires kernels package)
flash_attention_2: Flash Attention 2 (requires flash-attn package)
sdpa: PyTorch scaled dot-product attention -- broadest compatibility
eager: standard PyTorch attention -- useful for debugging

Training vs generation attention backends

The training attention backend (training.attn_implementation) and the generation attention backend (generation.attention_backend / VLLM_ATTENTION_BACKEND) are independent settings.

Differential Privacy¶

Differential privacy (DP) provides a formal bound on what an adversary can learn about any individual record. Safe Synthesizer implements Differentially Private Stochastic Gradient Descent (DP-SGD) via Opacus.

CLISDKConfig reference

safe-synthesizer run \
  --privacy__dp_enabled true \
  --privacy__epsilon 8.0 \
  --data-source data.csv

from nemo_safe_synthesizer.sdk.library_builder import SafeSynthesizer

synthesizer = (
    SafeSynthesizer()
    .with_data_source("data.csv")
    .with_differential_privacy(dp_enabled=True, epsilon=8.0)
)

privacy:
  dp_enabled: true
  epsilon: 8.0

Compatibility constraints when DP is enabled:

Set training.use_unsloth to false or leave it at "auto" -- "auto" resolves to false when DP is enabled
data.max_sequences_per_example must be 1 (or "auto", which resolves to 1 when DP is enabled)
Gradient checkpointing is disabled (incompatible with Opacus)

DP training trade-offs

DP training is slower and typically requires more epochs to reach the same loss as non-DP training. Start with epsilon: 8.0 -- a common, practical threshold -- and lower it only if your privacy requirements demand it. Very low epsilon values (e.g., below 1.0) significantly degrade model utility.

See Configuration Reference -- Differential Privacy for the full parameter table.

Generation¶

Produces synthetic records using the trained LoRA adapter via vLLM. The generation stage runs a sampling loop: the model generates batches of records, each record is validated against the original dataset schema (correct columns, correct types, no malformed values), and valid records accumulate until num_records is reached. If too many consecutive batches produce mostly invalid records, the loop stops early.

flowchart TD
    start([Start]) --> batch[Generate batch]
    batch --> validate["Validate records\nagainst schema"]
    validate --> accum[Accumulate\nvalid records]
    accum --> enough{Enough\nrecords?}
    enough -- No --> patience{Patience\nexceeded?}
    patience -- No --> batch
    patience -- Yes --> stop([Stop early])
    enough -- Yes --> done([Done])

CLISDKConfig reference

safe-synthesizer run \
  --generation__num_records 5000 \
  --generation__temperature 0.7 \
  --data-source data.csv

from nemo_safe_synthesizer.sdk.library_builder import SafeSynthesizer

synthesizer = (
    SafeSynthesizer()
    .with_data_source("data.csv")
    .with_generate(num_records=5000, temperature=0.7)
)

generation:
  num_records: 5000
  temperature: 0.7

Structured Generation¶

Set generation.use_structured_generation to true to constrain the model's output so every record matches the dataset schema. This reduces the fraction of invalid records, typically at the cost of reducing the quality of the generated records. Use it when the pipeline struggles to produce valid records.

CLISDKConfig reference

safe-synthesizer run \
  --generation__use_structured_generation true \
  --data-source data.csv

from nemo_safe_synthesizer.sdk.library_builder import SafeSynthesizer

synthesizer = (
    SafeSynthesizer()
    .with_data_source("data.csv")
    .with_generate(use_structured_generation=True)
)

generation:
  use_structured_generation: true
  structured_generation_schema_method: "regex"

"regex": constructs a custom regex from the dataset schema. More comprehensive but slower.
"json_schema": passes a JSON Schema to the backend. Faster, but may miss edge cases.

Stopping Conditions¶

Generation stops early when too many consecutive batches produce mostly invalid records. generation.patience controls how many bad batches to tolerate; generation.invalid_fraction_threshold defines what counts as "bad." If the pipeline stops early, check the generation logs for the invalid record fraction per batch.

CLISDKConfig reference

safe-synthesizer run \
  --generation__patience 5 \
  --generation__invalid_fraction_threshold 0.6 \
  --data-source data.csv

from nemo_safe_synthesizer.sdk.library_builder import SafeSynthesizer

synthesizer = (
    SafeSynthesizer()
    .with_data_source("data.csv")
    .with_generate(patience=5, invalid_fraction_threshold=0.6)
)

generation:
  patience: 5
  invalid_fraction_threshold: 0.6

Early stopping

If the pipeline stops early due to patience, try enabling use_structured_generation: true to constrain outputs to the dataset schema, or lower temperature to reduce the chance of malformed records.

See Configuration Reference -- Generation for the full parameter table.

Evaluation¶

Measures quality and privacy of synthetic data and produces an HTML report with interactive visualizations. Scores are from 0-10, and higher is better. Two composite scores are reported:

SQS (Synthetic Quality Score) -- composite quality score with five subscores:
- Column Correlation Stability -- measures the correlation across every combination of two numeric and categorical columns
- Deep Structure Stability -- compares numeric and categorical columns in the training and synthetic data using Principal Component Analysis (PCA)
- Column Distribution Stability -- measures the distribution of each numeric and categorical column
- Text Structure Similarity -- measures the sentence, word, and character counts for text columns
- Text Semantic Similarity -- measures whether the semantic meaning in text columns held after synthesizing
DPS (Data Privacy Score) -- composite privacy score with three subscores:
- Membership Inference Protection -- measures whether a model trained on the data can distinguish training records from held-out records
- Attribute Inference Protection -- measures whether an attacker can infer a sensitive attribute from quasi-identifiers in the synthetic data
- PII Replay Detection -- checks whether PII from training appears in synthetic data

See Evaluation for details on score interpretation.

CLISDKConfig reference

safe-synthesizer run \
  --evaluation__mia_enabled false \
  --evaluation__aia_enabled false \
  --data-source data.csv

from nemo_safe_synthesizer.sdk.library_builder import SafeSynthesizer

synthesizer = (
    SafeSynthesizer()
    .with_data_source("data.csv")
    .with_evaluate(mia_enabled=False, aia_enabled=False)
)

evaluation:
  mia_enabled: true
  aia_enabled: true
  pii_replay_enabled: true

Disable Evaluation¶

To skip evaluation entirely (e.g., for faster iteration during development):

CLISDKConfig reference

safe-synthesizer run \
  --evaluation__enabled false \
  --data-source data.csv

from nemo_safe_synthesizer.sdk.library_builder import SafeSynthesizer

synthesizer = (
    SafeSynthesizer()
    .with_data_source("data.csv")
    .with_evaluate(enabled=False)
)

evaluation:
  enabled: false

See Configuration Reference -- Evaluation for the full parameter table.

Time Series Mode¶

Experimental

Time series synthesis is an experimental feature. APIs and behavior may change between releases.

Enable time series mode by setting time_series.is_timeseries: true and providing timestamp configuration. Use data.group_training_examples_by to group records by entity (e.g., sensor ID) and data.order_training_examples_by to sort within groups.

CLISDKConfig reference

safe-synthesizer run \
  --time_series__is_timeseries true \
  --time_series__timestamp_column timestamp \
  --time_series__timestamp_interval_seconds 60 \
  --data__group_training_examples_by sensor_id \
  --data-source sensor_data.csv

from nemo_safe_synthesizer.sdk.library_builder import SafeSynthesizer

synthesizer = (
    SafeSynthesizer()
    .with_data_source("sensor_data.csv")
    .with_time_series(
        is_timeseries=True,
        timestamp_column="timestamp",
        timestamp_interval_seconds=60,
    )
    .with_data(
        group_training_examples_by="sensor_id",
        order_training_examples_by="timestamp",
    )
)

time_series:
  is_timeseries: true
  timestamp_column: "timestamp"
  timestamp_interval_seconds: 60
data:
  group_training_examples_by: "sensor_id"
  order_training_examples_by: "timestamp"

See Configuration Reference -- Time Series for the full parameter table. See Troubleshooting -- Time Series for common issues.

How time-series examples are assembled

Each training example contains records from a single group in chronological order. The model learns to continue a sequence -- not to produce independent records. See Example Generation for assembly details.

Run Individual Stages¶

Train only¶

CLISDK

safe-synthesizer run train --config config.yaml --data-source data.csv

from nemo_safe_synthesizer.sdk.library_builder import SafeSynthesizer

synthesizer = SafeSynthesizer().with_data_source("data.csv")
synthesizer.process_data()
synthesizer.train()

Generate only¶

Use --auto-discover-adapter to find the latest trained adapter, or --run-path for an explicit location. See run generate in the CLI Commands section for all options.

CLISDK

safe-synthesizer run generate \
  --config config.yaml \
  --data-source data.csv \
  --auto-discover-adapter

from pathlib import Path
from nemo_safe_synthesizer.sdk.library_builder import SafeSynthesizer
from nemo_safe_synthesizer.config import SafeSynthesizerParameters
from nemo_safe_synthesizer.cli.artifact_structure import Workdir

config = SafeSynthesizerParameters.from_yaml("config.yaml")
workdir = Workdir.from_path(
    Path("./safe-synthesizer-artifacts/myconfig---mydata/2026-01-15T12:00:00")
)
synthesizer = SafeSynthesizer(config, workdir=workdir)
synthesizer.load_from_save_path()
synthesizer.process_data()
synthesizer.generate()
synthesizer.evaluate()
synthesizer.save_results()

Stepwise execution (SDK)¶

For full control, call each stage individually:

synthesizer = (
    SafeSynthesizer()
    .with_data_source(df)
    .process_data()
    .train()
    .generate()
    .evaluate()
)

results = synthesizer.results
synthesizer.save_results()

Artifacts and Output¶

Each run writes to a directory named <config-stem>---<dataset-stem>/<run_name> under the artifact path. The config and dataset stems are derived from the filenames you pass to --config and --data-source, making it easy to identify runs at a glance. <run_name> defaults to an ISO 8601 timestamp (e.g., 2026-01-15T12:00:00).

To use an explicit output directory (skipping the auto-generated <config>---<dataset>/<run_name> structure), pass --run-path:

safe-synthesizer run --config config.yaml --data-source data.csv --run-path ./my-run

<artifact-path>/<config>---<dataset>/<run_name>/
├── train/
│   ├── safe-synthesizer-config.json
│   └── adapter/                     # trained PEFT adapter
├── generate/
│   ├── logs.jsonl                   # generate-only workflow
│   ├── info.json                    # generate-only workflow
│   ├── synthetic_data.csv
│   ├── evaluation_report.html
│   └── evaluation_metrics.json      # machine-readable metrics
├── dataset/
│   ├── training.csv
│   ├── test.csv
│   ├── validation.csv               # when training.validation_ratio > 0
│   └── transformed_training.csv     # when PII replacement transforms the data
└── logs/
    └── <phase>.jsonl                # e.g. end_to_end.jsonl or train.jsonl

Key outputs:

generate/synthetic_data.csv: the synthetic dataset
generate/evaluation_report.html: quality and privacy report
generate/evaluation_metrics.json: machine-readable evaluation scores and timing
train/adapter/: LoRA weights for resuming generation
train/safe-synthesizer-config.json: resolved config snapshot

Clean up artifacts

Adapter weights and training caches can consume significant disk space during iterative development. Run safe-synthesizer artifacts clean to remove them when no longer needed. Use --caches-only to keep the adapter but reclaim training cache space.

SDK Results Access¶

run() automatically saves synthetic_data.csv, evaluation_report.html, and evaluation_metrics.json to the artifacts directory unless an output_file override is provided. For stepwise execution, call save_results() explicitly after evaluate().

results = synthesizer.results
df = results.synthetic_data
summary = results.summary
# synthesizer.save_results()  # only needed for stepwise execution; run() saves automatically

Cleaning Up¶

See artifacts clean in the CLI Commands section for options.

Running in Offline Environments¶

Pre-cache models by running once with internet access, then set HF_HUB_OFFLINE=1 in your target environment. For detailed cache setup and environment variables (HF_HOME, HF_HUB_OFFLINE, LOCAL_FILES_ONLY, VLLM_CACHE_ROOT), see Environment Variables -- Hugging Face Cache.

For offline-specific errors, see Program Runtime.

Logging and Experiment Tracking¶

Log Format¶

Method	Setting
CLI	`--log-format json` or `--log-format plain`
Environment	`NSS_LOG_FORMAT=json`

The format auto-detects from the terminal: plain when stdout is a TTY, json otherwise.

PlainJSON

Human-readable columns separated by |. Used by default in interactive terminals.

2026-01-15T12:03:42.001 | Nemo Safe Synthesizer | user    | info  | training.py:TrainingBackend.train:87
Training complete

2026-01-15T12:03:42.105 | Nemo Safe Synthesizer | runtime | info  | generation.py:VllmBackend._generate:214
Batch complete: {'valid': 48, 'invalid': 2}

One JSON object per line. Used by default in non-TTY environments (CI, containers, log aggregators).

{"timestamp": "2026-01-15T12:03:42.001000Z", "level": "info", "filename": "training.py", "lineno": 87, "category": "user", "message": "Training complete"}
{"timestamp": "2026-01-15T12:03:42.105000Z", "level": "info", "filename": "generation.py", "lineno": 214, "category": "runtime", "message": "Batch complete", "valid": 48, "invalid": 2}

Log categories in both formats:

user -- user-relevant progress and results (training complete, generation done)
runtime -- internal operational details (memory, timings, batch stats)
system -- system-level events (startup, config loaded)
backend -- logs from dependencies (vLLM, HuggingFace, etc.)

Verbosity: -v for debug, -vv for debug + dependencies.

WandB Integration¶

WandB is configured via CLI flags or environment variables -- not in the YAML config file.

CLISDKEnvironment variables

safe-synthesizer run \
  --config config.yaml \
  --data-source data.csv \
  --wandb-mode online \
  --wandb-project my-experiments

import os
import wandb

os.environ["WANDB_API_KEY"] = "your-api-key"  # pragma: allowlist secret
wandb.init(project="my-experiments", mode="online")

synthesizer = SafeSynthesizer().with_data_source("data.csv")
synthesizer.run()

Unlike the CLI, the SDK does not auto-initialize WandB. You must call wandb.init(...) before synthesizer.run().

export WANDB_API_KEY="your-api-key"  # pragma: allowlist secret
export WANDB_PROJECT="my-experiments"
export NSS_WANDB_MODE="online"

These environment variables are read by the CLI only. SDK users must call wandb.init(...) explicitly.

For parameter precedence (CLI flags vs environment variables vs YAML), see Environment Variables -- Precedence.

Running Safe Synthesizer¶

Configuration Interfaces¶

Running the Pipeline¶

Using YAML Config Files¶

CLI¶

SDK¶

CLI Commands¶

run -- Execute the Pipeline¶

Common Options¶

Synthesis Parameter Overrides¶

run train¶

run generate¶

artifacts clean¶

Data Input¶

Grouping and Ordering¶

Dataset Registry¶

PII Replacement¶

LLM Column Classification¶

Training¶

Quantization¶

Attention Backends¶

Differential Privacy¶

Generation¶

Structured Generation¶

Stopping Conditions¶

Evaluation¶

Disable Evaluation¶

Time Series Mode¶

Run Individual Stages¶

Train only¶

Generate only¶

Stepwise execution (SDK)¶

Artifacts and Output¶

SDK Results Access¶

Cleaning Up¶

Running in Offline Environments¶

Logging and Experiment Tracking¶

Log Format¶

WandB Integration¶

`run` -- Execute the Pipeline¶

`run train`¶

`run generate`¶

`artifacts clean`¶