Configuration Reference

Parameter tables for all NeMo Safe Synthesizer configuration sections. For stage-by-stage usage with examples, see Running Safe Synthesizer. For environment variables, see Environment Variables.


Configuration Precedence

The configuration layers available to you, and therefore the precedence order, depend on how you run the pipeline. Settings resolve in the following order, from highest to lowest priority:

  • CLI: CLI flags > dataset registry overrides > YAML config file > model defaults
  • SDK: SDK builder calls > YAML config file > model defaults

Each layer only overrides what it explicitly sets -- everything else falls through to the next layer.
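
The fall-through behavior can be pictured as a deep merge applied from the lowest-priority layer to the highest. The following is a minimal sketch of that idea only, not the library's actual resolution code; `merge_layers` and the sample dictionaries are hypothetical names used for illustration.

```python
from copy import deepcopy


def _deep_update(target, overrides):
    """Copy values from overrides into target, recursing into nested sections."""
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            _deep_update(target[key], value)
        else:
            target[key] = deepcopy(value)


def merge_layers(*layers):
    """Merge config layers given in order from lowest to highest priority."""
    merged = {}
    for layer in layers:
        _deep_update(merged, layer)
    return merged


# Hypothetical layers: each one sets only the fields it cares about.
model_defaults = {"generation": {"num_records": 1000, "temperature": 0.9}}
yaml_config = {"generation": {"temperature": 0.8}}
cli_flags = {"generation": {"num_records": 2000}}

resolved = merge_layers(model_defaults, yaml_config, cli_flags)
print(resolved)  # {'generation': {'num_records': 2000, 'temperature': 0.8}}
```

The CLI layer changes only `num_records`, so `temperature` still comes from the YAML layer.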

Examples

Start from model defaults, override one field via CLI:

safe-synthesizer run --data-source data.csv --generation__num_records 2000

Use a YAML base for most settings, tune one field per run without editing the file:

safe-synthesizer run --config config.yaml --data-source data.csv \
  --training__learning_rate 0.001

Load a YAML base from Python, override a section with the builder:

from nemo_safe_synthesizer.sdk.library_builder import SafeSynthesizer
from nemo_safe_synthesizer.config import SafeSynthesizerParameters

config = SafeSynthesizerParameters.from_yaml("config.yaml")
synthesizer = (
    SafeSynthesizer(config)
    .with_data_source("data.csv")
    .with_generate(num_records=2000, temperature=0.8)  # overrides config.yaml values
)
synthesizer.run()

Note

Parameters that accept "auto" cannot be set to "auto" via CLI flags -- omit the flag to use the default, or set it in YAML.

See Using YAML Config Files with the CLI and SDK for more detail on combining config files with runtime overrides.


Training

NeMo Safe Synthesizer fine-tunes a pretrained language model on your tabular data using LoRA (Low-Rank Adaptation). See TrainingHyperparams for the full field list.

| Field | Default | Description | Guidance |
| --- | --- | --- | --- |
| training.learning_rate | "auto" | Initial learning rate for the AdamW optimizer. "auto" selects a model-specific default (Mistral: 1e-4, others: 5e-4) | Leave at "auto" for most cases; override with a float in (0, 1) to tune manually |
| training.batch_size | 1 | Per-device batch size | Leave at 1; increase gradient_accumulation_steps for a larger effective batch |
| training.gradient_accumulation_steps | 8 | Number of steps over which gradients are accumulated before each optimizer update; effective batch size = batch_size × this value | 8--32 typical |
| training.num_input_records_to_sample | "auto" | Number of records the model sees during training; a proxy for training time ("auto" or int) | First knob to increase if quality is low |
| training.lora_r | 32 | LoRA rank; lower values produce fewer trainable parameters | 16--64 typical; 32 is a reasonable default |
| training.lora_alpha_over_r | 1.0 | LoRA scaling ratio (alpha / rank) | Leave at 1.0 |
| training.pretrained_model | "HuggingFaceTB/SmolLM3-3B" | HuggingFace model ID or local path | See supported families below; TinyLlama/TinyLlama-1.1B-Chat-v1.0 for fast CPU/low-VRAM iteration |
| training.use_unsloth | "auto" | Use the Unsloth backend. Set to false or leave at "auto" when using DP ("auto" resolves to false when DP is enabled) | Leave at "auto" |
| training.quantize_model | false | Enable quantization to reduce VRAM usage | Enable if VRAM is limited; 8-bit has lower quality impact than 4-bit |
| training.quantization_bits | 8 | Bit width (4 or 8) when training.quantize_model is true | Prefer 8 over 4 for quality |
| training.attn_implementation | "kernels-community/vllm-flash-attn3" | Attention backend for model loading | Leave at default |
| training.rope_scaling_factor | "auto" | Scale the base model's context window via RoPE ("auto" or int) | Leave at "auto" |
| training.validation_ratio | 0.0 | Fraction of training data held out for validation loss monitoring | Leave at 0.0 unless you specifically want to monitor validation loss |
| training.max_vram_fraction | 0.8 | Fraction of total GPU VRAM to allocate for training. Must be in [0, 1] | Lower if other GPU consumers are active on the same device |
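
Two of the fields above are defined relative to other fields. The short sketch below works through that arithmetic with the table defaults; the helper names are hypothetical and exist only for this illustration.

```python
def effective_batch_size(batch_size: int, gradient_accumulation_steps: int) -> int:
    """Samples contributing to one optimizer update on a single device."""
    return batch_size * gradient_accumulation_steps


def lora_alpha(lora_r: int, lora_alpha_over_r: float) -> float:
    """Recover the LoRA alpha implied by the rank and the alpha / rank ratio."""
    return lora_alpha_over_r * lora_r


# Table defaults: batch_size=1, gradient_accumulation_steps=8,
# lora_r=32, lora_alpha_over_r=1.0
print(effective_batch_size(1, 8))  # 8
print(lora_alpha(32, 1.0))         # 32.0
```

Raising gradient_accumulation_steps to 32 with batch_size left at 1 gives an effective batch of 32 without increasing per-step memory.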

validation_ratio vs holdout

training.validation_ratio splits the training data to monitor validation loss during fine-tuning. data.holdout splits the full dataset to create a test set used by the evaluation stage. They serve different purposes and are applied at different stages.

Safe Synthesizer has explicit support (prompt templates, RoPE scaling, tokenizer handling) for the model families listed below. Models outside this list will raise a ValueError at startup.

The following models have been extensively tested for synthetic data generation with NeMo Safe Synthesizer (NSS). We encourage you to start with SmolLM3-3B (the default).

| Family | HuggingFace ID |
| --- | --- |
| SmolLM3 (default) | HuggingFaceTB/SmolLM3-3B |
| TinyLlama | TinyLlama/TinyLlama-1.1B-Chat-v1.0 |
| Mistral | mistralai/Mistral-7B-Instruct-v0.3 |

Benchmarking data for additional models will be added as they are validated. To understand the trade-offs with model selection, see Training.

When training.pretrained_model is set to a Hugging Face Hub model ID, the model is downloaded from the Hub; if a local path or an offline cache is provided, no download is performed. See Pre-Caching Models for details.

Security Note: Pretrained models from Hugging Face Hub

Loading and using pretrained models from Hugging Face Hub (or any public source) can expose your environment to significant risks, including arbitrary code execution (ACE) or remote code execution (RCE) vulnerabilities. Only use models you have reviewed yourself or from organizations and authors you explicitly trust. Malicious or modified models may contain embedded code, backdoors, or privacy-leaking mechanisms.


Generation

The generation stage controls how the fine-tuned model produces synthetic records. See GenerateParameters for the full API reference.

| Field | Default | Description | Guidance |
| --- | --- | --- | --- |
| generation.num_records | 1000 | Number of synthetic records to generate | Match or exceed input dataset size for best quality |
| generation.temperature | 0.9 | Sampling temperature; lower values produce more predictable, less varied output | 0.7--1.1 typical; lower if output is noisy, higher if too repetitive |
| generation.top_p | 1.0 | Nucleus sampling probability | Leave at 1.0; lower (e.g. 0.9) to reduce tail tokens |
| generation.repetition_penalty | 1.0 | Penalty for repeated tokens; increase slightly if generation produces repetitive output | 1.0--1.15 typical; start at 1.05 if repetition is a problem |
| generation.patience | 3 | Consecutive bad batches before stopping | Leave at default |
| generation.invalid_fraction_threshold | 0.8 | Invalid record fraction that triggers the patience counter | Leave at default |
| generation.use_structured_generation | false | Enable structured output to constrain record format (typically at the cost of reducing the quality of generated records and increasing generation time; use when the pipeline struggles to produce valid records) | Leave off unless the pipeline cannot produce valid records |
| generation.structured_generation_backend | "auto" | vLLM guided-decoding backend | Leave at "auto" |
| generation.structured_generation_schema_method | "regex" | Schema method ("regex" or "json_schema") | Leave at "regex" |
| generation.structured_generation_use_single_sequence | false | Match exactly one sequence when max_sequences_per_example is 1 | Leave at default |
| generation.enforce_timeseries_fidelity | false | Enforce time series order, intervals, and timestamps | Enable for time series data |
| generation.attention_backend | "auto" | vLLM attention backend | Leave at "auto" |
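
The interaction between generation.patience and generation.invalid_fraction_threshold can be sketched as a counter over consecutive bad batches. This is an illustration under stated assumptions, not the library's stopping logic: it assumes a batch counts as bad when its invalid fraction is at or above the threshold, and that any good batch resets the counter. The function name `should_stop` is hypothetical.

```python
def should_stop(invalid_fractions, patience=3, invalid_fraction_threshold=0.8):
    """Return True once `patience` consecutive batches are bad.

    Assumption: a batch is bad when its invalid fraction >= the threshold.
    """
    consecutive_bad = 0
    for fraction in invalid_fractions:
        if fraction >= invalid_fraction_threshold:
            consecutive_bad += 1
            if consecutive_bad >= patience:
                return True
        else:
            consecutive_bad = 0  # a good batch resets the counter
    return False


print(should_stop([0.9, 0.9, 0.1, 0.9, 0.9]))  # False: never three bad batches in a row
print(should_stop([0.1, 0.85, 0.9, 0.95]))     # True: three consecutive bad batches
```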

Advanced group-by validation knobs live under generation.validation:

| Knob | Default | Effect |
| --- | --- | --- |
| group_by_accept_no_delineator | false | Treat raw JSONL without BOS/EOS markers as a single group instead of rejecting |
| group_by_ignore_invalid_records | false | Drop invalid records from a group and keep the rest, rather than discarding the whole group |
| group_by_fix_non_unique_value | false | Normalize the group-by column to the first record's value when records disagree |
| group_by_fix_unordered_records | false | Re-sort records instead of rejecting out-of-order groups |
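
To make the effect of two of these knobs concrete, the sketch below applies the described repair rules to a single generated group. It is a simplified illustration, not the validator's implementation: `validate_group`, its arguments, and the record layout are all hypothetical.

```python
def validate_group(records, group_col, order_col,
                   fix_non_unique_value=False, fix_unordered_records=False):
    """Return the (possibly repaired) group, or None if the group is rejected."""
    if not records:
        return None
    records = [dict(r) for r in records]  # work on copies

    # Records in one group should share the same group-by value.
    first_value = records[0][group_col]
    if any(r[group_col] != first_value for r in records):
        if not fix_non_unique_value:
            return None
        for r in records:
            r[group_col] = first_value  # normalize to the first record's value

    # Records should already be sorted by the ordering column.
    ordered = sorted(records, key=lambda r: r[order_col])
    if ordered != records:
        if not fix_unordered_records:
            return None
        records = ordered  # re-sort instead of rejecting

    return records


group = [{"id": "a", "t": 2}, {"id": "b", "t": 1}]
print(validate_group(group, "id", "t"))  # None: rejected with both knobs off
print(validate_group(group, "id", "t",
                     fix_non_unique_value=True, fix_unordered_records=True))
# [{'id': 'a', 't': 1}, {'id': 'a', 't': 2}]
```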

See Example Generation -- Validation for guidance on when to enable each knob, and GenerateParameters for the full API reference.


Replacing PII

PII replacement detects and replaces personally identifiable information (PII) in your dataset before synthesis. It is on by default -- set replace_pii: null in YAML (or use --no-replace-pii on the CLI) to disable it. The replace_pii block is only needed when customizing entity types or classification via the SDK.

Key config parameters:

| Field | Default | Description | Guidance |
| --- | --- | --- | --- |
| replace_pii.globals.classify.enable_classify | true | Enable LLM-based PII column classification | When using the CLI, set NSS_INFERENCE_KEY (and optionally NSS_INFERENCE_ENDPOINT); set to false if no LLM endpoint is available |
| replace_pii.globals.classify.entities | (see default list) | Entity types used for LLM-based column classification. Defaults to 15 types covering names, addresses, phone numbers, emails, SSN, national/tax IDs, and credit/debit cards -- see PII Replacement and PiiReplacerConfig | Override to add or remove entity types from classification |
| replace_pii.globals.ner.ner_threshold | 0.3 | GLiNER confidence threshold for NER detection | Lower to catch more entities (more false positives); raise to reduce false positives |

See PiiReplacerConfig for the full schema.


Differential Privacy

Differential privacy (DP) provides a formal bound on what an adversary can learn about any individual record. Safe Synthesizer implements DP-SGD (Differentially Private Stochastic Gradient Descent) via Opacus.

| Field | Default | Description | Guidance |
| --- | --- | --- | --- |
| privacy.dp_enabled | false | Enable DP-SGD training | Enable for formal privacy guarantees |
| privacy.epsilon | 8.0 | Privacy budget -- lower values give stronger privacy | 4.0--12.0 typical; values below 4.0 may make convergence difficult |
| privacy.delta | "auto" | Privacy failure probability ("auto" or float) | Leave at "auto" |
| privacy.per_sample_max_grad_norm | 1.0 | Max L2 norm for per-sample gradients | Leave at 1.0 |
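
DP-SGD bounds each record's influence by clipping every per-sample gradient to a maximum L2 norm and then adding Gaussian noise to the summed gradients. The sketch below shows that generic recipe in plain Python to clarify what privacy.per_sample_max_grad_norm controls. It is not Opacus's implementation, and `noise_multiplier` is a hypothetical parameter here; in practice the noise scale is calibrated to the target epsilon and delta.

```python
import math
import random


def clip_gradient(grad, max_grad_norm=1.0):
    """Scale a per-sample gradient so its L2 norm is at most max_grad_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_grad_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_average(per_sample_grads, max_grad_norm=1.0, noise_multiplier=1.0, rng=None):
    """Clip each gradient, sum, add Gaussian noise, then average over the batch."""
    rng = rng or random.Random(0)
    clipped = [clip_gradient(g, max_grad_norm) for g in per_sample_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_grad_norm  # noise scales with the clipping bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [x / len(clipped) for x in noisy]


# A gradient with norm 5 is scaled down to norm 1; a small one is left unchanged.
print(clip_gradient([3.0, 4.0]))
print(clip_gradient([0.3, 0.4]))
```

Because every clipped gradient has norm at most max_grad_norm, no single record can move the update by more than that bound before noise is added.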

Compatibility constraints:

  • Set training.use_unsloth to false or leave it at "auto" -- "auto" resolves to false when DP is enabled (Unsloth is incompatible with Opacus's per-sample gradient hooks)
  • data.max_sequences_per_example must be 1 (or "auto", which resolves to 1 when DP is enabled) -- this limits each example's contribution to the gradient, which DP requires
  • Safe Synthesizer disables gradient checkpointing automatically when DP is enabled -- no user action required (gradient checkpointing is incompatible with Opacus)
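
The first two constraints can be checked against a config before a run. The following is a minimal sketch that assumes the config has been loaded into a nested dictionary keyed like the YAML sections; `dp_compatibility_issues` is a hypothetical helper, not part of the SDK, and the library performs its own validation.

```python
def dp_compatibility_issues(config):
    """List DP-related conflicts in a nested config dict (empty list if none)."""
    privacy = config.get("privacy") or {}
    if not privacy.get("dp_enabled", False):
        return []  # constraints only apply when DP is on

    issues = []
    use_unsloth = (config.get("training") or {}).get("use_unsloth", "auto")
    if use_unsloth not in (False, "auto"):
        issues.append('training.use_unsloth must be false or "auto" when DP is enabled')

    max_seq = (config.get("data") or {}).get("max_sequences_per_example", "auto")
    if max_seq not in (1, "auto"):
        issues.append('data.max_sequences_per_example must be 1 or "auto" when DP is enabled')
    return issues


config = {
    "privacy": {"dp_enabled": True, "epsilon": 8.0},
    "training": {"use_unsloth": True},
    "data": {"max_sequences_per_example": 10},
}
for issue in dp_compatibility_issues(config):
    print(issue)
```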

See DifferentialPrivacyHyperparams for the full field list. For DP error diagnostics, see Synthetic Data Quality.


Data

| Field | Default | Description | Guidance |
| --- | --- | --- | --- |
| data.holdout | 0.05 | Size of the holdout test set used for evaluation, given as a fraction (0--1) or an absolute count (>1) | 0.05--0.15 typical |
| data.max_holdout | 2000 | Upper cap on holdout size | Leave at default for most datasets |
| data.random_state | null | Random seed -- auto-generated if null; set an explicit integer for reproducible splits | Set to a fixed integer for reproducibility |
| data.group_training_examples_by | null | Column to group records by | Use for multi-row entities (e.g. patient ID, session ID) |
| data.order_training_examples_by | null | Column to order within groups (requires data.group_training_examples_by) | Use with a timestamp column for time series data |
| data.max_sequences_per_example | "auto" | Max sequences per example; "auto" resolves to 1 when DP is enabled and to 10 otherwise | Leave at "auto" |

See DataParameters for the full field list.
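
Since data.holdout accepts either a fraction or an absolute count and is capped by data.max_holdout, the resulting test-set size can be worked out ahead of time. The sketch below is an illustration only: `holdout_size` is a hypothetical helper, and the rounding rule and the treatment of a value of exactly 1 as a fraction are assumptions.

```python
def holdout_size(num_records: int, holdout: float = 0.05, max_holdout: int = 2000) -> int:
    """Resolve data.holdout to a record count, capped by data.max_holdout."""
    if holdout < 0:
        raise ValueError("holdout must be non-negative")
    if holdout <= 1:
        requested = round(num_records * holdout)  # fraction of the dataset (rounding assumed)
    else:
        requested = int(holdout)                  # absolute record count
    return min(requested, max_holdout, num_records)


print(holdout_size(10_000))               # 500: 5% of 10,000
print(holdout_size(100_000))              # 2000: 5% would be 5,000, capped by max_holdout
print(holdout_size(10_000, holdout=300))  # 300: absolute count
```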


Time Series

Experimental

Time series synthesis is an experimental feature. APIs and behavior may change between releases.

| Field | Default | Description | Guidance |
| --- | --- | --- | --- |
| time_series.is_timeseries | false | Enable time series mode | Enable for datasets with sequential time-ordered records |
| time_series.timestamp_column | null | Timestamp column name | Required when is_timeseries: true |
| time_series.timestamp_interval_seconds | null | Fixed interval between timestamps | Set if your data has a regular sampling interval |
| time_series.timestamp_format | null | strftime format or "elapsed_seconds" | Required when is_timeseries: true |
| time_series.start_timestamp | null | Override start timestamp for all groups (inferred from data if null) | Leave null to infer from data |
| time_series.stop_timestamp | null | Override stop timestamp for all groups (inferred from data if null) | Leave null to infer from data |
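
To show how a fixed interval and a strftime format relate, the sketch below generates a regularly spaced timestamp series with the standard library. It only illustrates the meaning of timestamp_interval_seconds and a strftime-style timestamp_format; `regular_timestamps` is a hypothetical helper, and the "elapsed_seconds" format is not covered.

```python
from datetime import datetime, timedelta


def regular_timestamps(start, count, interval_seconds, timestamp_format):
    """Format `count` timestamps spaced `interval_seconds` apart, beginning at `start`."""
    return [
        (start + timedelta(seconds=i * interval_seconds)).strftime(timestamp_format)
        for i in range(count)
    ]


stamps = regular_timestamps(
    start=datetime(2024, 1, 1),
    count=3,
    interval_seconds=60,
    timestamp_format="%Y-%m-%d %H:%M:%S",
)
print(stamps)
# ['2024-01-01 00:00:00', '2024-01-01 00:01:00', '2024-01-01 00:02:00']
```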

See TimeSeriesParameters for the full schema. For detailed descriptions and constraints, see the Time Series README.


Evaluation

| Field | Default | Description | Guidance |
| --- | --- | --- | --- |
| evaluation.enabled | true | Master switch for evaluation | Leave enabled |
| evaluation.mia_enabled | true | Membership Inference Attack (MIA) -- privacy risk assessment | Disable to speed up evaluation if privacy assessment is not needed |
| evaluation.aia_enabled | true | Attribute Inference Attack (AIA) -- measures whether an attacker can infer a sensitive attribute from quasi-identifiers in the synthetic data | Disable to speed up evaluation if AIA is not needed |
| evaluation.pii_replay_enabled | true | PII replay detection -- checks whether PII from training appears in synthetic data | Leave enabled if PII replacement is used |
| evaluation.sqs_report_columns | 250 | Max columns in the Synthetic Quality Score (SQS) report | Increase if your dataset has more columns |
| evaluation.sqs_report_rows | 5000 | Max rows in the SQS report | Increase for larger datasets (impacts report generation time) |
| evaluation.quasi_identifier_count | 3 | Number of quasi-identifiers sampled for AIA (auto-reduced for small datasets) | Leave at default |
| evaluation.mandatory_columns | null | Number of mandatory columns that must be used in evaluation | Leave at default |

See EvaluationParameters for the full API reference.


Validate and Modify Configuration

config validate

Check your config for errors and display the merged parameters:

safe-synthesizer config validate --config config.yaml
safe-synthesizer config validate --config config.yaml --training__learning_rate 0.001

Fields set to "auto" remain as "auto" in the output -- auto-resolution happens at runtime during process_data(), not at validation time. To see resolved values, check safe-synthesizer-config.json in the run directory after a pipeline run.
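
To see which fields are still waiting for runtime resolution, you can scan a validated config for remaining "auto" values. The sketch below assumes the config is available as a nested dictionary (for example, parsed from the JSON that config validate prints); `auto_fields` is a hypothetical helper and the sample dictionary is illustrative.

```python
def auto_fields(config, prefix=""):
    """Return dotted paths of every field whose value is still the string "auto"."""
    found = []
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            found.extend(auto_fields(value, path))
        elif value == "auto":
            found.append(path)
    return found


validated = {
    "training": {"learning_rate": "auto", "batch_size": 1},
    "privacy": {"delta": "auto", "epsilon": 8.0},
}
print(auto_fields(validated))  # ['training.learning_rate', 'privacy.delta']
```

Running the same helper on the contents of safe-synthesizer-config.json after a pipeline run should, under the same structural assumption, show which of those fields were resolved.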

config modify

Modify a configuration and optionally save the result:

safe-synthesizer config modify --config config.yaml --training__learning_rate 0.001 --output modified.yaml

| Option | Description |
| --- | --- |
| --config | Path to YAML config file (optional -- omit to build from overrides only) |
| --output | Path to write modified YAML config (prints JSON to stdout if omitted) |

config create

Create a new configuration from defaults:

safe-synthesizer config create --output config.yaml
safe-synthesizer config create --training__pretrained_model "HuggingFaceTB/SmolLM3-3B" --output config.yaml

| Option | Description |
| --- | --- |
| --output / -o | Path to write YAML config (prints JSON to stdout if omitted) |

CLI Override Syntax

Use double underscores to address nested fields:

safe-synthesizer run --config config.yaml --data-source data.csv \
  --training__learning_rate 0.001 \
  --data__holdout 0.1 \
  --generation__num_records 5000
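
The double-underscore convention maps each flag to a path in the nested config. The sketch below illustrates that mapping only; it is not the CLI's parser. `parse_overrides` is a hypothetical helper that assumes flags and values alternate, and it leaves values as strings rather than coercing types.

```python
def parse_overrides(argv):
    """Turn ["--training__learning_rate", "0.001"] into a nested dict of overrides."""
    nested = {}
    for flag, raw_value in zip(argv[0::2], argv[1::2]):
        path = flag.lstrip("-").split("__")  # "training__learning_rate" -> ["training", "learning_rate"]
        node = nested
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = raw_value  # values stay as strings in this sketch
    return nested


overrides = parse_overrides([
    "--training__learning_rate", "0.001",
    "--data__holdout", "0.1",
    "--generation__num_records", "5000",
])
print(overrides)
# {'training': {'learning_rate': '0.001'}, 'data': {'holdout': '0.1'}, 'generation': {'num_records': '5000'}}
```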

Override precedence

CLI overrides > dataset registry overrides > YAML config file > model defaults. See Configuration Precedence for examples. Parameters that accept "auto" cannot be set to "auto" via CLI flags -- omit the flag to use the default, or set it in YAML.


Configuration Sections

| YAML Key | SDK Method | API Reference |
| --- | --- | --- |
| data | with_data() | DataParameters |
| training | with_train() | TrainingHyperparams |
| generation | with_generate() | GenerateParameters |
| evaluation | with_evaluate() | EvaluationParameters |
| replace_pii (null to disable) | with_replace_pii() / with_replace_pii(enable=False) | PiiReplacerConfig |
| privacy (null to disable) | with_differential_privacy() | DifferentialPrivacyHyperparams |
| time_series | with_time_series() | TimeSeriesParameters |