Architecture¶
Overview¶
NeMo Safe Synthesizer is a comprehensive package for generating safe synthetic data with privacy guarantees. The architecture follows a pipeline design with configurable stages for data processing, PII replacement, training, generation, and evaluation.
High-Level Architecture¶
graph TB
subgraph entryPoints [Entry Points]
CLI["CLI Interface"]
SDK["SDK Interface"]
end
subgraph configLayer [Configuration Layer]
ConfigBuilder["ConfigBuilder"]
SafeSynthesizerParams["SafeSynthesizerParameters"]
subgraph configComponents [Config Components]
DataConfig["DataParameters"]
TrainConfig["TrainingHyperparams"]
GenConfig["GenerateParameters"]
EvalConfig["EvaluationParameters"]
PIIConfig["PiiReplacerConfig"]
DPConfig["DifferentialPrivacyHyperparams"]
end
end
subgraph dataProcessing [Data Processing Pipeline]
DataSource["Input Data"]
Holdout["Holdout - Train/Test Split"]
PIIReplacer["PII Replacer - NemoPII"]
DataActions["ActionExecutor"]
Assembler["ExampleAssembler"]
end
subgraph trainingBackend [Training Backend]
TrainingBackendBase["TrainingBackend - Abstract"]
HFBackend["HuggingFaceBackend"]
UnslothBackend["UnslothBackend"]
subgraph trainingComponents [Training Components]
ModelLoader["Model Loader"]
QuantizationComp["Quantization 4-bit/8-bit"]
LoRA["LoRA Config - PEFT"]
DPTrainer["Differential Privacy"]
Callbacks["Training Callbacks"]
end
end
subgraph generationBackend [Generation Backend]
GenBackend["GeneratorBackend - Abstract"]
VLLMBackend["VllmBackend"]
subgraph generationComponents [Generation Components]
RegexManager["RegexManager"]
BatchGen["BatchGenerator"]
Processors["Processors"]
Stopping["Stopping Criteria"]
end
end
subgraph evaluationSystem [Evaluation System]
EvaluatorComp["Evaluator"]
subgraph evaluationComponents [Evaluation Components]
DataPrivacy["Data Privacy Score"]
PIIReplay["PII Replay Detection"]
MembershipInf["Membership Inference Protection"]
AttributeInf["Attribute Inference Protection"]
Distribution["Column Distributions"]
Correlation["Correlations"]
TextSimilarity["Text Semantic Similarity"]
StructureSimilarity["Text Structure Similarity"]
SQS["SQS Score"]
end
subgraph reporting [Reporting]
ReportGen["Report Generator"]
HTMLReport["HTML Report"]
end
end
CLI --> ConfigBuilder
SDK --> ConfigBuilder
ConfigBuilder --> SafeSynthesizerParams
SafeSynthesizerParams --> DataConfig
SafeSynthesizerParams --> TrainConfig
SafeSynthesizerParams --> GenConfig
SafeSynthesizerParams --> EvalConfig
SafeSynthesizerParams --> PIIConfig
SafeSynthesizerParams --> DPConfig
DataSource --> Holdout
Holdout --> PIIReplacer
PIIReplacer --> DataActions
DataActions --> Assembler
Assembler --> TrainingBackendBase
TrainingBackendBase --> HFBackend
TrainingBackendBase --> UnslothBackend
HFBackend --> ModelLoader
HFBackend --> QuantizationComp
HFBackend --> LoRA
HFBackend --> DPTrainer
HFBackend --> Callbacks
HFBackend --> GenBackend
GenBackend --> VLLMBackend
VLLMBackend --> RegexManager
VLLMBackend --> BatchGen
VLLMBackend --> Processors
VLLMBackend --> Stopping
VLLMBackend --> EvaluatorComp
EvaluatorComp --> DataPrivacy
EvaluatorComp --> PIIReplay
EvaluatorComp --> MembershipInf
EvaluatorComp --> AttributeInf
EvaluatorComp --> Distribution
EvaluatorComp --> Correlation
EvaluatorComp --> TextSimilarity
EvaluatorComp --> StructureSimilarity
EvaluatorComp --> SQS
EvaluatorComp --> ReportGen
ReportGen --> HTMLReport
Configuration System¶
Two paths produce a SafeSynthesizerParameters object: the CLI path (via Click
decorators and YAML merging) and the SDK path (via the builder pattern). Both
converge on the same Pydantic model and handle nullable sub-configs
(replace_pii, privacy) uniformly -- None means disabled.
flowchart TB
subgraph cli [CLI Entry Points]
run_cmd["safe-synthesizer run"]
train_cmd["safe-synthesizer run train"]
gen_cmd["safe-synthesizer run generate"]
val_cmd["safe-synthesizer config validate"]
mod_cmd["safe-synthesizer config modify"]
create_cmd["safe-synthesizer config create"]
end
subgraph decorators [Decorator Layer]
common["@common_run_options"]
pydantic["@pydantic_options SSP"]
end
subgraph collector ["pydantic_click_options.py"]
collect["_collect_params"]
leaf["LeafParam"]
flag["FlagParam"]
parse["parse_overrides"]
end
subgraph settings [CLI Settings]
clisettings["CLISettings.from_cli_kwargs"]
common_setup["common_setup"]
end
subgraph merge [Config Assembly]
merge_overrides["merge_overrides"]
model_validate["SSP.model_validate"]
from_yaml["SSP.from_yaml"]
end
subgraph sdk [SDK Entry Point]
builder["SafeSynthesizer / ConfigBuilder"]
with_methods["with_replace_pii / with_differential_privacy / with_train / ..."]
resolve["_resolve_nss_config"]
end
subgraph config [SafeSynthesizerParameters]
data_p["DataParameters"]
training_p["TrainingHyperparams"]
gen_p["GenerateParameters"]
eval_p["EvaluationParameters"]
pii_p["PiiReplacerConfig | None"]
dp_p["DifferentialPrivacyHyperparams | None"]
ts_p["TimeSeriesParameters"]
end
subgraph runtime [Runtime Checks]
pii_check["replace_pii is not None"]
dp_check["privacy is not None"]
end
run_cmd & train_cmd & gen_cmd --> common
run_cmd & train_cmd & gen_cmd & val_cmd & mod_cmd & create_cmd --> pydantic
pydantic --> collect
collect --> leaf & flag
flag -->|"--no-replace-pii"| parse
flag -->|"--no-privacy"| parse
leaf -->|"--training__lr etc"| parse
parse -->|"overrides dict"| clisettings
clisettings --> common_setup
common_setup --> merge_overrides
val_cmd & mod_cmd & create_cmd -->|"direct"| merge_overrides
merge_overrides --> from_yaml
merge_overrides --> model_validate
model_validate --> config
builder --> with_methods
with_methods --> resolve
resolve -->|"SSP constructor"| config
config --> data_p & training_p & gen_p & eval_p & pii_p & dp_p & ts_p
pii_p --> pii_check
dp_p --> dp_check
The configuration sources available, and therefore the precedence among them, depend on how you run the pipeline. Settings are resolved in the following order, from highest priority to lowest:
- CLI: CLI flags > dataset registry overrides > YAML config file > defaults
- SDK: Python SDK builder calls > YAML config file > defaults
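The CLI precedence order above can be read as a layered merge, where each higher-priority layer overrides the one beneath it. The following is a minimal sketch under stated assumptions, not the package's implementation: `deep_merge` and `resolve_settings` are hypothetical helpers, and the section and field names in the sample dictionaries are illustrative.

```python
from copy import deepcopy
from typing import Any


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict in which values from `override` win over `base`, recursively."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, and None replace the lower-priority value outright.
            merged[key] = deepcopy(value)
    return merged


def resolve_settings(*layers: dict[str, Any]) -> dict[str, Any]:
    """Merge layers passed from lowest to highest priority."""
    resolved: dict[str, Any] = {}
    for layer in layers:
        resolved = deep_merge(resolved, layer)
    return resolved


# Illustrative values only; the real field names and defaults may differ.
defaults = {"training": {"learning_rate": 0.0005, "num_epochs": 1}}
yaml_config = {"training": {"num_epochs": 3}}
registry_overrides = {"generation": {"num_records": 1000}}
cli_flags = {"training": {"learning_rate": 0.0001}}

settings = resolve_settings(defaults, yaml_config, registry_overrides, cli_flags)
```

Passing the layers from lowest to highest priority keeps the merge loop trivial: the last writer wins, which matches the "CLI flags > dataset registry overrides > YAML config file > defaults" ordering.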
Nullable sub-configs (PiiReplacerConfig | None, DifferentialPrivacyHyperparams | None)
use None as the sole disabled signal. The @pydantic_options decorator auto-generates
boolean --no-<field> flags for these fields (--no-replace-pii and --no-privacy in the
diagram above); parse_overrides translates them into {field: None} in the overrides dict.
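To make the flag translation concrete, here is a hedged sketch of how flat CLI keyword arguments could become a nested overrides dict. `sketch_parse_overrides` is a hypothetical stand-in for `parse_overrides`, whose real signature is not shown in this document. The double-underscore nesting is inferred from the `--training__lr` label in the diagram, and the `generation__num_records` key is illustrative.

```python
from typing import Any

# Assumption: these are the nullable sub-config fields named in the text.
NULLABLE_FIELDS = ("replace_pii", "privacy")


def sketch_parse_overrides(cli_kwargs: dict[str, Any]) -> dict[str, Any]:
    """Turn flat CLI kwargs into a nested overrides dict.

    Hypothetical stand-in for parse_overrides; the real function may differ.
    """
    overrides: dict[str, Any] = {}
    for name, value in cli_kwargs.items():
        if value is None:
            continue  # option was not supplied on the command line
        if name.startswith("no_") and name[3:] in NULLABLE_FIELDS:
            if value:  # flag was passed, so None disables the sub-config
                overrides[name[3:]] = None
            continue
        # "training__lr" becomes {"training": {"lr": ...}}
        *parents, leaf = name.split("__")
        node = overrides
        for parent in parents:
            node = node.setdefault(parent, {})
        node[leaf] = value
    return overrides


overrides = sketch_parse_overrides(
    {
        "training__lr": 0.001,
        "no_replace_pii": True,   # user passed the disable flag
        "no_privacy": False,      # flag not passed, so privacy is left untouched
        "generation__num_records": None,  # option omitted
    }
)
```

Omitted options are dropped rather than written as None, so that None in the result always means "explicitly disabled" and never "not specified".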
Execution Flow¶
sequenceDiagram
participant User
participant CLI_SDK as CLI/SDK
participant ConfigBuilder
participant Holdout
participant PIIReplacer
participant Assembler
participant Training
participant Generation
participant Evaluation
participant Report
User->>CLI_SDK: Input Data + Config
CLI_SDK->>ConfigBuilder: Build Configuration
ConfigBuilder->>Holdout: Split Train/Test
Holdout->>PIIReplacer: Process Training Data
Note over PIIReplacer: On by default: replace PII entities with synthetic values
PIIReplacer->>Assembler: Tokenize and Assemble
Note over Assembler: Create training examples with proper formatting
Assembler->>Training: Training Dataset
Note over Training: Fine-tune LLM with LoRA + optional DP
Training->>Generation: Adapter Path
Note over Generation: Generate synthetic records using VLLM backend
Generation->>Evaluation: Synthetic Data
Note over Evaluation: Run all evaluation components and metrics
Evaluation->>Report: Evaluation Results
Report->>User: HTML Report + Synthetic Data
Simple Overview¶
flowchart LR
B[("data")]
B --> C("PII replacement\non by default")
C --> D("assemble examples")
D --> E("Fine-tune")
E --> F["Generate Samples"]
F --> G["Evaluate"]
Component Details¶
1. Configuration Layer¶
Path: src/nemo_safe_synthesizer/config/
- SafeSynthesizerParameters: main configuration class that aggregates all parameters
- DataParameters: dataset and preprocessing configurations
- TrainingHyperparams: training settings (learning rate, epochs, batch size, etc.)
- GenerateParameters: generation settings (temperature, top_p, num_records, etc.)
- EvaluationParameters: evaluation component toggles and settings
- PiiReplacerConfig: PII detection and replacement settings
- DifferentialPrivacyHyperparams: DP training parameters (epsilon, delta, clipping norm)
2. Data Processing Pipeline¶
Path: src/nemo_safe_synthesizer/data_processing/
- Holdout (holdout/): splits data into train/test sets with stratification support
- NemoPII (pii_replacer/): detects PII entities (names, emails, SSN, etc.) and replaces them with synthetic but realistic values
- ActionExecutor (actions/): executes data transformations (date normalization, distributions)
- ExampleAssembler (assembler.py): converts records to JSON format, tokenizes for model training, handles truncation and padding
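As an illustration of the first and last steps in this list, the sketch below performs a stratified holdout split and serializes records to JSON text. It uses only the standard library as a stand-in; `stratified_holdout` and `to_json_example` are hypothetical names, and PII replacement, data actions, tokenization, truncation, and padding are deliberately left out.

```python
import json
import random
from collections import defaultdict


def stratified_holdout(records, stratify_key, test_fraction=0.2, seed=0):
    """Split records into (train, test), sampling each stratum separately."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[record[stratify_key]].append(record)

    train, test = [], []
    for group in groups.values():
        shuffled = group[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test


def to_json_example(record):
    """Serialize one record as a compact JSON string, ready for tokenization."""
    return json.dumps(record, separators=(",", ":"), sort_keys=True)


# Ten toy records with two equally sized strata.
records = [{"id": i, "label": "a" if i % 2 else "b"} for i in range(10)]
train, test = stratified_holdout(records, "label", test_fraction=0.2)
examples = [to_json_example(r) for r in train]
```

Splitting per stratum keeps the label proportions of the held-out set close to those of the full dataset, which is the point of stratification.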
3. Training Backend¶
Path: src/nemo_safe_synthesizer/training/
| Backend | Description |
|---|---|
| HuggingFaceBackend | Quantization (4-bit, 8-bit), LoRA via PEFT, Differential Privacy via Opacus |
| UnslothBackend | Optimized training with Unsloth library |
4. Generation Backend¶
Path: src/nemo_safe_synthesizer/generation/
- VllmBackend: fast inference using vLLM with LoRA adapter support
- RegexManager: enforces structured output (JSON format)
- BatchGenerator: manages batch generation with retry logic
- Processors: post-processing of generated text
5. Evaluation System¶
Path: src/nemo_safe_synthesizer/evaluation/
Components include: Data Privacy Score, PII Replay Detection, Membership Inference Protection, Attribute Inference Protection, Column Distributions, Correlations, Text Semantic Similarity, Text Structure Similarity, and SQS Score.
6. Supporting Modules¶
- LLM Utilities (llm/): model metadata, loading, and memory management
- Privacy Module (privacy/dp_transformers/): Opacus integration for DP-SGD
- Artifacts (artifacts/): data quality checks, field analysis, metadata management
- Records System (data_processing/records/): JSON record and fragment handling
Key Design Patterns¶
Builder Pattern¶
The ConfigBuilder and SafeSynthesizer classes use the builder pattern for fluent configuration:
    synthesizer = (
        SafeSynthesizer(config)
        .with_data_source(df)
        .with_train(learning_rate=0.0001)
        .with_generate(num_records=10000)
        .with_evaluate(enabled=True)
    )
    synthesizer.run()
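The chaining in the example works because each `with_*` method returns the builder itself. A minimal sketch of that mechanism follows; `SketchBuilder` is a hypothetical class written for illustration and does not reproduce the real ConfigBuilder or SafeSynthesizer API beyond the method names shown above.

```python
from typing import Any


class SketchBuilder:
    """Hypothetical fluent builder; not the package's ConfigBuilder."""

    def __init__(self) -> None:
        self._sections: dict[str, dict[str, Any]] = {}

    def _set(self, section: str, values: dict[str, Any]) -> "SketchBuilder":
        # Later calls for the same section update earlier ones.
        self._sections.setdefault(section, {}).update(values)
        return self  # returning self is what enables chaining

    def with_train(self, **kwargs: Any) -> "SketchBuilder":
        return self._set("training", kwargs)

    def with_generate(self, **kwargs: Any) -> "SketchBuilder":
        return self._set("generation", kwargs)

    def build(self) -> dict[str, dict[str, Any]]:
        """Return a copy so callers cannot mutate the builder's state."""
        return {name: dict(values) for name, values in self._sections.items()}


config_dict = (
    SketchBuilder()
    .with_train(learning_rate=0.0001)
    .with_generate(num_records=10000)
    .build()
)
```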
Backend Abstraction¶
Training and generation backends use abstract base classes to allow multiple implementations:
- Training: HuggingFace, Unsloth
- Generation: vLLM (extensible to others)
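The abstraction can be sketched with Python's `abc` module. The interface below is an assumption made for illustration: `SketchGeneratorBackend` and its single `generate` method are hypothetical, and the real GeneratorBackend may define different methods and arguments.

```python
from abc import ABC, abstractmethod


class SketchGeneratorBackend(ABC):
    """Hypothetical interface; the real GeneratorBackend's methods may differ."""

    @abstractmethod
    def generate(self, num_records: int) -> list[dict]:
        """Return `num_records` generated records."""


class EchoBackend(SketchGeneratorBackend):
    """Toy implementation that returns placeholder records."""

    def generate(self, num_records: int) -> list[dict]:
        return [{"row": i} for i in range(num_records)]


def run_generation(backend: SketchGeneratorBackend, num_records: int) -> list[dict]:
    # The caller depends only on the abstract interface, not on a concrete backend.
    return backend.generate(num_records)


rows = run_generation(EchoBackend(), 3)
```

Because `generate` is abstract, instantiating `SketchGeneratorBackend` directly raises `TypeError`, so an incomplete backend fails at construction time rather than midway through a run.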
Component-Based Evaluation¶
Evaluation uses a modular component system where each metric is a separate component that can be enabled/disabled.
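One way to picture this design is a list of named metric components, each with its own on/off switch. The sketch below is illustrative only: `SketchComponent`, `evaluate`, and both toy metrics are hypothetical and are not the package's Component base class or any of its actual scores.

```python
from dataclasses import dataclass
from typing import Callable

Rows = list[dict]


@dataclass
class SketchComponent:
    """Hypothetical evaluation component: a named metric plus an on/off switch."""

    name: str
    compute: Callable[[Rows, Rows], float]
    enabled: bool = True


def row_count_ratio(reference: Rows, synthetic: Rows) -> float:
    """Toy metric: synthetic row count relative to the reference row count."""
    return len(synthetic) / len(reference)


def shared_column_fraction(reference: Rows, synthetic: Rows) -> float:
    """Toy metric: fraction of reference columns that also appear in synthetic rows."""
    ref_cols = set().union(*(row.keys() for row in reference))
    syn_cols = set().union(*(row.keys() for row in synthetic))
    return len(ref_cols & syn_cols) / len(ref_cols)


def evaluate(components: list[SketchComponent], reference: Rows, synthetic: Rows) -> dict[str, float]:
    """Run only the enabled components and collect their scores by name."""
    return {c.name: c.compute(reference, synthetic) for c in components if c.enabled}


reference = [{"age": 30, "city": "Oslo"}, {"age": 41, "city": "Rome"}]
synthetic = [{"age": 35, "city": "Lima"}]
components = [
    SketchComponent("row_count_ratio", row_count_ratio),
    SketchComponent("shared_column_fraction", shared_column_fraction, enabled=False),
]
results = evaluate(components, reference, synthetic)
```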
Pipeline Architecture¶
The execution follows a linear pipeline: Data → Holdout Split → PII Replacement → Example Assembly → Training → Generation → Evaluation
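The pipeline can be sketched as an ordered list of stage functions, with the PII stage included only when its config is not None, mirroring the `replace_pii is not None` runtime check described earlier. All names here (`make_stage`, `build_pipeline`, `run_pipeline`) are hypothetical, and each stage is a placeholder that only records that it ran.

```python
from typing import Any, Callable, Optional

State = dict[str, Any]
Stage = Callable[[State], State]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records that it ran."""

    def stage(state: State) -> State:
        return {**state, "history": state["history"] + [name]}

    return stage


def build_pipeline(replace_pii: Optional[dict[str, Any]]) -> list[Stage]:
    """Assemble stages in execution order, skipping PII replacement when disabled."""
    stages = [make_stage("holdout")]
    if replace_pii is not None:  # None is the only value that disables the stage
        stages.append(make_stage("pii_replacement"))
    stages += [make_stage(n) for n in ("assemble", "train", "generate", "evaluate")]
    return stages


def run_pipeline(stages: list[Stage], state: State) -> State:
    for stage in stages:
        state = stage(state)
    return state


with_pii = run_pipeline(build_pipeline(replace_pii={}), {"history": []})
without_pii = run_pipeline(build_pipeline(replace_pii=None), {"history": []})
```

Note that an empty config dict still enables the stage; checking `is not None` rather than truthiness is what keeps None as the sole disabled signal.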
Technology Stack¶
| Category | Tools |
|---|---|
| ML Frameworks | PyTorch, Transformers, PEFT (LoRA) |
| Inference | vLLM |
| Privacy | Opacus for Differential Privacy |
| Data | Pandas, HuggingFace Datasets |
| Config | Pydantic |
| CLI | Click |
| Visualization | Jinja2, HTML/CSS/JS |
Extension Points¶
- Custom Training Backend: implement the TrainingBackend abstract class
- Custom Generation Backend: implement the GeneratorBackend abstract class
- Custom Evaluation Component: extend the Component base class
- Custom Data Actions: add to data_processing/actions/
- Custom PII Detectors: extend the NER pipeline