NeMo Safe Synthesizer
NeMo Safe Synthesizer creates private, safe versions of sensitive tabular datasets -- entirely synthetic data with no one-to-one mapping to your original records. It uses LLM fine-tuning with optional differential privacy to produce high-quality datasets that preserve the statistical properties and utility of your data for downstream AI tasks while ensuring privacy compliance and protecting sensitive information.
Key Features
- Privacy-first synthetic data -- PII detection and replacement, with optional differential privacy during fine-tuning (via Opacus)
- LLM fine-tuning -- LoRA fine-tuning optimized for tabular data, including numeric, categorical, and text columns
- Fast inference -- vLLM-powered generation with optional structured output enforcement
- Comprehensive evaluation -- Privacy and quality metrics in an in-depth HTML report
- Flexible interfaces -- CLI for scripting, Python SDK for programmatic workflows, YAML configuration
System Requirements
NeMo Safe Synthesizer requires a Linux machine with an NVIDIA GPU (A100 80GB+ recommended) and CUDA 12.9+ to run the training and generation pipeline. macOS, Windows, and Apple Silicon are not supported for pipeline execution. A CPU-only install is available for development and configuration validation -- see Getting Started.
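Before installing, it can help to confirm that a host meets the basic requirements above. The following is a minimal preflight sketch, not part of NeMo Safe Synthesizer: `check_host` is a hypothetical helper, and it assumes that finding `nvidia-smi` on the PATH is a reasonable proxy for an installed NVIDIA driver. It does not verify the CUDA version or the GPU model.

```python
import platform
import shutil


def check_host(system: str, has_nvidia_smi: bool) -> list[str]:
    """Return the problems that would block pipeline execution (hypothetical helper)."""
    problems = []
    # The pipeline is documented as Linux-only; macOS and Windows are unsupported.
    if system != "Linux":
        problems.append(f"unsupported OS for pipeline execution: {system}")
    # Assumption: nvidia-smi on PATH indicates an NVIDIA driver is installed.
    if not has_nvidia_smi:
        problems.append("nvidia-smi not found on PATH; no NVIDIA driver detected")
    return problems


if __name__ == "__main__":
    issues = check_host(platform.system(), shutil.which("nvidia-smi") is not None)
    print("ok" if not issues else "\n".join(issues))
```

An empty result only means the two coarse checks passed; a CPU-only development install does not need them at all.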
Next Steps
- Getting Started -- Install the package, set up your environment, and run your first synthetic data pipeline in minutes.
- Product Overview -- Learn about the pipeline steps: replace PII, synthesize data, evaluate.
- Tutorials -- Follow hands-on tutorials to generate synthetic data.
- User Guide -- Configure and run the pipeline via YAML, CLI, SDK, or environment variables.
- Developer Guide -- Browse the auto-generated API reference and dive into the architecture details.
- Dev Notes -- Read developer blog posts.
Telemetry & Privacy
NeMo Safe Synthesizer includes an optional function to share anonymous telemetry data with NVIDIA for product improvement. Data collected is limited to run-level operational metrics (such as final run status, processing time, record and token counts, configuration parameters, top-level quality and privacy scores, base model used, deployment type, and GPU type). No user or device information is collected. This data is used to prioritize product improvements and will be shared in aggregate with the community. It is not used to track any individual user behavior.
You may opt out of telemetry collection at any time. Opting out applies only to data collection by the NeMo Safe Synthesizer library itself. To disable telemetry in a YAML config, set:
To disable telemetry for one CLI invocation, pass --emit_telemetry false.
To disable telemetry for the current shell, set NEMO_TELEMETRY_ENABLED=false (other accepted disabling values: 0, no) in your environment before running.
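When the pipeline is driven from Python rather than a shell, the same environment variable can be set in-process. The sketch below assumes the library reads NEMO_TELEMETRY_ENABLED after it has been set, so the assignment should happen before the package is imported or the pipeline is started. `telemetry_disabled` is a hypothetical helper that mirrors the disabling values listed above; the case-insensitive comparison is an assumption, not documented behavior.

```python
import os

# Values the documentation lists as disabling telemetry.
DISABLING_VALUES = {"false", "0", "no"}


def telemetry_disabled(env) -> bool:
    """Report whether the variable holds a disabling value (hypothetical helper)."""
    value = env.get("NEMO_TELEMETRY_ENABLED", "")
    # Assumption: surrounding whitespace and letter case are ignored.
    return value.strip().lower() in DISABLING_VALUES


# Applies to this process and to any child processes it starts afterwards.
os.environ["NEMO_TELEMETRY_ENABLED"] = "false"
```

The helper only inspects the variable; it says nothing about the YAML setting or the CLI flag, which are configured separately.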
Use of third-party endpoints, including NVIDIA Build: NeMo Safe Synthesizer can be configured to use various inference endpoints, including build.nvidia.com (NVIDIA Build). If you choose to use NVIDIA Build or any other third-party endpoint, that endpoint's own terms of service and privacy practices apply independently of this library. Any opt-out you exercise within NeMo Safe Synthesizer does not extend to data collection by your chosen endpoint. NVIDIA Build is intended for evaluation and testing purposes only and may not be used in production environments. Do not submit any confidential information or personal data when using NVIDIA Build.
Contact
License
NeMo Safe Synthesizer is licensed under the Apache License 2.0.