NeMo Safe Synthesizer Tutorial: The Basics
What you'll learn
In this notebook, we'll explore the fundamentals of NeMo Safe Synthesizer: PII replacement, training on a sample dataset, generating synthetic data, and evaluating quality and privacy.
This library supports numeric, categorical, and text fields within the training data and generates realistic synthetic data that mirrors the structure of your data. A full run takes about 15 minutes on an A100.
Prerequisites
This notebook requires a Linux machine with an NVIDIA GPU (H100 recommended, A100 minimum) and CUDA 12.8+. It will not run on macOS, Windows, or Apple Silicon.
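The requirements above can be partly checked before installing anything. The following is a minimal sketch using only the standard library; the helper name `preflight_problems` is hypothetical and not part of NeMo Safe Synthesizer. It only checks the operating system and whether `nvidia-smi` is on the PATH; it does not verify the GPU model or the CUDA version.

```python
import platform
import shutil


def preflight_problems(system=None, has_nvidia_smi=None):
    """Return a list of human-readable problems; an empty list means the basic checks passed."""
    # Arguments default to values detected on the current machine.
    if system is None:
        system = platform.system()
    if has_nvidia_smi is None:
        has_nvidia_smi = shutil.which("nvidia-smi") is not None

    problems = []
    if system != "Linux":
        problems.append(f"Unsupported OS: {system} (Linux is required)")
    if not has_nvidia_smi:
        problems.append("nvidia-smi not found on PATH; no NVIDIA driver detected")
    return problems


for problem in preflight_problems():
    print("WARNING:", problem)
```

If the loop prints nothing, the basic checks passed; running `nvidia-smi` in a terminal is still the simplest way to confirm the GPU model and driver-supported CUDA version.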
┠Install Safe Synthesizer¶
Run the cell below to install NeMo Safe Synthesizer (engine and CUDA 12.8) and the datasets library for the sample dataset.
%%capture
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
!uv pip install "nemo-safe-synthesizer[engine,cu128]" --index https://flashinfer.ai/whl/cu128 --index https://download.pytorch.org/whl/cu128 --index-strategy unsafe-best-match
!uv pip install datasets
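Because `%%capture` hides the installer output, a failed install can go unnoticed. As a quick sanity check, the sketch below asks Python whether the two packages can be located, without importing them. The helper name `is_importable` is hypothetical; the module names are the ones imported later in this notebook.

```python
import importlib.util


def is_importable(module_name):
    """Return True if Python can locate the top-level module without importing it."""
    return importlib.util.find_spec(module_name) is not None


for name in ("nemo_safe_synthesizer", "datasets"):
    status = "found" if is_importable(name) else "MISSING"
    print(f"{name}: {status}")
```

If either package is reported as missing, rerun the install cell without `%%capture` to see the error output.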
Set the inference API key for PII column classification
NeMo Safe Synthesizer uses an LLM-based column classifier to automatically infer PII columns. To enable this feature, set NSS_INFERENCE_KEY. The inference endpoint defaults to the NVIDIA integrate URL, and you can obtain an API key from build.nvidia.com. Setting this value is optional but strongly recommended.
import os
import getpass
# Setting NSS_INFERENCE_KEY is optional but strongly recommended for PII replacement.
if "NSS_INFERENCE_KEY" not in os.environ:
    os.environ["NSS_INFERENCE_KEY"] = getpass.getpass("Paste inference API key (or press Enter to skip): ")
if os.environ.get("NSS_INFERENCE_KEY"):
    print("NSS_INFERENCE_KEY is set")
else:
    print(
        "NSS_INFERENCE_KEY is not set. Replace PII will run in degraded mode. "
        "We strongly recommend setting a key."
    )
Load and preview sample dataset
Load a tabular dataset (in this example, the clinc_oos dataset from Hugging Face) and preview the first few rows. NeMo Safe Synthesizer will use this DataFrame as its training data.
This dataset includes a text column and a categorical intent label, both of which are field types supported by NeMo Safe Synthesizer.
Each user is responsible for checking the content of datasets and the applicable licenses, and for determining whether they are suitable for the intended use.
from datasets import load_dataset
dataset = load_dataset("clinc/clinc_oos", "small")
df = dataset["train"].to_pandas()
df.head()
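Beyond `df.head()`, it can help to see at a glance which columns look like free text and which look categorical. The sketch below is one way to do that with pandas; `summarize_columns` is a hypothetical helper, and the toy frame (including its column names and values) is illustrative only, not taken from the real dataset.

```python
import pandas as pd


def summarize_columns(frame):
    """Return one row per column: dtype, missing-value count, and number of distinct values."""
    return pd.DataFrame(
        {
            "dtype": frame.dtypes.astype(str),
            "missing": frame.isna().sum(),
            "distinct": frame.nunique(),
        }
    )


# Toy stand-in for the real data; column names and values are illustrative only.
toy = pd.DataFrame(
    {
        "text": ["what is my balance", "book a flight", "transfer money", "check my balance"],
        "intent": [0, 1, 2, 0],
    }
)
summary = summarize_columns(toy)
print(summary)
```

Calling `summarize_columns(df)` on the tutorial DataFrame gives the same view for the real columns. A column with few distinct values relative to the row count is likely categorical, while one with mostly unique values is likely free text.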
Create and run Safe Synthesizer job
Create the Safe Synthesizer builder and attach your DataFrame. Run the pipeline with run(), which performs data processing, PII replacement, training, generation, and evaluation in a single call. Results are available on builder.results.
Refer to the configuration docs for the full list of options.
from nemo_safe_synthesizer.sdk.library_builder import SafeSynthesizer
builder = SafeSynthesizer().with_data_source(df) # .with_replace_pii(enable=False) to disable PII replacement
builder.run()
results = builder.results
Retrieve synthetic data
Inspect the generated synthetic data, including the row count and a preview of the first rows.
synth = results.synthetic_data
print(f"Number of synthetic rows: {len(synth)}")
synth.head()
# Synthetic data and evaluation report are automatically saved to the artifacts directory
print(f"Artifacts automatically saved to: {builder._workdir.generate.path}")
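Before opening the full evaluation report, an informal spot check is to compare how often each category appears in the real and synthetic data. The sketch below computes the total variation distance between two lists of labels using only the standard library. It is a hand-rolled illustration, not the metric that NeMo Safe Synthesizer reports, and the label lists are hypothetical.

```python
from collections import Counter


def total_variation_distance(real_values, synthetic_values):
    """Half the sum of absolute differences between the two relative-frequency tables.

    0.0 means identical category proportions; 1.0 means no shared categories.
    """
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    real_total = sum(real_counts.values())
    synth_total = sum(synth_counts.values())
    if real_total == 0 or synth_total == 0:
        raise ValueError("both inputs must contain at least one value")

    # Counter returns 0 for categories that appear in only one of the inputs.
    categories = set(real_counts) | set(synth_counts)
    return 0.5 * sum(
        abs(real_counts[c] / real_total - synth_counts[c] / synth_total)
        for c in categories
    )


# Hypothetical label lists, for illustration only.
real_labels = ["a", "a", "b", "c"]
synth_labels = ["a", "b", "b", "c"]
print(total_variation_distance(real_labels, synth_labels))
```

The function accepts any iterable of hashable values, so it should also work on pandas columns, for example `total_variation_distance(df["intent"], synth["intent"])`, assuming the label column is named `intent` in both frames.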
Review evaluation report
The pipeline computes both quality and privacy metrics. The summary includes timing information and overall scores, while the full evaluation report is rendered as an HTML document.
import json
print("Summary (timing and scores):")
print(json.dumps(results.summary.model_dump(), indent=2))
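If the nested JSON is hard to scan, the dictionary returned by `model_dump()` can be flattened into one line per value. The sketch below is a generic helper, not part of the library; `flatten` is a hypothetical name, and the example dictionary is a made-up shape, so the real field names may differ.

```python
def flatten(nested, prefix=""):
    """Flatten nested dicts into a single dict keyed by dotted paths."""
    flat = {}
    for key, value in nested.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse into sub-dictionaries, extending the dotted path.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


# Hypothetical summary shape, for illustration only; real field names may differ.
example_summary = {
    "timing": {"train_sec": 120.5, "generate_sec": 30.0},
    "scores": {"quality": 8.1},
}
for path, value in flatten(example_summary).items():
    print(f"{path}: {value}")
```

In the notebook, the same loop can be run over `flatten(results.summary.model_dump())`.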
# View the evaluation report in a sandboxed iframe
import base64
from IPython.display import IFrame, display
report_html = results.evaluation_report_html
if report_html:
    data_url = "data:text/html;base64," + base64.b64encode(report_html.encode()).decode()
    display(IFrame(src=data_url, width="100%", height=800))
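The cell above embeds the report by packing the HTML into a base64 data URL, so the iframe needs no file on disk and no server. The sketch below isolates that encoding step and confirms it round-trips. The helper name `html_to_data_url` is hypothetical and the sample HTML is a placeholder, not a real report.

```python
import base64


def html_to_data_url(html):
    """Encode an HTML string as a base64 data URL suitable for an iframe src."""
    encoded = base64.b64encode(html.encode("utf-8")).decode("ascii")
    return "data:text/html;base64," + encoded


sample_html = "<html><body><h1>Report</h1></body></html>"  # placeholder, not a real report
sample_url = html_to_data_url(sample_html)

# Everything after the first comma is the base64 payload; decoding it recovers the HTML.
payload = sample_url.split(",", 1)[1]
decoded = base64.b64decode(payload).decode("utf-8")
print(decoded == sample_html)
```

Base64 inflates the payload by roughly a third, so for very large reports it may be more practical to open the HTML file saved in the artifacts directory.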