NeMo Safe Synthesizer Tutorial: Differential Privacy
Learn how to apply differential privacy to achieve strong privacy with mathematical guarantees. This tutorial demonstrates how to configure differential privacy parameters for optimal results. The runtime of this notebook is about 1 hour on an A100.
If you have not yet completed the Safe Synthesizer 101 tutorial, consider starting there first.
Prerequisites
This notebook requires a Linux machine with an NVIDIA GPU (H100 recommended, A100 minimum) and CUDA 12.8+. It will not run on macOS, Windows, or Apple Silicon.
Install Safe Synthesizer
Run the cell below to install NeMo Safe Synthesizer (engine and CUDA 12.8) and kagglehub for the example dataset.
%%capture
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
!uv pip install "nemo-safe-synthesizer[engine,cu128]" --index https://flashinfer.ai/whl/cu128 --index https://download.pytorch.org/whl/cu128 --index-strategy unsafe-best-match
!uv pip install kagglehub
Set the inference API key for PII column classification
NeMo Safe Synthesizer uses an LLM-based column classifier to automatically infer which columns contain PII. To enable this feature, set NSS_INFERENCE_KEY. By default, the inference endpoint is https://integrate.api.nvidia.com/v1 (the NVIDIA integrate URL). You can obtain an API key from build.nvidia.com. Setting this value is optional but strongly recommended.
import os
import getpass
# Setting NSS_INFERENCE_KEY is optional but strongly recommended for PII replacement.
if "NSS_INFERENCE_KEY" not in os.environ:
    os.environ["NSS_INFERENCE_KEY"] = getpass.getpass("Paste inference API key (or press Enter to skip): ")

if os.environ.get("NSS_INFERENCE_KEY"):
    print("NSS_INFERENCE_KEY is set")
else:
    print(
        "NSS_INFERENCE_KEY is not set. Replace PII will run in degraded mode. "
        "We strongly recommend setting a key."
    )
Load and preview sample dataset
Load a tabular dataset (in this example, the US Accidents dataset from Kaggle) and preview the first few rows. NeMo Safe Synthesizer will use a subset to keep runtime manageable.
This dataset includes text, categorical, and numeric fields, all of which are supported by Safe Synthesizer.
The code below also computes a recommended delta for differential privacy. Delta should reflect the full dataset size, not the subset, because it bounds the probability of a privacy breach across the entire population. See Differential Privacy for parameter guidance.
Dataset citations:
- Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. "A Countrywide Traffic Accident Dataset.", 2019.
- Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. "Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights." In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.
Each user is responsible for checking the content of the dataset and the applicable licenses and determining if it is suitable for the intended use.
import pandas as pd
import kagglehub
path = kagglehub.dataset_download("sobhanmoosavi/us-accidents")
print("Path to dataset files:", path)
df = pd.read_csv(f"{path}/US_Accidents_March23.csv", index_col=0)
full_data_size = len(df)
recommended_delta = 1 / (full_data_size ** 2) # delta should reflect the full dataset, even when a subset is used as Safe Synthesizer input
print(f"Full dataset size: {len(df)} records")
print(f"Recommended delta: {recommended_delta:.2e}")
# use a subset as Safe Synthesizer input for faster runtime
df = df.sample(n=26250, random_state=318)
print(f"Input dataset size: {len(df)} records")
df.head()
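As noted above, delta should be computed from the full dataset size even when only a subset is fed to Safe Synthesizer. A quick sketch with illustrative numbers (the actual full size is printed by the cell above; ~7.7M is an approximation used here only for illustration) shows how much looser the guarantee becomes if the subset size is used by mistake:

```python
# Illustrative sizes only: roughly 7.7M rows in the full dataset,
# 26,250 rows in the subset sampled above.
full_n = 7_700_000
subset_n = 26_250

delta_full = 1 / full_n**2      # recommended: delta reflects the full population
delta_subset = 1 / subset_n**2  # mistake: computed from the subset, far looser

print(f"delta (full dataset): {delta_full:.2e}")
print(f"delta (subset only):  {delta_subset:.2e}")
print(f"subset-based delta is {delta_subset / delta_full:.0f}x looser")
```

Because delta bounds the probability of a privacy failure for any individual in the underlying population, scaling it to the subset would overstate the protection offered to the records that were not sampled.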
Create and run Safe Synthesizer job
Create the Safe Synthesizer builder and attach your DataFrame. Enable differential privacy and configure the training and generation stages for optimal performance with DP.
Run the pipeline with run(), which performs data processing, PII replacement, training, generation, evaluation, and saving of results in a single call. Results are available on builder.results.
from nemo_safe_synthesizer.sdk.library_builder import SafeSynthesizer
builder = (
SafeSynthesizer()
.with_data_source(df) # .with_replace_pii(enable=False) to disable PII replacement
.with_differential_privacy(dp_enabled=True, delta=recommended_delta)
.with_train(batch_size=16) # Override the default batch size of 1, which is designed for non-DP training
.with_generate(use_structured_generation=True) # Improves the percentage of valid records when DP is enabled
)
builder.run()
results = builder.results
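The batch-size override above matters because differentially private training typically uses DP-SGD: each example's gradient is clipped to a fixed norm, Gaussian noise scaled to that clipping bound is added to the sum, and the result is averaged over the batch, so larger batches dilute the noise per update. The NumPy sketch below illustrates only the mechanics of one such step; it is not Safe Synthesizer's internal implementation:

```python
import numpy as np

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """One DP-SGD aggregation step: clip each per-example gradient,
    sum, add Gaussian noise scaled to the clipping bound, and average."""
    rng = rng if rng is not None else np.random.default_rng(0)
    clipped = [
        g * min(1.0, clip_norm / max(np.linalg.norm(g), 1e-12))
        for g in per_example_grads
    ]
    total = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=total.shape)
    return (total + noise) / len(per_example_grads)

# The noise scale is fixed by clip_norm and noise_multiplier, so averaging
# over a larger batch shrinks its influence on the final update.
grads = [np.ones(4) * 10.0 for _ in range(16)]
update = dp_sgd_step(grads)
print(update)
```

This is also why a batch size of 1 (the non-DP default) performs poorly under DP: with a single example, the full noise vector lands on every update.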
Retrieve synthetic data
Inspect the generated synthetic data, including the row count and a preview of the first rows.
synth = results.synthetic_data
print(f"Number of synthetic rows: {len(synth)}")
synth.head()
# Synthetic data and evaluation report are automatically saved to the artifacts directory
print(f"Artifacts automatically saved to: {builder._workdir.generate.path}")
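Beyond the built-in evaluation report, a quick ad-hoc sanity check can compare summary statistics of a numeric column between the original and synthetic data. The sketch below uses toy stand-in Series so it is self-contained; in the notebook you would pass real columns, e.g. df["Temperature(F)"] and synth["Temperature(F)"] (column name assumed from the US Accidents schema):

```python
import pandas as pd

def compare_numeric(real: pd.Series, synthetic: pd.Series) -> pd.DataFrame:
    """Side-by-side summary statistics for one numeric column."""
    return pd.DataFrame({"real": real.describe(), "synthetic": synthetic.describe()})

# Toy stand-ins; substitute with matching columns from df and synth.
real = pd.Series([61.0, 58.5, 72.3, 66.1, 70.0])
fake = pd.Series([60.2, 59.9, 71.0, 64.8, 69.5])
print(compare_numeric(real, fake))
```

Expect the marginals to track each other but not match exactly; under differential privacy, some gap between real and synthetic statistics is the intended price of the guarantee.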
Review evaluation report
The pipeline computes both quality and privacy metrics. The summary includes timing information and overall scores, while the full evaluation report is rendered as an HTML document.
import json
print("Summary (timing and scores):")
print(json.dumps(results.summary.model_dump(), indent=2))
# View the evaluation report in a sandboxed iframe
import base64
from IPython.display import IFrame, display
report_html = results.evaluation_report_html
if report_html:
    data_url = "data:text/html;base64," + base64.b64encode(report_html.encode()).decode()
    display(IFrame(src=data_url, width="100%", height=800))
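If you prefer to open the report outside the notebook, the HTML string can simply be written to disk. A small helper sketch (the filename is arbitrary; in the notebook you would pass results.evaluation_report_html rather than the demo string used here):

```python
from pathlib import Path

def save_report(html: str, path: str = "evaluation_report.html") -> Path:
    """Write the report HTML to disk and return the output path."""
    out = Path(path)
    out.write_text(html, encoding="utf-8")
    return out

# In the notebook: save_report(results.evaluation_report_html)
saved = save_report("<html><body>demo report</body></html>", "demo_report.html")
print(f"Saved to {saved.resolve()}")
```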