NeMo Safe Synthesizer Tutorial: Time-Series Financial Transactions¶
What you'll learn¶
In this notebook, we'll use NeMo Safe Synthesizer to generate synthetic financial transaction time series. The workflow covers loading an account-level transaction dataset, configuring the synthesizer for grouped temporal data, generating synthetic transaction histories, and checking whether the synthetic data preserves both privacy and transaction patterns.
We will answer two questions:
- Does the synthetic data avoid direct memorization of transaction rows?
- Does it preserve the data structure and behavioral patterns that matter for financial analytics?
A full run takes about 20 minutes on an A100. If you have not yet completed the Safe Synthesizer 101 tutorial, consider starting there first.
Prerequisites¶
This notebook requires:
- A Linux machine with an NVIDIA GPU (H100 recommended, A100 minimum) and CUDA 12.9+. It will not run on macOS, Windows, or Apple Silicon.
- The docs/tutorials/datasets/financial_transactions_DD_0526.csv dataset.
- An optional NSS_INFERENCE_KEY for PII column classification.
Install Safe Synthesizer¶
Run the cell below to install NeMo Safe Synthesizer with the engine and CUDA 12.9 extras.
If NeMo Safe Synthesizer is already installed in your notebook environment, you can skip this cell.
%%bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
if command -v uv > /dev/null 2>&1; then
    uv pip install "nemo-safe-synthesizer[engine,cu129]" --index https://flashinfer.ai/whl/cu129 --index https://download.pytorch.org/whl/cu129 --index https://wheels.vllm.ai/88d34c6409e9fb3c7b8ca0c04756f061d2099eb1/cu129 --index-strategy unsafe-best-match
else
    pip install "nemo-safe-synthesizer[engine,cu129]" --extra-index-url https://flashinfer.ai/whl/cu129 --extra-index-url https://download.pytorch.org/whl/cu129 --extra-index-url https://wheels.vllm.ai/88d34c6409e9fb3c7b8ca0c04756f061d2099eb1/cu129
fi
Set the Inference API Key for PII Column Classification¶
This notebook enables PII replacement through the SDK builder. NeMo Safe Synthesizer can use an LLM-based column classifier to infer PII columns; set NSS_INFERENCE_KEY to enable that classifier. You can obtain an API key from NVIDIA Build. You can press Enter to skip, but PII replacement may then run in degraded mode, so set a key if you want a privacy-focused demo.
import getpass
import os
if "NSS_INFERENCE_KEY" not in os.environ:
    os.environ["NSS_INFERENCE_KEY"] = getpass.getpass("Paste inference API key (or press Enter to skip): ")
if os.environ.get("NSS_INFERENCE_KEY"):
    print("NSS_INFERENCE_KEY is set")
else:
    print(
        "NSS_INFERENCE_KEY is not set. PII replacement may run in degraded mode. "
        "For a privacy-focused demo, set a key and rerun this cell."
    )
from __future__ import annotations
import base64
import hashlib
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap
from IPython.display import IFrame, Markdown, display
NVIDIA_GREEN = "#76B900"
NVIDIA_DARK = "#111111"
NVIDIA_GRAY = "#7A7A7A"
NVIDIA_HEATMAP = LinearSegmentedColormap.from_list(
    "nvidia_heatmap",
    ["#F7F7F7", "#D7F2B0", NVIDIA_GREEN, "#254A00"],
)
plt.rcParams.update({
    "figure.figsize": (11, 5),
    "axes.grid": True,
    "grid.alpha": 0.25,
    "axes.spines.top": False,
    "axes.spines.right": False,
    "axes.prop_cycle": plt.cycler(color=[NVIDIA_GREEN, NVIDIA_DARK, NVIDIA_GRAY]),
})
pd.set_option("display.max_columns", 80)
pd.set_option("display.width", 160)
Load the Financial Transactions Dataset¶
Load financial_transactions_DD_0526.csv and confirm the expected row and column counts. The dataset is already in the long format that NeMo Safe Synthesizer expects for time-series synthesis: rows sharing the same acct_id form one transaction sequence, and txn_index provides the ordering within each account.
ORIGINAL_CSV = "https://raw.githubusercontent.com/NVIDIA-NeMo/Safe-Synthesizer/main/docs/tutorials/datasets/financial_transactions_DD_0526.csv"
ARTIFACT_ROOT = "safe-synthesizer-artifacts"
source_df = pd.read_csv(ORIGINAL_CSV)
pd.DataFrame(
    [{
        "input": "Original CSV",
        "path": ORIGINAL_CSV,
        "rows": len(source_df),
        "columns": len(source_df.columns),
    }]
)
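The long-format assumptions described above (one sequence per acct_id, ordered by txn_index) can be checked before training. The sketch below runs on a small made-up DataFrame that reuses the notebook's column names; the helper check_long_format is a hypothetical illustration and is not part of NeMo Safe Synthesizer.

```python
import pandas as pd


def check_long_format(df: pd.DataFrame, group_col: str, order_col: str) -> dict:
    """Summarize whether each group forms a clean, ordered sequence."""
    # No (group, order) pair should appear more than once.
    duplicate_keys = int(df.duplicated(subset=[group_col, order_col]).sum())
    # Within each group, rows should already appear in non-decreasing order.
    ordered = df.groupby(group_col)[order_col].apply(lambda s: s.is_monotonic_increasing)
    return {
        "groups": int(df[group_col].nunique()),
        "duplicate_keys": duplicate_keys,
        "unordered_groups": int((~ordered).sum()),
    }


# Hypothetical toy ledger; in the notebook you would pass source_df instead.
toy = pd.DataFrame({
    "acct_id": ["A1", "A1", "A1", "B2", "B2"],
    "txn_index": [0, 1, 2, 1, 0],  # B2 is deliberately out of order
    "txn_amount": [12.5, 40.0, 8.75, 100.0, 3.2],
})
report = check_long_format(toy, "acct_id", "txn_index")
print(report)
```

Unordered groups are not necessarily a problem when an ordering column is passed to the synthesizer, but duplicate keys usually mean the sequence index is ambiguous and worth investigating.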
Preview the Original Dataset¶
The source data is a transaction ledger: repeated account-level fields, a chronological transaction index, merchant category/name, timestamp, and amount. Direct identifiers such as acct_id and cardholder are useful for demonstrating privacy checks, but should be treated carefully in production synthetic-data workflows.
display(source_df.head(5))
display(pd.DataFrame({
    "column": source_df.columns,
    "dtype": source_df.dtypes.astype(str).values,
    "non_null": source_df.notna().sum().values,
    "unique_values": source_df.nunique(dropna=True).values,
}))
Configure and Run Safe Synthesizer¶
Create the Safe Synthesizer builder and configure it for financial transaction time-series synthesis:
- with_time_series enables time-series mode and uses txn_index as the sequence timestamp.
- with_data tells the synthesizer that rows sharing the same acct_id belong to one account history, ordered by txn_index.
- with_replace_pii enables PII replacement, while preserving two-letter state abbreviations as categorical values.
- with_train sets demo-specific training hyperparameters for the HuggingFaceTB/SmolLM3-3B base model.
In time-series mode, generation follows the learned group and timestamp structure rather than treating num_records as a row-count target.
Refer to the configuration docs for the full list of options.
from nemo_safe_synthesizer.config.replace_pii import PiiReplacerConfig
from nemo_safe_synthesizer.sdk.library_builder import SafeSynthesizer
pii_config = PiiReplacerConfig.get_default_config()
pii_config.globals.classify.entities = [
    entity for entity in pii_config.globals.classify.entities if entity != "state"
]
if pii_config.globals.ner.ner_entities is not None:
    pii_config.globals.ner.ner_entities = [
        entity for entity in pii_config.globals.ner.ner_entities if entity != "state"
    ]
builder = (
    SafeSynthesizer(save_path=ARTIFACT_ROOT)
    .with_data_source(source_df)
    .with_time_series(
        is_timeseries=True,
        timestamp_column="txn_index",
    )
    .with_data(
        holdout=0,
        group_training_examples_by="acct_id",
        order_training_examples_by="txn_index",
    )
    .with_replace_pii(config=pii_config)
    .with_train(
        pretrained_model="HuggingFaceTB/SmolLM3-3B",
        num_input_records_to_sample=60_000,
        learning_rate=5.0e-4,
        lora_r=32,
    )
)
builder.run()
results = builder.results
RUN_PATH = Path(builder._workdir.generate.path).parent
pd.DataFrame(
    [{
        "artifact": "Safe Synthesizer run path",
        "path": str(RUN_PATH),
        "synthetic_rows": len(results.synthetic_data),
    }]
)
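Because time-series generation follows the learned group structure rather than a row-count target, the number of rows per account is often more informative than the total row count. The following is a minimal sketch of that comparison on made-up data; sequence_length_summary is a hypothetical helper, and the toy frames stand in for source_df and results.synthetic_data.

```python
import pandas as pd


def sequence_length_summary(df: pd.DataFrame, group_col: str = "acct_id") -> pd.Series:
    """Describe how many rows each group (account) contributes."""
    lengths = df.groupby(group_col).size()
    return lengths.describe()[["count", "min", "50%", "max"]]


# Hypothetical stand-ins for the original and synthetic ledgers.
toy_original = pd.DataFrame({"acct_id": ["A"] * 4 + ["B"] * 6})
toy_synthetic = pd.DataFrame({"acct_id": ["X"] * 5 + ["Y"] * 5 + ["Z"] * 3})

comparison = pd.DataFrame({
    "Original": sequence_length_summary(toy_original),
    "Synthetic": sequence_length_summary(toy_synthetic),
})
print(comparison)
```

Here the count row is the number of accounts, and the remaining rows describe sequence lengths, so a large gap in the median or maximum would suggest the synthetic histories are systematically shorter or longer than the originals.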
Review the Built-In Evaluation Report¶
Safe Synthesizer generates a built-in evaluation report as part of builder.run(). Review this first for the standard quality and privacy assessment before moving into the transaction-specific analysis below.
report_html = results.evaluation_report_html
if report_html:
    data_url = "data:text/html;base64," + base64.b64encode(report_html.encode("utf-8")).decode("ascii")
    display(IFrame(src=data_url, width="100%", height=800))
else:
    display(Markdown("Safe Synthesizer did not return an HTML evaluation report for this run."))
Retrieve Synthetic Data¶
After the run completes, Safe Synthesizer exposes the generated dataset on builder.results. The rest of the notebook uses this in-memory result directly.
synthetic_output = results.synthetic_data.copy()
print(f"Number of synthetic rows: {len(synthetic_output):,}")
display(synthetic_output.head())
print(f"Artifacts saved to: {RUN_PATH}")
Prepare Data for Analysis¶
Safe Synthesizer generated a time-series dataset. The result can include padded continuation rows where the account-level context is present but transaction detail fields are empty. We keep the raw DataFrames for schema and missingness checks, then create transaction-valid views for behavioral analysis.
def prepare_transactions(df: pd.DataFrame, source: str) -> pd.DataFrame:
    prepared = df.copy()
    prepared["source"] = source
    prepared["txn_amount"] = pd.to_numeric(prepared["txn_amount"], errors="coerce")
    prepared["txn_index"] = pd.to_numeric(prepared["txn_index"], errors="coerce")
    prepared["timestamp_dt"] = pd.to_datetime(prepared["timestamp"], errors="coerce")
    prepared["hour"] = prepared["timestamp_dt"].dt.hour
    prepared["day"] = prepared["timestamp_dt"].dt.day
    return prepared
original_raw = prepare_transactions(source_df, "Original")
synthetic_raw = prepare_transactions(synthetic_output, "Synthetic")
transaction_detail_cols = ["timestamp", "merchant_cat", "merchant", "txn_amount"]
original = original_raw.dropna(subset=transaction_detail_cols).copy()
synthetic = synthetic_raw.dropna(subset=transaction_detail_cols).copy()
combined = pd.concat([original, synthetic], ignore_index=True)
raw_combined = pd.concat([original_raw, synthetic_raw], ignore_index=True)
summary = pd.DataFrame([
    {
        "source": name,
        "valid_transaction_rows": len(clean),
        "unique_accounts": raw["acct_id"].nunique(),
        "valid_rows_per_account_median": clean.groupby("acct_id").size().median(),
    }
    for name, raw, clean in [
        ("Original", original_raw, original),
        ("Synthetic", synthetic_raw, synthetic),
    ]
])
summary
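To make the idea of padded continuation rows concrete, here is a small sketch on made-up data. It reuses the notebook's detail column names, but the values are invented for illustration. It shows the two views side by side: per-column missingness on the raw frame, and the transaction-valid subset used for behavioral analysis.

```python
import numpy as np
import pandas as pd

# Column names mirror the notebook; the values are made up for illustration.
detail_cols = ["timestamp", "merchant_cat", "merchant", "txn_amount"]
toy_raw = pd.DataFrame({
    "acct_id": ["A1", "A1", "A1", "B2"],
    "txn_index": [0, 1, 2, 0],
    "timestamp": ["2024-01-01 09:00", "2024-01-02 18:30", None, "2024-01-01 12:00"],
    "merchant_cat": ["grocery", "dining", None, "gas"],
    "merchant": ["Shop A", "Cafe B", None, "Station C"],
    "txn_amount": [23.1, 41.0, np.nan, 35.5],
})

# Raw view: share of missing values per detail column.
missing_share = toy_raw[detail_cols].isna().mean()

# Transaction-valid view: keep rows where every detail field is present.
valid_mask = toy_raw[detail_cols].notna().all(axis=1)
toy_valid = toy_raw[valid_mask]

print(missing_share)
print(f"valid rows: {len(toy_valid)} of {len(toy_raw)}")
```

The third row carries account context (acct_id, txn_index) but no transaction detail, which is the pattern that the dropna step above filters out.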
CATEGORY_ORDER = [
    "grocery", "dining", "retail", "e-commerce", "subscription", "entertainment",
    "gas", "healthcare", "utilities", "automotive", "ATM", "travel",
    "wire_transfer", "other",
]

def pct_by_category(df: pd.DataFrame) -> pd.Series:
    return df["merchant_cat"].value_counts(normalize=True).reindex(CATEGORY_ORDER, fill_value=0)

def row_hashes(df: pd.DataFrame, columns: list[str]) -> set[str]:
    normalized = df[columns].fillna("<NA>").astype(str)
    return {
        hashlib.sha256("||".join(row).encode("utf-8")).hexdigest()
        for row in normalized.to_numpy()
    }

def plot_grouped_bars(frame: pd.DataFrame, title: str, ylabel: str, *, ax=None):
    ax = frame.plot(kind="bar", ax=ax, width=0.82)
    ax.set_title(title)
    ax.set_ylabel(ylabel)
    ax.set_xlabel("")
    ax.tick_params(axis="x", rotation=45)
    ax.legend(frameon=False)
    return ax

def valid_rate(raw: pd.DataFrame) -> float:
    return raw[transaction_detail_cols].notna().all(axis=1).mean()
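One way to approach the memorization question from the introduction is an exact-match check: hash each row over a chosen set of columns and count how many synthetic rows also appear in the original. The sketch below is a self-contained illustration on made-up data. The helper row_fingerprints is a hypothetical variant of the row-hashing idea above, and the choice of comparison columns is an assumption you would adapt to your own data.

```python
import hashlib

import pandas as pd


def row_fingerprints(df: pd.DataFrame, columns: list[str]) -> set[str]:
    """Hash each row's selected columns so rows can be compared as sets."""
    # Cast to pandas strings first so missing values can be filled uniformly.
    normalized = df[columns].astype("string").fillna("<NA>")
    return {
        hashlib.sha256("||".join(row).encode("utf-8")).hexdigest()
        for row in normalized.to_numpy()
    }


# Made-up stand-ins for the original and synthetic transaction-valid frames.
compare_cols = ["merchant_cat", "merchant", "txn_amount"]
toy_original = pd.DataFrame({
    "merchant_cat": ["grocery", "dining", "gas"],
    "merchant": ["Shop A", "Cafe B", "Station C"],
    "txn_amount": [23.1, 41.0, 35.5],
})
toy_synthetic = pd.DataFrame({
    "merchant_cat": ["grocery", "dining", "travel", "gas"],
    "merchant": ["Shop A", "Cafe D", "Air E", "Station C"],
    "txn_amount": [23.1, 41.0, 310.0, 36.0],
})

original_hashes = row_fingerprints(toy_original, compare_cols)
synthetic_hashes = row_fingerprints(toy_synthetic, compare_cols)
exact_copies = original_hashes & synthetic_hashes
copy_rate = len(exact_copies) / len(synthetic_hashes)
print(f"exact row matches: {len(exact_copies)} ({copy_rate:.0%} of distinct synthetic rows)")
```

An exact-match check only detects verbatim copies over the selected columns. It does not detect near-duplicates, and including columns that PII replacement has already rewritten would make overlap trivially zero, so the column choice deserves some thought.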
Visual Comparison: Category Mix¶
A useful synthetic transaction dataset should preserve the category base rates: grocery and dining should be common, wire transfers should remain rare, and the middle-frequency categories should stay in the same neighborhood.
category_mix = pd.DataFrame({
    "Original": pct_by_category(original) * 100,
    "Synthetic": pct_by_category(synthetic) * 100,
})
category_delta = category_mix.assign(
    synthetic_minus_original_pp=category_mix["Synthetic"] - category_mix["Original"],
)
fig, ax = plt.subplots(figsize=(13, 5))
plot_grouped_bars(category_mix, "Merchant Category Distribution", "Share of valid transactions (%)", ax=ax)
plt.tight_layout()
plt.show()
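A single number can complement the bar chart. The sketch below computes the total variation distance between two category distributions, which is a metric chosen here for illustration and not something the notebook or Safe Synthesizer prescribes. The data is made up, and total_variation_distance is a hypothetical helper.

```python
import pandas as pd


def total_variation_distance(p: pd.Series, q: pd.Series) -> float:
    """Half the L1 distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    # Align both distributions on the union of categories; absent ones count as 0.
    categories = p.index.union(q.index)
    p = p.reindex(categories, fill_value=0.0)
    q = q.reindex(categories, fill_value=0.0)
    return float((p - q).abs().sum() / 2)


# Made-up category labels standing in for the merchant_cat column.
toy_original = pd.Series(["grocery"] * 5 + ["dining"] * 3 + ["travel"] * 2)
toy_synthetic = pd.Series(["grocery"] * 4 + ["dining"] * 4 + ["gas"] * 2)

tvd = total_variation_distance(
    toy_original.value_counts(normalize=True),
    toy_synthetic.value_counts(normalize=True),
)
print(f"total variation distance: {tvd:.2f}")
```

The value can be read as the share of probability mass that would have to move between categories to make the two mixes identical, so it is sensitive to rare categories disappearing as well as to common ones drifting.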
Visual Comparison: Amount Distributions¶
Financial utility often depends on tails and conditional distributions, not just averages. The plots below use a log-scaled amount axis so daily spend, healthcare/procedure amounts, travel, and wire transfers can be inspected together.
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
bins = np.logspace(np.log10(max(1, combined["txn_amount"].min())), np.log10(combined["txn_amount"].max()), 60)
for source, frame in combined.groupby("source"):
    axes[0].hist(frame["txn_amount"], bins=bins, density=True, histtype="step", linewidth=2, label=source)
axes[0].set_xscale("log")
axes[0].set_title("Overall Transaction Amount Distribution")
axes[0].set_xlabel("transaction amount, log scale")
axes[0].set_ylabel("density")
axes[0].legend(frameon=False)
amount_summary = (
    combined.groupby(["source", "merchant_cat"])["txn_amount"]
    .median()
    .unstack(0)
    .reindex(CATEGORY_ORDER)
)
amount_summary.plot(kind="bar", ax=axes[1], width=0.82)
axes[1].set_title("Median Amount by Merchant Category")
axes[1].set_yscale("log")
axes[1].set_ylabel("median transaction amount, log scale")
axes[1].set_xlabel("")
axes[1].tick_params(axis="x", rotation=45)
axes[1].legend(frameon=False)
plt.tight_layout()
plt.show()
display(
    combined.groupby("source")["txn_amount"]
    .describe(percentiles=[0.5, 0.9, 0.95, 0.99])
    .round(2)
)
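To put a number on how far two amount distributions are apart, one option is the two-sample Kolmogorov-Smirnov statistic: the largest vertical gap between the two empirical CDFs. This is an illustrative choice, not a metric the notebook itself uses. The sketch below implements it with NumPy only and runs on simulated log-normal amounts; ks_statistic is a hypothetical helper.

```python
import numpy as np


def ks_statistic(a, b) -> float:
    """Largest gap between two empirical CDFs (two-sample Kolmogorov-Smirnov statistic)."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    # Evaluate both empirical CDFs at every observed value.
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))


# Simulated amounts: one sample from the same distribution, one shifted upward.
rng = np.random.default_rng(0)
toy_original = rng.lognormal(mean=3.0, sigma=1.0, size=2_000)
toy_close = rng.lognormal(mean=3.0, sigma=1.0, size=2_000)
toy_shifted = rng.lognormal(mean=3.5, sigma=1.0, size=2_000)

ks_close = ks_statistic(toy_original, toy_close)
ks_shifted = ks_statistic(toy_original, toy_shifted)
print(f"same distribution:    {ks_close:.3f}")
print(f"shifted distribution: {ks_shifted:.3f}")
```

The statistic depends only on the ordering of values, so it gives the same result on raw and log-transformed amounts. It is driven by the bulk of the distribution, so the percentile table above remains the better tool for inspecting tails.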
Visual Comparison: Time-of-Day Patterns¶
The source data contains strong temporal priors: dining peaks later than healthcare, entertainment is mostly evening, and subscriptions can occur overnight. The heatmaps below compare whether those category-specific temporal signatures survived synthesis.
def hour_category_matrix(df: pd.DataFrame) -> pd.DataFrame:
    mat = (
        df.pivot_table(index="merchant_cat", columns="hour", values="txn_amount", aggfunc="size", fill_value=0)
        .reindex(index=CATEGORY_ORDER, columns=range(24), fill_value=0)
    )
    return mat.div(mat.sum(axis=1).replace(0, np.nan), axis=0).fillna(0)
orig_hour = hour_category_matrix(original)
synth_hour = hour_category_matrix(synthetic)
fig, axes = plt.subplots(1, 2, figsize=(15, 7), sharey=True)
for ax, mat, title in [(axes[0], orig_hour, "Original"), (axes[1], synth_hour, "Synthetic")]:
    im = ax.imshow(mat.values, aspect="auto", cmap=NVIDIA_HEATMAP, vmin=0, vmax=max(orig_hour.max().max(), synth_hour.max().max()))
    ax.set_title(f"{title}: Category Hour-of-Day Profile")
    ax.set_xlabel("hour of day")
    ax.set_xticks(range(0, 24, 2))
    ax.set_yticks(range(len(CATEGORY_ORDER)))
    ax.set_yticklabels(CATEGORY_ORDER)
fig.colorbar(im, ax=axes, shrink=0.8, label="within-category share")
plt.show()
hour_means = combined.groupby(["source", "merchant_cat"])["hour"].mean().unstack(0).reindex(CATEGORY_ORDER)
hour_means["synthetic_minus_original"] = hour_means["Synthetic"] - hour_means["Original"]
display(hour_means.round(2))
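An arithmetic mean of the hour can mislead for categories that straddle midnight, such as overnight subscriptions: charges at 23:00 and 01:00 average to noon. A circular mean treats the hours as points on a 24-hour clock instead. The following is a minimal sketch of that alternative on made-up hours; circular_mean_hour and circular_hour_gap are hypothetical helpers, not part of the notebook.

```python
import numpy as np


def circular_mean_hour(hours) -> float:
    """Mean hour on a 24-hour clock, so 23:00 and 01:00 average to midnight, not noon."""
    # Map each hour to an angle, average the unit vectors, then map back to hours.
    angles = np.asarray(hours, dtype=float) * 2 * np.pi / 24
    mean_angle = np.arctan2(np.sin(angles).mean(), np.cos(angles).mean())
    return float((mean_angle * 24 / (2 * np.pi)) % 24)


def circular_hour_gap(h1: float, h2: float) -> float:
    """Shortest distance between two hours on the clock, in hours (0 to 12)."""
    diff = abs(h1 - h2) % 24
    return min(diff, 24 - diff)


overnight = [23, 0, 1, 2]   # made-up subscription charge hours
evening = [19, 20, 21, 22]  # made-up entertainment purchase hours

print(f"arithmetic mean, overnight: {np.mean(overnight):.2f}")
print(f"circular mean, overnight:   {circular_mean_hour(overnight):.2f}")
print(f"circular mean, evening:     {circular_mean_hour(evening):.2f}")
```

For categories concentrated in the daytime the two means agree closely, so the circular version mainly matters for categories with activity on both sides of midnight. The circular mean is also poorly defined when hours are spread evenly around the clock, in which case the heatmap is the more reliable view.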
Interpreting Results and Next Steps¶
NeMo Safe Synthesizer produced novel synthetic rows and transaction sequences while preserving statistical patterns in the source data. The synthetic dataset is best understood as another sample from the same broader transaction population. Individual values will differ from the source sample, but the category mix, timing behavior, and amount distributions should remain within a useful range.
That is the practical promise of safe synthetic data: not a perfect clone, and not random fake data, but a privacy-aware substitute that retains enough signal for meaningful development, analysis, and model experimentation.
On your own time-series dataset, use the same workflow: configure grouping and ordering, generate synthetic sequences, review the built-in report, and add domain-specific visual checks for the patterns that matter most. See the configuration docs for the full parameter reference.