In [1]:

# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

# ---
# jupyter:
#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: percent
#       format_version: '1.3'
#   kernelspec:
#     display_name: Python 3
#     language: python
#     name: python3
# ---

🕵️ Your First Anonymization

Detect sensitive entities and replace them with LLM-generated substitutes -- the simplest end-to-end example of Anonymizer.

📚 What you'll learn

Load a CSV dataset and configure Anonymizer in a few lines
Preview anonymized results on a small sample before committing to a full run
Inspect entity detection and replacement with display_record()
Process the full dataset with run()

Tip: First time running notebooks? Start with setup instructions.

⚙️ Setup

Check if your NVIDIA_API_KEY from build.nvidia.com is registered for model access.
- The default build.nvidia.com (NVIDIA Build) setup is a convenient way to try Anonymizer and iterate on previews. Use of NVIDIA Build is subject to NVIDIA Build's own terms of service and privacy practices, which are separate from and independent of the NeMo Framework library. NVIDIA Build is intended for evaluation and testing purposes only and may not be used in production environments. Do not upload any confidential information or personal data when using NVIDIA Build. Your use of NVIDIA Build is logged for security purposes and to improve NVIDIA products and services.
- Request and token rate limits on build.nvidia.com vary by account and model access, and lower-volume development access can be slow for full-dataset runs. Start with preview() on a small sample, then move to your own endpoint for production data and usage.
Import the core Anonymizer classes: Anonymizer, AnonymizerConfig, AnonymizerInput, and Substitute.
Anonymizer() initializes with the default model provider -- no extra config needed.
configure_logging(LoggingConfig.default()) keeps logs at INFO. Switch to LoggingConfig.debug() when troubleshooting.

In [2]:

import getpass
import os

if not os.getenv("NVIDIA_API_KEY"):
    key = getpass.getpass("Enter NVIDIA_API_KEY from build.nvidia.com: ").strip()
    if not key:
        raise RuntimeError("NVIDIA_API_KEY is required to run these notebooks.")
    os.environ["NVIDIA_API_KEY"] = key

In [ ]:

from anonymizer import Anonymizer, AnonymizerConfig, AnonymizerInput, LoggingConfig, Substitute, configure_logging

configure_logging(LoggingConfig.default())

In [4]:

anonymizer = Anonymizer()

[13:13:46] [INFO] 🔧 Anonymizer initialized with 3 model configs

[13:13:46] [INFO]   |-- 🔎 detector:  gliner-pii-detector

[13:13:46] [INFO]   |-- ✅ validator: gpt-oss-120b

[13:13:46] [INFO]   |-- 🧩 augmenter: gpt-oss-120b

📦 Load data and configure

AnonymizerInput points to your CSV and names the text column. data_summary gives the LLM context about the kind of text it will process.
The default model configs handle records of up to 2,000 tokens each.
AnonymizerConfig with Substitute() tells Anonymizer to replace detected entities with LLM-generated synthetic values for names, cities, dates, etc.
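
The 2,000-token guideline above can be screened for before any model calls are made. The following is a minimal sketch and not part of Anonymizer: it assumes roughly four characters per token for English text (a heuristic, not the tokenizer the default models use), and `estimate_tokens` and `find_long_records` are hypothetical helper names.

```python
import csv
import io

CHARS_PER_TOKEN = 4  # assumption: rough average for English text, not a real tokenizer
MAX_TOKENS = 2000    # per-record guideline quoted above for the default model configs


def estimate_tokens(text):
    """Rough token estimate from character count (heuristic only)."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def find_long_records(csv_text, text_column, max_tokens=MAX_TOKENS):
    """Return (row_index, estimated_tokens) for rows that may exceed the limit."""
    reader = csv.DictReader(io.StringIO(csv_text))
    flagged = []
    for index, row in enumerate(reader):
        tokens = estimate_tokens(row[text_column])
        if tokens > max_tokens:
            flagged.append((index, tokens))
    return flagged


# Tiny in-memory CSV: one short record and one deliberately oversized record.
sample_csv = "biography\nshort text\n" + "x" * 9000 + "\n"
print(find_long_records(sample_csv, "biography"))
```

Rows flagged this way are candidates for splitting or trimming before calling preview(). Treat the estimate as a rough screen only, since real token counts depend on the model's tokenizer.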

In [5]:

input_data = AnonymizerInput(
    source="https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv",
    text_column="biography",
    data_summary="Biographical profiles of individuals",
)

config = AnonymizerConfig(replace=Substitute())

👁️ Preview

preview() runs on a small sample so you can iterate quickly.
Always preview before processing the full dataset -- it's the fastest way to catch prompt or config issues early.

In [6]:

preview = anonymizer.preview(config=config, data=input_data, num_records=3)

[13:13:46] [INFO] 👀 Preview mode: 📂 Loaded 3 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv (column: 'biography')

[13:13:46] [INFO] 🔍 Running entity detection on 3 records

[13:13:46] [INFO] detection labels in scope: (default: 65 labels; see anonymizer.DEFAULT_ENTITY_LABELS for list)

[13:14:17] [INFO]   |-- 📋 Detection complete — 80 entities found across 3 records (0 failed) [30.6s]

[13:14:17] [INFO]   |-- labels: first_name=23, state=6, organization_name=6, age=5, occupation=5, city=5, company_name=4, last_name=3, race_ethnicity=3, language=3, political_view=3, education_level=3, field_of_study=2, religious_belief=2, street_address=2, degree=1, university=1, place_name=1, date_of_birth=1, employment_status=1

[13:14:17] [INFO] 🔄 Running Substitute replacement

[13:15:14] [INFO]   |-- 📋 Replacement complete (0 failed) [57.4s]

[13:15:14] [INFO] 🎉 Pipeline complete — 3 records processed, 0 total failures
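
The `labels:` line in the log above can be cross-checked against the reported entity total. This is a small sketch that assumes the log line keeps the `name=count, name=count` layout shown here; `parse_label_counts` is a hypothetical helper, not an Anonymizer function.

```python
def parse_label_counts(log_line):
    """Parse 'labels: a=1, b=2' from a log line into a {label: count} dict."""
    _, _, tail = log_line.partition("labels:")
    counts = {}
    for item in tail.split(","):
        name, sep, value = item.strip().partition("=")
        if sep:  # skip fragments that are not name=count pairs
            counts[name] = int(value)
    return counts


# The labels line copied from the preview log above.
line = (
    "[13:14:17] [INFO]   |-- labels: first_name=23, state=6, organization_name=6, "
    "age=5, occupation=5, city=5, company_name=4, last_name=3, race_ethnicity=3, "
    "language=3, political_view=3, education_level=3, field_of_study=2, "
    "religious_belief=2, street_address=2, degree=1, university=1, place_name=1, "
    "date_of_birth=1, employment_status=1"
)

counts = parse_label_counts(line)
print(sum(counts.values()))  # 80, matching "80 entities found across 3 records"
```

If the sum and the reported total ever disagree, that is a cue to look more closely at the detection step before running the full dataset.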

🔍 Inspect

display_record() shows the original text with highlighted entities, the replacement map, and the anonymized output -- all in one view.
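The preview dataframe places the original text (biography in this example) next to the substituted text (biography_replaced), along with the tagged spans and the detected entities.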

In [7]:

preview.display_record(0)

Anonymizer Preview (record 0)

Original

Bobby| first_name Watford| last_name, a 40| age‑year‑old Mexican| race_ethnicity veterinarian| occupation living in Denver| city, Colorado| state, grew up on the outskirts of the city and developed a love for animals early on. After graduating from Jefferson High| organization_name, he earned his DVM| degree at the University of Colorado Boulder| university, where he also completed a research stint in wildlife health| field_of_study. Fluent in English| language, Bobby| first_name has always described his upbringing as a blend of small‑town curiosity and the vibrant culture of his community, values that continue to shape his compassionate approach to animal care.

Since finishing his training, Bobby| first_name has worked at VCA Animal Hospital| company_name and later at the Colorado Veterinary Clinic| organization_name, where he now leads a busy mixed‑practice team. He identifies as a Christian Democrat| political_view and often volunteers at local shelters, a habit encouraged by his wife, Maya| first_name, and their two teenage children, Aria| first_name and Leo| first_name. Outside the clinic, Bobby| first_name enjoys hiking the Rockies| place_name with his family and mentoring veterinary students from his alma mater.

Replaced

Ethan| first_name Henderson| last_name, a 45| age‑year‑old Vietnamese| race_ethnicity marine biologist| occupation living in Portland| city, Oregon| state, grew up on the outskirts of the city and developed a love for animals early on. After graduating from Lincoln High| organization_name, he earned his Ph.D.| degree at the University of Oregon| university, where he also completed a research stint in marine ecology| field_of_study. Fluent in Spanish| language, Ethan| first_name has always described his upbringing as a blend of small‑town curiosity and the vibrant culture of his community, values that continue to shape his compassionate approach to animal care.

Since finishing his training, Ethan| first_name has worked at PetCare Medical Center| company_name and later at the Oregon Animal Wellness Center| organization_name, where he now leads a busy mixed‑practice team. He identifies as a Libertarian| political_view and often volunteers at local shelters, a habit encouraged by his wife, Leah| first_name, and their two teenage children, Sofia| first_name and Noah| first_name. Outside the clinic, Ethan| first_name enjoys hiking the Cascade Range| place_name with his family and mentoring veterinary students from his alma mater.

Replacement Map

Original	Label	Replacement
40	age	45
Aria	first_name	Sofia
Bobby	first_name	Ethan
Christian Democrat	political_view	Libertarian
Colorado	state	Oregon
Colorado Veterinary Clinic	organization_name	Oregon Animal Wellness Center
DVM	degree	Ph.D.
Denver	city	Portland
English	language	Spanish
Jefferson High	organization_name	Lincoln High
Leo	first_name	Noah
Maya	first_name	Leah
Mexican	race_ethnicity	Vietnamese
Rockies	place_name	Cascade Range
University of Colorado Boulder	university	University of Oregon
VCA Animal Hospital	company_name	PetCare Medical Center
Watford	last_name	Henderson
veterinarian	occupation	marine biologist
wildlife health	field_of_study	marine ecology
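
To make the replacement map concrete, here is a minimal sketch of applying such a map to text with plain string matching. It is an illustration only and not how Anonymizer performs substitution; `apply_replacement_map` is a hypothetical helper, and the sample sentence is hand-built from entries in the map above.

```python
import re


def apply_replacement_map(text, replacements):
    """Replace every occurrence of each original string in a single pass."""
    if not replacements:
        return text
    # Longest originals first, so "Colorado Veterinary Clinic" wins over "Colorado".
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    # One pass over the text, so replacement values are never replaced again.
    return pattern.sub(lambda match: replacements[match.group(0)], text)


replacement_map = {
    "Bobby": "Ethan",
    "Colorado": "Oregon",
    "Colorado Veterinary Clinic": "Oregon Animal Wellness Center",
    "Denver": "Portland",
}
sentence = "Bobby works at the Colorado Veterinary Clinic in Denver, Colorado."
print(apply_replacement_map(sentence, replacement_map))
```

Ordering by length matters because shorter originals can be prefixes of longer ones. Plain matching also has limits that a position-based approach avoids: it would rewrite "40" inside an unrelated number, and a flat dictionary cannot give the same original string two different replacements.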

In [8]:

preview.display_record(1)

Anonymizer Preview (record 1)

Original

Idilio| first_name Bell| last_name is a 37| age‑year‑old astronomer| occupation living in Edison| city, New Jersey| state. Born on November 21, 1988| date_of_birth, he grew up in a bilingual Italian| race_ethnicity household and speaks English| language at home and work. He earned his bachelor’s degree| education_level in physics| field_of_study from the University of New Jersey| state and later completed a PhD in astrophysics| education_level at Princeton| city, where his dissertation focused on exoplanet atmospheres. After graduation he spent three years at NASA| organization_name’s Goddard Space Flight Center| organization_name before joining SpaceX| organization_name’s research division, where he now leads a team analyzing data from the Starlink| organization_name telescope array. Idilio| first_name describes himself as secular| religious_belief and leans progressive| political_view on most political issues, often volunteering for science outreach programs in his community.

Outside the lab, Idilio| first_name shares a modest house on West Roberts Drive| street_address with his wife, Maya| first_name, and their two young daughters, Lina| first_name and Zara| first_name. His mother, Elena| first_name, lives nearby and still cooks the family’s favorite pasta on Sundays, while his father, Marco| first_name, retired| employment_status from an engineering firm| company_name in New York| state. Family gatherings are a mix of lively conversation and stargazing sessions on the backyard deck, where Idilio| first_name points out constellations and tells stories of the cosmos that inspire his children’s curiosity.

Replaced

Santiago| first_name Kumar| last_name is a 36| age‑year‑old geophysicist| occupation living in Austin| city, Texas| state. Born on July 5, 1989| date_of_birth, he grew up in a bilingual Greek| race_ethnicity household and speaks Spanish| language at home and work. He earned his associate’s degree| education_level in chemistry| field_of_study from the University of Oregon| state and later completed a master’s degree in planetary geology| education_level at Portland| city, where his dissertation focused on exoplanet atmospheres. After graduation he spent three years at European Space Agency| organization_name’s National Renewable Energy Laboratory| organization_name before joining Blue Origin| organization_name’s research division, where he now leads a team analyzing data from the OneWeb| organization_name telescope array. Santiago| first_name describes himself as agnostic| religious_belief and leans centrist| political_view on most political issues, often volunteering for science outreach programs in his community.

Outside the lab, Santiago| first_name shares a modest house on North Willow Lane| street_address with his wife, Priya| first_name, and their two young daughters, Aisha| first_name and Nadia| first_name. His mother, Sofia| first_name, lives nearby and still cooks the family’s favorite pasta on Sundays, while his father, Diego| first_name, part-time| employment_status from an architectural studio| company_name in Florida| state. Family gatherings are a mix of lively conversation and stargazing sessions on the backyard deck, where Santiago| first_name points out constellations and tells stories of the cosmos that inspire his children’s curiosity.

Replacement Map

Original	Label	Replacement
37	age	36
Bell	last_name	Kumar
Edison	city	Austin
Elena	first_name	Sofia
English	language	Spanish
Goddard Space Flight Center	organization_name	National Renewable Energy Laboratory
Idilio	first_name	Santiago
Italian	race_ethnicity	Greek
Lina	first_name	Aisha
Marco	first_name	Diego
Maya	first_name	Priya
NASA	organization_name	European Space Agency
New Jersey	state	Texas
New Jersey	state	Oregon
New York	state	Florida
November 21, 1988	date_of_birth	July 5, 1989
PhD in astrophysics	education_level	master’s degree in planetary geology
Princeton	city	Portland
SpaceX	organization_name	Blue Origin
Starlink	organization_name	OneWeb
West Roberts Drive	street_address	North Willow Lane
Zara	first_name	Nadia
astronomer	occupation	geophysicist
bachelor’s degree	education_level	associate’s degree
engineering firm	company_name	architectural studio
in physics	field_of_study	in chemistry
progressive	political_view	centrist
retired	employment_status	part-time
secular	religious_belief	agnostic

In [9]:

preview.dataframe

Out[9]:

	biography	biography_with_spans	final_entities	biography_replaced
0	Bobby Watford, a 40‑year‑old Mexican veterinar...	<first_name>Bobby</first_name> <last_name>Watf...	{'entities': [{'end_position': 5, 'id': 'first...	Ethan Henderson, a 45‑year‑old Vietnamese mari...
1	Idilio Bell is a 37‑year‑old astronomer living...	<first_name>Idilio</first_name> <last_name>Bel...	{'entities': [{'end_position': 6, 'id': 'first...	Santiago Kumar is a 36‑year‑old geophysicist l...
2	Jodi Allison, 36, lives at 204 Bluegrass in Cl...	<first_name>Jodi</first_name> <last_name>Allis...	{'entities': [{'end_position': 4, 'id': 'first...	Sofia Keller, 42, lives at 587 Maple in Macon,...
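
The `biography_with_spans` column marks each detected entity inline as `<label>text</label>`. A short sketch for pulling those pairs back out with a regular expression follows. It assumes tags are never nested and that labels use only lowercase letters and underscores; `extract_tagged_entities` is a hypothetical helper, and the sample string is hand-built from record 0 for illustration.

```python
import re
from collections import Counter

# Assumption: tags look like <label>text</label>, are not nested,
# and labels consist of lowercase letters and underscores.
TAG_PATTERN = re.compile(r"<(?P<label>[a-z_]+)>(?P<text>.*?)</(?P=label)>", re.DOTALL)


def extract_tagged_entities(tagged_text):
    """Return (label, text) pairs in the order they appear."""
    return [
        (match.group("label"), match.group("text"))
        for match in TAG_PATTERN.finditer(tagged_text)
    ]


sample = (
    "<first_name>Bobby</first_name> <last_name>Watford</last_name>, "
    "a <age>40</age>-year-old veterinarian"
)
entities = extract_tagged_entities(sample)
label_counts = Counter(label for label, _ in entities)
print(entities)
print(label_counts)
```

Counting labels per record this way gives a quick view of which entity types dominate a given row without opening display_record() for each one.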

🚀 Full run

run() processes the entire dataset with the same config you previewed.
Access the output via result.dataframe.

In [10]:

result = anonymizer.run(config=config, data=input_data)
print(result)

[13:15:14] [INFO] 📂 Loaded 25 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv (column: 'biography')

[13:15:14] [INFO] 🔍 Running entity detection on 25 records

[13:15:14] [INFO] detection labels in scope: (default: 65 labels; see anonymizer.DEFAULT_ENTITY_LABELS for list)

[13:16:05] [INFO]   |-- 📋 Detection complete — 648 entities found across 25 records (0 failed) [50.3s]

[13:16:05] [INFO]   |-- labels: first_name=152, city=48, occupation=45, company_name=40, education_level=33, race_ethnicity=31, state=30, organization_name=30, last_name=27, age=26, political_view=26, religious_belief=25, street_address=23, university=21, language=21, field_of_study=13, place_name=12, county=11, employment_status=10, date_of_birth=9, date=5, degree=4, school_name=1, landmark=1, journal_name=1, country=1, gender=1, postcode=1

[13:16:05] [INFO] 🔄 Running Substitute replacement

[13:16:35] [INFO]   |-- 📋 Replacement complete (0 failed) [30.5s]

[13:16:35] [INFO] 🎉 Pipeline complete — 25 records processed, 0 total failures

AnonymizerResult(rows=25, columns=4, trace_columns=21, failed_records=0)

In [11]:

result.dataframe.head()

Out[11]:

	biography	biography_with_spans	final_entities	biography_replaced
0	Bobby Watford, a 40‑year‑old Mexican veterinar...	<first_name>Bobby</first_name> <last_name>Watf...	{'entities': array([{'end_position': 5, 'id': ...	Ethan Hernandez, a 52‑year‑old Filipino zoolog...
1	Idilio Bell is a 37‑year‑old astronomer living...	<first_name>Idilio</first_name> <last_name>Bel...	{'entities': array([{'end_position': 6, 'id': ...	Rafael Khan is a 42‑year‑old planetary geologi...
2	Jodi Allison, 36, lives at 204 Bluegrass in Cl...	<first_name>Jodi</first_name> <last_name>Allis...	{'entities': array([{'end_position': 4, 'id': ...	Leah Harper, 42, lives at 204 Willow in Eugene...
3	James Mills is a 69‑year‑old paramedic who liv...	<first_name>James</first_name> <last_name>Mill...	{'entities': array([{'end_position': 5, 'id': ...	Ethan Harper is a 71‑year‑old firefighter who ...
4	Nancy Burton is a 21‑year‑old cashier who live...	<first_name>Nancy</first_name> <last_name>Burt...	{'entities': array([{'end_position': 5, 'id': ...	Leah Hawkins is a 27‑year‑old stock clerk who ...

📊 (Optional) Evaluate replacement quality

evaluate() is a separate, opt-in step that scores the output with LLM-as-judge metrics.
For Substitute, all four metrics run: Detection Validity, Type Fidelity, Relational Consistency, Attribute Fidelity.
Skip it for routine runs; call it when you want an LLM-judged quality check on the output. It costs LLM calls per record, so try it on the preview first.

In [ ]:

evaluated = anonymizer.evaluate(preview)
evaluated.display_record(0)

⏭️ Next steps

🔍 Inspecting Detected Entities -- dig into what the detection pipeline found and debug quality.
🎯 Choosing a Replacement Strategy -- compare Redact, Annotate, Hash, and Substitute side-by-side.
✏️ Rewriting Biographies -- generate privacy-safe paraphrases instead of token-level replacements.