🕵️ Your First Anonymization
Detect sensitive entities and replace them with LLM-generated substitutes -- the simplest end-to-end example of Anonymizer.
📚 What you'll learn
- Load a CSV dataset and configure Anonymizer in a few lines
- Preview anonymized results on a small sample before committing to a full run
- Inspect entity detection and replacement with display_record()
- Process the full dataset with run()
Tip: First time running notebooks? Start with setup instructions.
⚙️ Setup
- Check if your NVIDIA_API_KEY from build.nvidia.com is registered for model access.
- Import the core Anonymizer classes: Anonymizer, AnonymizerConfig, AnonymizerInput, and Substitute.
- Anonymizer() initializes with the default model provider -- no extra config needed.
In [2]:
import getpass
import os

# Prompt for the key only when it is not already set in the environment.
if not os.getenv("NVIDIA_API_KEY"):
    key = getpass.getpass("Enter NVIDIA_API_KEY from build.nvidia.com: ").strip()
    if not key:
        raise RuntimeError("NVIDIA_API_KEY is required to run these notebooks.")
    os.environ["NVIDIA_API_KEY"] = key
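The cell above prompts interactively, which does not work in scripts or CI jobs. A minimal sketch of a non-interactive check follows, assuming the key is exported before the process starts. `resolve_api_key` is a hypothetical helper written for this example, not part of Anonymizer.

```python
import os


def resolve_api_key(env=None, name="NVIDIA_API_KEY"):
    """Return the key stored under `name`, or raise if it is missing or blank.

    Hypothetical helper for illustration; `env` defaults to os.environ and
    can be replaced with a plain dict for testing.
    """
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(f"{name} is required to run these notebooks.")
    return key


# Example with an explicit mapping instead of the real environment.
print(resolve_api_key({"NVIDIA_API_KEY": "  example-key  "}))  # example-key
```

Failing fast here gives a clearer error than a rejected request later in the pipeline.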
In [3]:
from anonymizer import Anonymizer, AnonymizerConfig, AnonymizerInput, Substitute
In [4]:
anonymizer = Anonymizer()
[13:31:16] [INFO] 🔧 Anonymizer initialized with 3 model configs
[13:31:16] [INFO] |-- 🔎 detector: gliner-pii-detector
[13:31:16] [INFO] |-- ✅ validator: gpt-oss-120b
[13:31:16] [INFO] |-- 🧩 augmenter: gpt-oss-120b
📦 Load data and configure
- AnonymizerInput points to your CSV and names the text column.
- data_summary gives the LLM context about the kind of text it will process.
- Records up to 2,000 tokens each work with the default model configs.
- AnonymizerConfig with Substitute() tells Anonymizer to replace detected entities with LLM-generated synthetic values for names, cities, dates, etc.
In [5]:
input_data = AnonymizerInput(
    source="https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv",
    text_column="biography",
    data_summary="Biographical profiles of individuals",
)
config = AnonymizerConfig(replace=Substitute())
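To get a rough idea of whether your own records fit the 2,000-token guideline, a quick estimate can help before you run anything. The sketch below assumes about 4 characters per token, a common rule of thumb for English text. It is not the tokenizer the models use, so treat the result as approximate. Both helper names are hypothetical.

```python
import math


def estimate_tokens(text, chars_per_token=4.0):
    """Rough token estimate; the 4-characters-per-token ratio is an assumption."""
    return math.ceil(len(text) / chars_per_token)


def find_long_records(texts, budget=2000):
    """Return the indices of records whose estimated token count exceeds `budget`."""
    return [i for i, text in enumerate(texts) if estimate_tokens(text) > budget]


# Made-up records: one short, one well past the guideline.
records = ["A short biography.", "x" * 9000]
print(find_long_records(records))  # [1]
```

Records flagged this way are candidates for splitting or for a closer look, not guaranteed failures.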
👁️ Preview
- preview() runs on a small sample so you can iterate quickly.
- Always preview before processing the full dataset -- it's the fastest way to catch prompt or config issues early.
In [6]:
preview = anonymizer.preview(config=config, data=input_data, num_records=3)
[13:31:16] [INFO] 📂 Loaded 25 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv (column: 'biography')
[13:31:16] [INFO] detection labels in scope: (default: 65 labels; see anonymizer.DEFAULT_ENTITY_LABELS for list)
[13:31:16] [INFO] |-- 👀 Preview mode: processing 3 of 25 records
[13:31:16] [INFO] 🔍 Running entity detection on 3 records
[13:32:00] [INFO] |-- 📋 Detection complete — 78 entities found across 3 records (0 failed) [44.7s]
[13:32:00] [INFO] |-- labels: first_name=22, organization_name=7, age=5, occupation=5, city=4, state=4, degree=4, university=4, field_of_study=4, last_name=3, race_ethnicity=3, political_view=3, language=2, religious_belief=2, street_address=2, place_name=1, date_of_birth=1, project_name=1, employment_status=1
[13:32:00] [INFO] 🔄 Running Substitute replacement
[13:32:20] [INFO] |-- 📋 Replacement complete (0 failed) [19.8s]
[13:32:20] [INFO] 🎉 Pipeline complete — 3 records processed, 0 total failures
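The `labels:` line in the log is a handy summary of what the detector found. A minimal sketch for turning it into a dictionary follows, assuming the `name=count` format shown above, which may change between versions. `parse_label_counts` is a hypothetical helper, not an Anonymizer function.

```python
import re


def parse_label_counts(log_line):
    """Parse the 'labels: a=1, b=2' part of a log line into {'a': 1, 'b': 2}."""
    # Everything after "labels:" holds the name=count pairs; empty if absent.
    _, _, tail = log_line.partition("labels:")
    return {name: int(count) for name, count in re.findall(r"(\w+)=(\d+)", tail)}


line = (
    "[13:32:00] [INFO] |-- labels: first_name=22, organization_name=7, age=5, "
    "occupation=5, city=4, state=4, degree=4, university=4, field_of_study=4, "
    "last_name=3, race_ethnicity=3, political_view=3, language=2, "
    "religious_belief=2, street_address=2, place_name=1, date_of_birth=1, "
    "project_name=1, employment_status=1"
)
counts = parse_label_counts(line)
print(sum(counts.values()))  # 78, matching "78 entities found" in the log
```

Comparing the per-label counts between preview runs is a quick way to see whether a config change shifted what gets detected.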
🔍 Inspect
- display_record() shows the original text with highlighted entities, the replacement map, and the anonymized output -- all in one view.
- The result dataframe has original and substituted text side-by-side.
In [7]:
preview.display_record(0)
In [8]:
preview.display_record(1)
In [9]:
preview.dataframe
Out[9]:
| | biography | biography_with_spans | final_entities | biography_replaced |
|---|---|---|---|---|
| 0 | Bobby Watford, a 40‑year‑old Mexican veterinar... | <first_name>Bobby</first_name> <last_name>Watf... | {'entities': [{'end_position': 5, 'id': 'first... | Ethan Kline, a 45‑year‑old Vietnamese wildlife... |
| 1 | Idilio Bell is a 37‑year‑old astronomer living... | <first_name>Idilio</first_name> <last_name>Bel... | {'entities': [{'end_position': 6, 'id': 'first... | Santiago Kumar is a 42‑year‑old planetary scie... |
| 2 | Jodi Allison, 36, lives at 204 Bluegrass in Cl... | <first_name>Jodi</first_name> <last_name>Allis... | {'entities': [{'end_position': 4, 'id': 'first... | Leah Keller, 42, lives at 317 Maplewood in Ale... |
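The biography_with_spans column wraps each detected entity in a tag named after its label. If you want the entities as plain Python data, a small regex is enough for a first look. This sketch assumes tags are well formed and not nested. The sample string is hand-written for illustration, not taken from the dataset, and `extract_spans` is a hypothetical helper.

```python
import re
from collections import Counter

# Matches <label>value</label>; assumes well-formed, non-nested tags.
SPAN_PATTERN = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def extract_spans(tagged_text):
    """Return (label, value) pairs in the order they appear in the text."""
    return SPAN_PATTERN.findall(tagged_text)


# Hand-written sample in the same tag style as biography_with_spans.
sample = (
    "<first_name>Ada</first_name> <last_name>Example</last_name> "
    "lives in <city>Springfield</city>."
)
spans = extract_spans(sample)
print(spans)
# [('first_name', 'Ada'), ('last_name', 'Example'), ('city', 'Springfield')]
print(Counter(label for label, _ in spans))
```

For anything beyond a quick look, the structured final_entities column is likely the better source, since it carries positions as well as labels.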
🚀 Full run
- run() processes the entire dataset with the same config you previewed.
- Access the output via result.dataframe.
In [10]:
result = anonymizer.run(config=config, data=input_data)
print(result)
[13:32:20] [INFO] 📂 Loaded 25 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv (column: 'biography')
[13:32:20] [INFO] detection labels in scope: (default: 65 labels; see anonymizer.DEFAULT_ENTITY_LABELS for list)
[13:32:20] [INFO] 🔍 Running entity detection on 25 records
[13:37:23] [INFO] |-- 📋 Detection complete — 666 entities found across 25 records (0 failed) [303.2s]
[13:37:23] [INFO] |-- labels: first_name=154, organization_name=62, occupation=47, city=41, university=36, field_of_study=34, race_ethnicity=30, last_name=27, state=27, age=26, degree=25, political_view=25, religious_belief=25, street_address=23, language=19, place_name=15, employment_status=11, county=11, date_of_birth=9, education_level=7, date=5, company_name=4, country=1, gender=1, postcode=1
[13:37:23] [INFO] 🔄 Running Substitute replacement
/Users/lramaswamy/Documents/github/Anonymizer/src/anonymizer/engine/row_partitioning.py:42: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  pd.concat(list(parts), ignore_index=True)
[13:39:33] [INFO] |-- 📋 Replacement complete (0 failed) [129.3s]
[13:39:33] [INFO] 🎉 Pipeline complete — 25 records processed, 0 total failures
AnonymizerResult(rows=25, columns=4, trace_columns=21, failed_records=0)
In [11]:
result.dataframe.head()
Out[11]:
| | biography | biography_with_spans | final_entities | biography_replaced |
|---|---|---|---|---|
| 0 | Bobby Watford, a 40‑year‑old Mexican veterinar... | <first_name>Bobby</first_name> <last_name>Watf... | {'entities': array([{'end_position': 5, 'id': ... | Ethan Hawthorne, a 45‑year‑old Vietnamese mari... |
| 1 | Idilio Bell is a 37‑year‑old astronomer living... | <first_name>Idilio</first_name> <last_name>Bel... | {'entities': array([{'end_position': 6, 'id': ... | Mateo Kline is a 36‑year‑old geophysicist livi... |
| 2 | Jodi Allison, 36, lives at 204 Bluegrass in Cl... | <first_name>Jodi</first_name> <last_name>Allis... | {'entities': array([{'end_position': 4, 'id': ... | Leah Harper, 42, lives at 312 Magnolia in Sava... |
| 3 | James Mills is a 69‑year‑old paramedic who liv... | <first_name>James</first_name> <last_name>Mill... | {'entities': array([{'end_position': 5, 'id': ... | Robert Harper is a 71‑year‑old firefighter who... |
| 4 | Nancy Burton is a 21‑year‑old cashier who live... | <first_name>Nancy</first_name> <last_name>Burt... | {'entities': array([{'end_position': 5, 'id': ... | Aisha Khan is a 22‑year‑old stock clerk who li... |
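After a full run, one useful sanity check is whether any original entity value still appears verbatim in the replaced text. The sketch below compares a span-tagged string with its replaced counterpart. It assumes the tag format seen in biography_with_spans and skips very short values such as two-digit ages to reduce false hits. The sample strings are hand-written, and `find_leaked_values` is a hypothetical helper.

```python
import re

# Matches <label>value</label>; assumes well-formed, non-nested tags.
SPAN_PATTERN = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def find_leaked_values(tagged_text, replaced_text, min_length=3):
    """Return original entity values that still appear in the replaced text."""
    leaked = []
    for _label, value in SPAN_PATTERN.findall(tagged_text):
        if len(value) < min_length:
            continue  # short values match by coincidence too often
        if value in replaced_text and value not in leaked:
            leaked.append(value)
    return leaked


# Hand-written example: the city was not substituted.
tagged = (
    "<first_name>Ada</first_name> <last_name>Example</last_name>, "
    "<age>36</age>, lives in <city>Springfield</city>."
)
replaced = "Maya Sample, 41, lives in Springfield."
print(find_leaked_values(tagged, replaced))  # ['Springfield']
```

A hit is a prompt to open that record with display_record(), not proof of a failure: a substitute can coincide with an original value by chance.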
⏭️ Next steps
- 🔍 Inspecting Detected Entities -- dig into what the detection pipeline found and debug quality.
- 🎯 Choosing a Replacement Strategy -- compare Redact, Annotate, Hash, and Substitute side-by-side.
- ✏️ Rewriting Biographies -- generate privacy-safe paraphrases instead of token-level replacements.