🕵️ Rewriting Biographies

Instead of replacing entities with tokens, rewrite mode generates a privacy-safe transformation of the entire text. The pipeline:

1. Detects entities (same as replace mode, plus latent entity detection)
2. Classifies the domain and assigns sensitivity dispositions
3. Generates a rewritten version that obscures sensitive entities
4. Evaluates quality (utility) and privacy (leakage) with an automated repair loop
5. Runs a final LLM judge for informational scores
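
The evaluate-and-repair step can be pictured as a bounded loop. The following is a minimal sketch of that idea, not Anonymizer's implementation: `Scores`, `evaluate_repair_loop`, the toy evaluator and repairer, and both thresholds are hypothetical names and values chosen for illustration, and flagging a record for review when repairs run out is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Scores:
    utility: float  # higher is better
    leakage: float  # lower is better


def evaluate_repair_loop(
    text: str,
    evaluate: Callable[[str], Scores],
    repair: Callable[[str, Scores], str],
    max_repair_iterations: int = 3,
    min_utility: float = 0.7,  # hypothetical threshold
    max_leakage: float = 0.0,  # hypothetical threshold
) -> tuple[str, Scores, bool]:
    """Return (final_text, final_scores, needs_human_review)."""

    def passes(s: Scores) -> bool:
        return s.utility >= min_utility and s.leakage <= max_leakage

    scores = evaluate(text)
    for _ in range(max_repair_iterations):
        if passes(scores):
            break
        # Ask the (hypothetical) repairer for a new draft, then re-score it.
        text = repair(text, scores)
        scores = evaluate(text)
    # Assumption: anything still failing after the last repair goes to a human.
    return text, scores, not passes(scores)


# Toy stand-ins: leakage counts a literal city name, repair generalizes it.
def toy_evaluate(text: str) -> Scores:
    return Scores(utility=0.9, leakage=float(text.count("Denver")))


def toy_repair(text: str, scores: Scores) -> str:
    return text.replace("Denver", "a western city", 1)


final_text, final_scores, needs_review = evaluate_repair_loop(
    "He lives in Denver.", toy_evaluate, toy_repair
)
print(final_text, final_scores, needs_review)
```

In the real pipeline the evaluator and repairer are LLM calls; the sketch only shows why a cap on repair iterations bounds cost while still letting most records converge.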

📚 What you'll learn

- Configure rewrite mode with PrivacyGoal to specify what to protect and what to preserve
- Set evaluation criteria and risk tolerance for automated quality checks
- Preview rewritten text and inspect utility / leakage scores
- Triage flagged records with needs_human_review

Tip: First time running notebooks? Start with setup instructions.

⚙️ Setup

- Check that your NVIDIA_API_KEY from build.nvidia.com is registered for model access.
  - Treat the default build.nvidia.com setup as a convenient experimentation path. For privacy-sensitive or production data, switch to a secure endpoint you trust and to which you are comfortable sending data.
  - Request/token rate limits on build.nvidia.com vary by account and model access, and lower-volume development access can be slow for full runs. Start with preview() on a small sample.
- Import Rewrite and PrivacyGoal.
- Anonymizer() initializes with the default model provider -- no extra config needed.
- configure_logging() controls verbosity -- switch to configure_logging(LoggingConfig.debug()) when troubleshooting.

In [1]:

import getpass
import os

if not os.getenv("NVIDIA_API_KEY"):
    key = getpass.getpass("Enter NVIDIA_API_KEY from build.nvidia.com: ").strip()
    if not key:
        raise RuntimeError("NVIDIA_API_KEY is required to run these notebooks.")
    os.environ["NVIDIA_API_KEY"] = key

In [2]:

from anonymizer import Anonymizer, AnonymizerConfig, AnonymizerInput, Rewrite, configure_logging, LoggingConfig
from anonymizer.config.rewrite import PrivacyGoal

configure_logging(LoggingConfig.default())

In [3]:

anonymizer = Anonymizer()

[16:15:48] [INFO] 🔧 Anonymizer initialized with 3 model configs
[16:15:48] [INFO]   |-- 🔎 detector:  gliner-pii-detector
[16:15:48] [INFO]   |-- ✅ validator: gpt-oss-120b
[16:15:48] [INFO]   |-- 🧩 augmenter: gpt-oss-120b

📦 Input data

Same biographies dataset used in earlier notebooks -- familiar data makes it easy to compare rewrite output against replace output.

In [4]:

input_data = AnonymizerInput(
    source="https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv",
    text_column="biography",
    data_summary="Biographical profiles",
)

🎛️ Configure

PrivacyGoal spells out what to protect and what to preserve -- this gives the rewriter clear, domain-specific guidance.
risk_tolerance (default "low") and max_repair_iterations (default 3) control the automated quality gate -- see Risk tolerance for presets.

In [5]:

config = AnonymizerConfig(
    rewrite=Rewrite(
        privacy_goal=PrivacyGoal(
            protect="All direct identifiers and quasi-identifier combinations (names, locations, employers, dates)",
            preserve="Career trajectory, educational background, and professional accomplishments",
        ),
    ),
)

👁️ Preview

preview() runs on a small sample so you can iterate on privacy goals and evaluation criteria before committing to a full run.

In [6]:

preview = anonymizer.preview(
    config=config,
    data=input_data,
    num_records=3,
)

preview.display_record(0)

[16:16:07] [INFO] 📂 Loaded 25 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv (column: 'biography')
[16:16:07] [INFO] detection labels in scope: (default: 65 labels; see anonymizer.DEFAULT_ENTITY_LABELS for list)
[16:16:07] [INFO]   |-- 👀 Preview mode: processing 3 of 25 records
[16:16:07] [INFO] 🔍 Running entity detection on 3 records

[16:17:53] [INFO]   |-- 📋 Detection complete — 80 entities found across 3 records (0 failed) [105.8s]
[16:17:53] [INFO]   |-- labels: first_name=23, organization_name=9, age=5, occupation=5, city=4, state=4, degree=4, university=4, field_of_study=4, last_name=3, race_ethnicity=3, political_view=3, language=2, religious_belief=2, street_address=2, place_name=1, date_of_birth=1, employment_status=1
[16:17:53] [INFO] ✏️ Running rewrite pipeline
[16:21:15] [INFO] Evaluate-repair loop: all rows pass at iteration 0
[16:21:41] [INFO]   |-- 📋 Rewrite complete (0 failed) [228.2s]
[16:21:41] [INFO] 🎉 Pipeline complete — 3 records processed, 0 total failures

Anonymizer Rewrite Preview (record 0)

Original

Bobby| first_name Watford| last_name, a 40| age‑year‑old Mexican| race_ethnicity veterinarian| occupation living in Denver| city, Colorado| state, grew up on the outskirts of the city and developed a love for animals early on. After graduating from Jefferson High| organization_name, he earned his DVM| degree at the University of Colorado Boulder| university, where he also completed a research stint in wildlife health| field_of_study. Fluent in English| language, Bobby| first_name has always described his upbringing as a blend of small‑town curiosity and the vibrant culture of his community, values that continue to shape his compassionate approach to animal care.

Since finishing his training, Bobby| first_name has worked at VCA Animal Hospital| organization_name and later at the Colorado Veterinary Clinic| organization_name, where he now leads a busy mixed‑practice team. He identifies as a Christian Democrat| political_view and often volunteers at local shelters, a habit encouraged by his wife, Maya| first_name, and their two teenage children, Aria| first_name and Leo| first_name. Outside the clinic, Bobby| first_name enjoys hiking the Rockies| place_name with his family and mentoring veterinary students from his alma mater.

Rewritten

Diego Hawthorne, a veterinarian in his early 40s from a Hispanic background, living in a city in the Rocky Mountain region of a western U.S. state, grew up on the outskirts of the city and developed a love for animals early on. After graduating from a local high school, he earned a veterinary degree at a state university in the region, where he also completed a research stint in wildlife health. Fluent in English, Diego has always described his upbringing as a blend of small‑town curiosity and the vibrant culture of his community, values that continue to shape his compassionate approach to animal care.

Since finishing his training, Diego has worked at a veterinary hospital and later at a regional veterinary clinic, where he now leads a busy mixed‑practice team. He identifies with a moderate political view and often volunteers at local shelters, a habit encouraged by his wife, Sofia, and their two teenage children, Nina and Mateo. Outside the clinic, Diego enjoys hiking the Rockies with his family and mentoring veterinary students from his alma mater.

Scores

Utility: 0.85 | Leakage: 0.54 | Weighted Leakage Rate: 0.04 | Needs Review: No | Judge: privacy: 9/10, quality: 9/10, naturalness: 9/10

Entity Disposition

Entity	Label	Sensitivity	Protection
40	age	medium	generalize
Aria	first_name	high	replace
Bobby	first_name	high	replace
Christian Democrat	political_view	high	generalize
Colorado	state	medium	generalize
Colorado Veterinary Clinic	organization_name	medium	generalize
DVM	degree	medium	generalize
Denver	city	medium	generalize
English	language	low	leave_as_is
Jefferson High	organization_name	medium	generalize
Leo	first_name	high	replace
Maya	first_name	high	replace
Mexican	race_ethnicity	high	generalize
Rockies	place_name	low	leave_as_is
University of Colorado Boulder	university	medium	generalize
VCA Animal Hospital	organization_name	medium	generalize
Watford	last_name	high	replace
veterinarian	occupation	low	leave_as_is
wildlife health	field_of_study	low	leave_as_is
married	marital_status	low	leave_as_is
male	gender	low	leave_as_is
doctoral	education_level	low	leave_as_is
practice_leader	current_role	low	leave_as_is
Christian	religious_affiliation	high	generalize
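
Tallying the disposition table shows how sensitivity levels map to protection methods. The snippet below is a small sketch over a handful of rows copied by hand from the table above; how to read dispositions programmatically is not covered here, so the `rows` list is an illustrative stand-in.

```python
from collections import Counter

# (entity, label, sensitivity, protection) rows copied from the table above.
rows = [
    ("40", "age", "medium", "generalize"),
    ("Bobby", "first_name", "high", "replace"),
    ("Christian Democrat", "political_view", "high", "generalize"),
    ("Denver", "city", "medium", "generalize"),
    ("English", "language", "low", "leave_as_is"),
    ("Watford", "last_name", "high", "replace"),
    ("veterinarian", "occupation", "low", "leave_as_is"),
]

# Count how often each (sensitivity, protection) pair occurs.
pair_counts = Counter((sensitivity, protection) for _, _, sensitivity, protection in rows)

for (sensitivity, protection), count in sorted(pair_counts.items()):
    print(f"{sensitivity:<6} {protection:<12} {count}")
```

In this sample, low-sensitivity entities are left as-is, while high-sensitivity ones are either replaced (names) or generalized (political view).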

In [7]:

preview.display_record(1)

Anonymizer Rewrite Preview (record 1)

Original

Idilio| first_name Bell| last_name is a 37| age‑year‑old astronomer| occupation living in Edison| city, New Jersey| state. Born on November 21, 1988| date_of_birth, he grew up in a bilingual Italian| race_ethnicity household and speaks English| language at home and work. He earned his bachelor’s degree| degree in physics| field_of_study from the University of New Jersey| university and later completed a PhD| degree in astrophysics| field_of_study at Princeton| university, where his dissertation focused on exoplanet atmospheres. After graduation he spent three years at NASA’s Goddard Space Flight Center| organization_name before joining SpaceX| organization_name’s research division, where he now leads a team analyzing data from the Starlink telescope array| organization_name. Idilio| first_name describes himself as secular| religious_belief and leans progressive| political_view on most political issues, often volunteering for science outreach programs in his community.

Outside the lab, Idilio| first_name shares a modest house on West Roberts Drive| street_address with his wife, Maya| first_name, and their two young daughters, Lina| first_name and Zara| first_name. His mother, Elena| first_name, lives nearby and still cooks the family’s favorite pasta on Sundays, while his father, Marco| first_name, retired| employment_status from an engineering firm| organization_name in New York| state. Family gatherings are a mix of lively conversation and stargazing sessions on the backyard deck, where Idilio| first_name points out constellations and tells stories of the cosmos that inspire his children’s curiosity.

Rewritten

Mateo Khan is an astronomer in their late 30s living in a city in the Mid-Atlantic region, a state in the Northeastern United States. Born in 1988, they grew up in a bilingual European household and speaks English at home and work. They earned a bachelor’s degree in physics from the University of Massachusetts and later completed a PhD in astrophysics at Harvard University, where their dissertation focused on exoplanet atmospheres. After graduation they spent three years at European Space Agency’s ESTEC before joining Blue Origin’s research division, where they now lead a team analyzing data from the Lunar Reconnaissance Satellite network. Mateo describes themselves as secular and is active in science outreach programs in their community.

Outside the lab, Mateo shares a modest house on East Monroe Avenue with their spouse, Aisha, and their two young daughters, Nina and Leila. Their mother, Sofia, lives nearby and still cooks the family’s favorite pasta on Sundays, while their father, Diego, retired from an engineering firm in a state in the Northeastern United States. Family gatherings are a mix of lively conversation and stargazing sessions on the backyard deck, where Mateo points out constellations and tells stories of the cosmos that inspire their children’s curiosity.

Scores

Utility: 0.77 | Leakage: 0.00 | Weighted Leakage Rate: 0.00 | Needs Review: No | Judge: privacy: 9/10, quality: 9/10, naturalness: 9/10

Entity Disposition

Entity	Label	Sensitivity	Protection
37	age	medium	generalize
Bell	last_name	high	replace
Edison	city	medium	generalize
Elena	first_name	high	replace
English	language	low	leave_as_is
Idilio	first_name	high	replace
Italian	race_ethnicity	medium	generalize
Lina	first_name	high	replace
Marco	first_name	high	replace
Maya	first_name	high	replace
NASA’s Goddard Space Flight Center	organization_name	high	replace
New Jersey	state	medium	generalize
New York	state	medium	generalize
November 21, 1988	date_of_birth	medium	generalize
PhD	degree	low	leave_as_is
Princeton	university	medium	replace
SpaceX	organization_name	high	replace
Starlink telescope array	organization_name	medium	replace
University of New Jersey	university	medium	replace
West Roberts Drive	street_address	high	replace
Zara	first_name	high	replace
astronomer	occupation	low	leave_as_is
bachelor’s degree	degree	low	leave_as_is
engineering firm	organization_name	low	leave_as_is
in astrophysics	field_of_study	low	leave_as_is
physics	field_of_study	low	leave_as_is
progressive	political_view	high	remove
retired	employment_status	low	leave_as_is
secular	religious_belief	low	leave_as_is
male	gender	low	suppress_inference
married	marital_status	low	leave_as_is
2	num_children	low	leave_as_is
lead researcher	position	low	leave_as_is
science_outreach_volunteer	community_engagement	low	leave_as_is

🚀 Full run

result.dataframe has user-facing columns: rewritten text, scores, and the review flag.
result.trace_dataframe has every intermediate column for debugging.

In [8]:

result = anonymizer.run(config=config, data=input_data)

result.dataframe.head()

[16:22:12] [INFO] 📂 Loaded 25 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv (column: 'biography')
[16:22:12] [INFO] detection labels in scope: (default: 65 labels; see anonymizer.DEFAULT_ENTITY_LABELS for list)
[16:22:12] [INFO] 🔍 Running entity detection on 25 records
[16:29:31] [INFO]   |-- 📋 Detection complete — 667 entities found across 25 records (0 failed) [438.9s]
[16:29:31] [INFO]   |-- labels: first_name=155, organization_name=64, occupation=47, city=40, university=36, field_of_study=34, race_ethnicity=30, last_name=27, age=26, state=25, degree=25, political_view=25, religious_belief=24, street_address=23, language=20, place_name=17, county=10, date_of_birth=9, employment_status=9, education_level=7, date=5, company_name=4, date_time=1, landmark=1, country=1, gender=1, postcode=1
[16:29:31] [INFO] ✏️ Running rewrite pipeline
[16:50:53] [INFO] Evaluate-repair loop iteration 0: 8/25 rows need repair
[16:52:53] [INFO] Evaluate-repair loop iteration 1: 1/25 rows need repair
[16:54:14] [INFO]   |-- 📋 Rewrite complete (0 failed) [1482.5s]
[16:54:14] [INFO] 🎉 Pipeline complete — 25 records processed, 0 total failures

Out[8]:

	biography	biography_rewritten	utility_score	leakage_mass	weighted_leakage_rate	any_high_leaked	needs_human_review
0	Bobby Watford, a 40‑year‑old Mexican veterinar...	Ethan Hawthorne, a middle-aged Hispanic veteri...	0.814286	0.0	0.0	False	False
1	Idilio Bell is a 37‑year‑old astronomer living...	Rafael Kumar is in his late 30s, an astronomer...	0.935714	0.9	0.050279	False	False
2	Jodi Allison, 36, lives at 204 Bluegrass in Cl...	Sofia Kelley, in their late 30s, lives at 312 ...	0.936364	0.0	0.0	False	False
3	James Mills is a 69‑year‑old paramedic who liv...	Victor Harper is a paramedic in their late 60s...	0.909091	0.0	0.0	False	False
4	Nancy Burton is a 21‑year‑old cashier who live...	Sofia Hawthorne is in her early twenties and w...	0.888889	0.6	0.055046	False	False

In [9]:

result.dataframe[["biography_rewritten", "utility_score", "leakage_mass", "needs_human_review"]].head()

Out[9]:

	biography_rewritten	utility_score	leakage_mass	needs_human_review
0	Ethan Hawthorne, a middle-aged Hispanic veteri...	0.814286	0.0	False
1	Rafael Kumar is in his late 30s, an astronomer...	0.935714	0.9	False
2	Sofia Kelley, in their late 30s, lives at 312 ...	0.936364	0.0	False
3	Victor Harper is a paramedic in their late 60s...	0.909091	0.0	False
4	Sofia Hawthorne is in her early twenties and w...	0.888889	0.6	False
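
A quick aggregate over the score columns helps judge a run at a glance. This sketch uses only the standard library and the five rows shown above, typed in by hand; with a live run, an equivalent list of dicts could be obtained from the DataFrame via pandas' `to_dict("records")`.

```python
from statistics import mean

# Scores copied from the first five rows of the output above.
rows = [
    {"utility_score": 0.814286, "leakage_mass": 0.0},
    {"utility_score": 0.935714, "leakage_mass": 0.9},
    {"utility_score": 0.936364, "leakage_mass": 0.0},
    {"utility_score": 0.909091, "leakage_mass": 0.0},
    {"utility_score": 0.888889, "leakage_mass": 0.6},
]

mean_utility = mean(r["utility_score"] for r in rows)
# Share of records where the evaluator found no leaked items at all.
zero_leak_share = sum(r["leakage_mass"] == 0 for r in rows) / len(rows)

print(f"mean utility: {mean_utility:.3f}")
print(f"records with zero leakage: {zero_leak_share:.0%}")
```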

In [10]:

result.trace_dataframe.columns.tolist()

Out[10]:

['biography',
 '_anonymizer_record_id',
 '_raw_detected_entities',
 '_seed_entities',
 '_initial_tagged_text',
 '_seed_entities_json',
 '_tag_notation',
 '_merged_tagged_text',
 '_validation_candidates',
 '_validated_entities',
 '_augmented_entities',
 '_merged_entities',
 '_detected_entities',
 'biography_with_spans',
 '_latent_entities',
 'final_entities',
 '_entities_by_value',
 '_entity_examples',
 '_entities_for_replace',
 '_entities_for_replace_json',
 '_replacement_map',
 '_domain',
 '_domain_supplement',
 '_sensitivity_disposition',
 '_sensitivity_disposition_block',
 '_privacy_qa',
 '_rewrite_disposition_block',
 '_meaning_units',
 '_replacement_map_for_prompt',
 '_meaning_units_serialized',
 '_full_rewrite',
 '_quality_qa',
 'biography_rewritten',
 '_repair_iterations',
 '_quality_qa_reanswer',
 '_privacy_qa_reanswer',
 '_quality_qa_compare',
 'utility_score',
 'leakage_mass',
 'weighted_leakage_rate',
 'any_high_leaked',
 '_needs_repair',
 '_leaked_privacy_items',
 '_rewritten_text__next',
 '_judge_evaluation',
 'needs_human_review']
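
To see which trace columns are debug-only, compare the two column lists. The sketch below hard-codes the user-facing columns from Out[8] and an abbreviated subset of the trace columns from Out[10]; with a live run you would presumably pass `result.dataframe.columns` and `result.trace_dataframe.columns` instead.

```python
# Columns of result.dataframe, as shown in Out[8].
user_facing = [
    "biography",
    "biography_rewritten",
    "utility_score",
    "leakage_mass",
    "weighted_leakage_rate",
    "any_high_leaked",
    "needs_human_review",
]

# Abbreviated subset of result.trace_dataframe columns from Out[10].
trace_columns = [
    "biography",
    "_detected_entities",
    "biography_with_spans",
    "_latent_entities",
    "final_entities",
    "_sensitivity_disposition",
    "_full_rewrite",
    "biography_rewritten",
    "_repair_iterations",
    "utility_score",
    "_judge_evaluation",
    "needs_human_review",
]

# Keep trace order; drop anything already exposed in the user-facing frame.
user_facing_set = set(user_facing)
debug_only = [c for c in trace_columns if c not in user_facing_set]
print(debug_only)
```

Note that a leading underscore is not a reliable test on its own: `biography_with_spans` and `final_entities` have no underscore yet appear only in the trace output.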

🚩 Filter by review flag

Records where automated metrics exceed thresholds are flagged for manual review.
Use this to prioritize human attention on the records that need it most.
See Working with flagged records for guidance on diagnosing and resolving flagged records.

In [11]:

df = result.dataframe
flagged = df[df["needs_human_review"] == True]  # noqa: E712
print(f"{len(flagged)} of {len(df)} records flagged for human review")
flagged.head()

0 of 25 records flagged for human review

Out[11]:

	biography	biography_rewritten	utility_score	leakage_mass	weighted_leakage_rate	any_high_leaked	needs_human_review
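
No records are flagged in this run, yet rows 1 and 4 in Out[9] carry nonzero leakage_mass. When the flag comes back empty, one option is to rank records for a manual spot check. The sketch below is one possible ordering (highest leakage first, ties broken by lowest utility), not a documented Anonymizer workflow; the `record` key and the cutoff of two are illustrative choices.

```python
# Scores copied from Out[9]; "record" is the row index shown there.
rows = [
    {"record": 0, "utility_score": 0.814286, "leakage_mass": 0.0},
    {"record": 1, "utility_score": 0.935714, "leakage_mass": 0.9},
    {"record": 2, "utility_score": 0.936364, "leakage_mass": 0.0},
    {"record": 3, "utility_score": 0.909091, "leakage_mass": 0.0},
    {"record": 4, "utility_score": 0.888889, "leakage_mass": 0.6},
]

# Highest leakage first; among equal leakage, lowest utility first.
ranked = sorted(rows, key=lambda r: (-r["leakage_mass"], r["utility_score"]))

top_k = 2  # illustrative spot-check budget
spot_check = [r["record"] for r in ranked[:top_k]]
print(spot_check)
```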

⏭️ Next steps

- ⚖️ Rewriting Legal Documents -- rewrite legal text with custom entity labels and domain-specific privacy goals.
- 🎯 Choosing a Replacement Strategy -- compare Redact, Annotate, Hash, and Substitute if you prefer token-level replacement.
- 🔍 Inspecting Detected Entities -- debug what the detection pipeline found before rewriting.