🕵️ Rewriting Biographies

Instead of replacing entities with tokens, rewrite mode generates a privacy-safe transformation of the entire text. The pipeline:

1. Detects entities (same as replace mode, plus latent entity detection)
2. Classifies the domain and assigns sensitivity dispositions
3. Generates a rewritten version that obscures sensitive entities
4. Evaluates quality (utility) and privacy (leakage) with an automated repair loop
5. Runs a final LLM judge for informational scores
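
To make the contrast with replace mode concrete, the toy sketch below applies both ideas to one shortened sentence from the sample data. It is only an illustration: the entity lists and generalizations are written by hand here, whereas Anonymizer derives them with models, and `apply_map` is a hypothetical helper, not part of the library's API.

```python
# Toy illustration only: the mappings are hand-written assumptions,
# not output from Anonymizer.
text = "Bobby Watford, a veterinarian living in Denver, Colorado."

# Replace-style: swap each detected entity for a label token.
token_map = {"Bobby": "<first_name>", "Watford": "<last_name>", "Denver": "<city>"}

# Rewrite-style: substitute names and generalize the location so the text stays natural.
rewrite_map = {
    "Bobby": "Ethan",
    "Watford": "Hawthorne",
    "Denver, Colorado": "a city in Colorado",
}


def apply_map(source: str, mapping: dict[str, str]) -> str:
    """Apply replacements longest-key-first so multi-word entities are handled first."""
    for entity in sorted(mapping, key=len, reverse=True):
        source = source.replace(entity, mapping[entity])
    return source


replaced = apply_map(text, token_map)
rewritten = apply_map(text, rewrite_map)
print(replaced)
print(rewritten)
```

The token version protects the entities but no longer reads like a biography, while the rewritten version stays fluent. That fluency is why rewrite mode adds the evaluation and repair steps listed above.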

📚 What you'll learn

- Configure rewrite mode with PrivacyGoal to specify what to protect and what to preserve
- Set evaluation criteria and risk tolerance for automated quality checks
- Preview rewritten text and inspect utility / leakage scores
- Triage flagged records with needs_human_review

Tip: First time running notebooks? Start with setup instructions.

⚙️ Setup

- Check if your NVIDIA_API_KEY from build.nvidia.com is registered for model access.
  - The default build.nvidia.com (NVIDIA Build) setup is a convenient way to try Anonymizer and iterate on previews. Use of NVIDIA Build is subject to NVIDIA Build's own terms of service and privacy practices, which are separate from and independent of the NeMo Framework library. NVIDIA Build is intended for evaluation and testing purposes only and may not be used in production environments. Do not upload any confidential information or personal data when using NVIDIA Build. Your use of NVIDIA Build is logged for security purposes and to improve NVIDIA products and services.
  - Request and token rate limits on build.nvidia.com vary by account and model access, and lower-volume development access can be slow for full-dataset runs. Start with preview() on a small sample, then move to your own endpoint for production data and usage.
- Import Rewrite and PrivacyGoal.
- Anonymizer() initializes with the default model provider -- no extra config needed.
- configure_logging(LoggingConfig.default()) keeps logs at INFO. Switch to LoggingConfig.debug() when troubleshooting.

In [1]:

import getpass
import os

if not os.getenv("NVIDIA_API_KEY"):
    key = getpass.getpass("Enter NVIDIA_API_KEY from build.nvidia.com: ").strip()
    if not key:
        raise RuntimeError("NVIDIA_API_KEY is required to run these notebooks.")
    os.environ["NVIDIA_API_KEY"] = key

In [ ]:

from anonymizer import (
    Anonymizer,
    AnonymizerConfig,
    AnonymizerInput,
    LoggingConfig,
    PrivacyGoal,
    Rewrite,
    configure_logging,
)

configure_logging(LoggingConfig.default())

In [3]:

anonymizer = Anonymizer()

[16:06:37] [INFO] 🔧 Anonymizer initialized with 3 model configs
[16:06:37] [INFO]   |-- 🔎 detector:  gliner-pii-detector
[16:06:37] [INFO]   |-- ✅ validator: gpt-oss-120b
[16:06:37] [INFO]   |-- 🧩 augmenter: gpt-oss-120b

📦 Input data

This notebook uses the same biographies dataset as the earlier notebooks, so the rewrite output is easy to compare against replace output.

In [4]:

input_data = AnonymizerInput(
    source="https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv",
    text_column="biography",
    data_summary="Biographical profiles",
)

🎛️ Configure

PrivacyGoal spells out what to protect and what to preserve -- this gives the rewriter clear, domain-specific guidance.
risk_tolerance (default "low") and max_repair_iterations (default 3) control the automated quality gate -- see Risk tolerance for presets.

In [5]:

config = AnonymizerConfig(
    rewrite=Rewrite(
        privacy_goal=PrivacyGoal(
            protect="All direct identifiers and quasi-identifier combinations (names, locations, employers, dates)",
            preserve="Career trajectory, educational background, and professional accomplishments",
        ),
    ),
)
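
The quality gate described above can be pictured as a bounded evaluate-repair loop. The sketch below is a hypothetical illustration in plain Python, not Anonymizer's implementation: `evaluate`, `repair`, the `Scores` fields, and the threshold values are all made up for the example. Only the idea of capping repairs at `max_repair_iterations` (default 3) and flagging what still fails comes from this notebook.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Scores:
    utility: float  # higher is better
    leakage: float  # lower is better


def passes(scores: Scores, min_utility: float, max_leakage: float) -> bool:
    return scores.utility >= min_utility and scores.leakage <= max_leakage


def evaluate_repair(
    draft: str,
    evaluate: Callable[[str], Scores],
    repair: Callable[[str, Scores], str],
    max_repair_iterations: int = 3,
    min_utility: float = 0.5,  # hypothetical threshold
    max_leakage: float = 0.0,  # hypothetical threshold
) -> tuple[str, bool]:
    """Return (final_text, needs_human_review)."""
    scores = evaluate(draft)
    for _ in range(max_repair_iterations):
        if passes(scores, min_utility, max_leakage):
            break
        draft = repair(draft, scores)
        scores = evaluate(draft)
    # Anything still failing after the repair budget is flagged for a person.
    return draft, not passes(scores, min_utility, max_leakage)


# Toy stand-ins: "leakage" here just means the original name still appears.
def toy_evaluate(text: str) -> Scores:
    return Scores(utility=0.9, leakage=1.0 if "Bobby" in text else 0.0)


def toy_repair(text: str, scores: Scores) -> str:
    return text.replace("Bobby", "Ethan", 1)  # fixes one mention per pass


final_text, needs_review = evaluate_repair("Bobby met Bobby.", toy_evaluate, toy_repair)
print(final_text, needs_review)
```

In this toy setup, lowering `max_repair_iterations` to 1 leaves one mention unrepaired and the record comes back flagged. That is the trade-off the setting controls: fewer model calls versus more records left for manual review.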

👁️ Preview

preview() runs on a small sample so you can iterate on privacy goals and evaluation criteria before committing to a full run.

In [6]:

preview = anonymizer.preview(
    config=config,
    data=input_data,
    num_records=3,
)

preview.display_record(0)

[16:06:46] [INFO] 👀 Preview mode: 📂 Loaded 3 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv (column: 'biography')
[16:06:46] [INFO] 🔍 Running entity detection on 3 records
[16:07:58] [INFO]   |-- 📋 Detection complete — 78 entities found across 3 records (0 failed) [72.5s]
[16:07:58] [INFO]   |-- labels: first_name=22, state=6, organization_name=6, age=5, occupation=5, city=5, political_view=4, last_name=3, race_ethnicity=3, language=3, company_name=3, degree=2, field_of_study=2, education_level=2, street_address=2, place_name=1, date_of_birth=1, project_name=1, employment_status=1, religious_belief=1
[16:07:58] [INFO] ✏️ Running rewrite pipeline
[16:10:14] [INFO] Evaluate-repair loop: all rows pass at iteration 0
[16:10:32] [INFO]   |-- 📋 Rewrite complete (0 failed) [154.1s]
[16:10:32] [INFO] 🎉 Pipeline complete — 3 records processed, 0 total failures

Anonymizer Rewrite Preview (record 0)

Original

Bobby| first_name Watford| last_name, a 40| age‑year‑old Mexican| race_ethnicity veterinarian| occupation living in Denver| city, Colorado| state, grew up on the outskirts of the city and developed a love for animals early on. After graduating from Jefferson High| organization_name, he earned his DVM| degree at the University of Colorado Boulder| organization_name, where he also completed a research stint in wildlife health| field_of_study. Fluent in English| language, Bobby| first_name has always described his upbringing as a blend of small‑town curiosity and the vibrant culture of his community, values that continue to shape his compassionate approach to animal care.

Since finishing his training, Bobby| first_name has worked at VCA Animal Hospital| company_name and later at the Colorado Veterinary Clinic| organization_name, where he now leads a busy mixed‑practice team. He identifies as a Christian Democrat| political_view and often volunteers at local shelters, a habit encouraged by his wife, Maya| first_name, and their two teenage children, Aria and Leo| first_name. Outside the clinic, Bobby| first_name enjoys hiking the Rockies| place_name with his family and mentoring veterinary students from his alma mater.

Rewritten

Ethan Hawthorne, a 40‑year‑old Latinx veterinarian living in a city in Colorado, grew up on the outskirts of the city and developed a love for animals early on. After graduating from Jefferson High, he earned his DVM at a state university, where he also completed a research stint in wildlife health. Fluent in English, Ethan has always described his upbringing as a blend of small‑town curiosity and the vibrant culture of his community, values that continue to shape his compassionate approach to animal care.

Since finishing his training, Ethan has worked at a veterinary practice and later at a regional veterinary clinic, where he now leads a busy mixed‑practice team. He identifies as having a moderate political view and often volunteers at local shelters, a habit encouraged by his wife, Leah, and their two teenage children, Sofia and Mateo. Outside the clinic, Ethan enjoys hiking the Rockies with his family and mentoring veterinary students from his alma mater.

Scores

Utility: 0.96 | Leakage: 0.00 | Weighted Leakage Rate: 0.00 | Needs Review: No | Judge: privacy: 8/10, quality: 9/10, naturalness: 9/10

Entity Disposition

Entity	Label	Sensitivity	Protection
Bobby	first_name	high	replace
Watford	last_name	high	replace
Maya	first_name	high	replace
Aria and Leo	first_name	high	replace
Mexican	race_ethnicity	high	generalize
Denver	city	medium	generalize
University of Colorado Boulder	organization_name	medium	generalize
Colorado Veterinary Clinic	organization_name	medium	generalize
VCA Animal Hospital	company_name	medium	generalize
Christian Democrat	political_view	medium	generalize
40	age	medium	leave_as_is
Jefferson High	organization_name	medium	leave_as_is
DVM	degree	medium	leave_as_is
wildlife health	field_of_study	low	leave_as_is
English	language	low	leave_as_is
Rockies	place_name	low	leave_as_is
veterinarian	occupation	low	leave_as_is
Colorado	state	low	leave_as_is
married	marital_status	high	leave_as_is
doctoral	education_level	low	leave_as_is
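
One way to read a disposition table like the one above is to look for entities rated high sensitivity that were nonetheless left unchanged. The sketch below does that in plain Python over a few rows copied by hand from the table. How to pull the same rows out of the preview object programmatically is not covered here, so the tuple layout is an assumption for illustration.

```python
from collections import Counter

# (entity, label, sensitivity, protection): a subset of rows copied from the table above.
dispositions = [
    ("Bobby", "first_name", "high", "replace"),
    ("Watford", "last_name", "high", "replace"),
    ("Mexican", "race_ethnicity", "high", "generalize"),
    ("Denver", "city", "medium", "generalize"),
    ("40", "age", "medium", "leave_as_is"),
    ("Colorado", "state", "low", "leave_as_is"),
    ("married", "marital_status", "high", "leave_as_is"),
]

# How many entities received each protection method.
protection_counts = Counter(protection for *_, protection in dispositions)

# High-sensitivity entities that were left untouched deserve a second look.
high_left_as_is = [
    entity
    for entity, _label, sensitivity, protection in dispositions
    if sensitivity == "high" and protection == "leave_as_is"
]

print(protection_counts)
print(high_left_as_is)
```

For this record the only hit is `married`, which never appears verbatim in the original text and looks like an attribute inferred from the mention of his wife. The rewritten text still mentions a wife, so whether that is acceptable depends on the privacy goal.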

In [7]:

preview.display_record(1)

Anonymizer Rewrite Preview (record 1)

Original

Jodi| first_name Allison| last_name, 36| age, lives at 204 Bluegrass| street_address in Clayton| city, North Carolina| state. A Caucasian| race_ethnicity editor| occupation with a lifelong love of words, she earned her BA| education_level in English| language from the University of North Carolina| state at Chapel Hill and cut her teeth on the newsroom of the Raleigh| city Times. After a stint as copy chief| occupation at Venture Media| company_name, Jodi| first_name moved to Southern Publishing| company_name, where she now leads the feature‑section team. She describes herself as a moderate Democrat| political_view and a Methodist| religious_belief who finds comfort in the rhythm of Sunday worship.  

Outside the office Jodi| first_name shares a busy home with her husband, Alex| first_name, a school counselor| occupation, and their two children, Ethan| first_name, 7| age, and Maya| first_name, 4| age. The family often spends weekends gardening in the yard behind their house or volunteering at the local library’s reading program, a tradition Jodi| first_name started as a teenager. Her friends say she balances deadlines with devotion to community, always keeping a notebook handy for the next story that matters.

Rewritten

Leah Bennett, in her mid‑30s, lives at an address in a small town. She is a media professional with a lifelong love of words. She earned a BA in English from a state university and began her career at a local newspaper. After serving as a senior copy manager at a media firm, she transitioned to a publishing organization where she now oversees a major department. She describes herself as a moderate Democrat and follows a faith tradition that includes weekly gatherings.

Outside work, Leah shares a busy home with her spouse, who works in education, and their children. Weekends are often spent tending to a garden or volunteering at a community library program, a habit she started as a teenager. Friends say she balances professional deadlines with community involvement, always keeping a notebook handy for the next story that matters.

Scores

Utility: 0.84 | Leakage: 0.00 | Weighted Leakage Rate: 0.00 | Needs Review: No | Judge: privacy: 10/10, quality: 9/10, naturalness: 9/10

Entity Disposition

Entity	Label	Sensitivity	Protection
204 Bluegrass	street_address	high	replace
Jodi	first_name	high	replace
Allison	last_name	high	replace
36	age	high	generalize
4	age	high	remove
7	age	high	remove
Alex	first_name	high	replace
BA	education_level	low	leave_as_is
Caucasian	race_ethnicity	low	leave_as_is
Clayton	city	high	generalize
English	language	low	leave_as_is
Ethan	first_name	high	remove
Maya	first_name	high	remove
Methodist	religious_belief	low	leave_as_is
North Carolina	state	high	generalize
Raleigh	city	high	generalize
Southern Publishing	company_name	high	generalize
Venture Media	company_name	high	generalize
copy chief	occupation	high	generalize
editor	occupation	high	generalize
moderate Democrat	political_view	low	leave_as_is
school counselor	occupation	low	leave_as_is
married	marital_status	medium	leave_as_is
2	num_children	medium	leave_as_is
feature_section_lead	position	medium	leave_as_is
publishing	sector	medium	leave_as_is
30s	age_bracket	medium	leave_as_is
protestant	religious_affiliation	medium	leave_as_is
democrat	political_orientation	medium	leave_as_is
suburban	home_neighborhood_type	medium	leave_as_is

🚀 Full run

result.dataframe has user-facing columns: rewritten text, scores, and the review flag.
result.trace_dataframe has every intermediate column for debugging.

In [8]:

result = anonymizer.run(config=config, data=input_data)

result.dataframe.head()

[14:51:45] [INFO] 📂 Loaded 25 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv (column: 'biography')
[14:51:45] [INFO] 🔍 Running entity detection on 25 records

[14:54:03] [INFO]   |-- 📋 Detection complete — 645 entities found across 25 records (0 failed) [137.9s]
[14:54:03] [INFO]   |-- labels: first_name=152, occupation=47, city=44, company_name=43, organization_name=31, race_ethnicity=30, state=30, education_level=30, last_name=26, age=26, religious_belief=26, political_view=25, street_address=23, university=22, language=21, field_of_study=14, place_name=10, county=10, date_of_birth=9, employment_status=9, degree=7, date=5, project_name=1, full_name=1, country=1, gender=1, postcode=1
[14:54:03] [INFO] ✏️ Running rewrite pipeline
[15:12:06] [INFO] Evaluate-repair loop iteration 0: 7/25 rows need repair
[15:12:53] [INFO] Evaluate-repair loop: all rows pass at iteration 1
[15:13:39] [INFO]   |-- 📋 Rewrite complete (0 failed) [1076.7s]
[15:13:39] [INFO] 🎉 Pipeline complete — 25 records processed, 0 total failures

Out[8]:

	biography	biography_rewritten	utility_score	any_high_leaked	needs_human_review
0	Bobby Watford, a 40‑year‑old Mexican veterinar...	Ethan Hawkins, a 40‑year‑old Mexican veterinar...	1.0	False	False
1	Idilio Bell is a 37‑year‑old astronomer living...	Rafael Kline is a 37‑year‑old astronomer livin...	0.808333	False	False
2	Jodi Allison, 36, lives at 204 Bluegrass in Cl...	Tara Kendall, 36, lives at a street address in...	0.877778	False	False
3	James Mills is a 69‑year‑old paramedic who liv...	Victor Hawthorne is a 69‑year‑old paramedic wh...	0.909091	False	False
4	Nancy Burton is a 21‑year‑old cashier who live...	Maya Hawthorne is a 21‑year‑old cashier who li...	0.957692	False	False
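
The log timings above give a rough sense of cost per record, which helps when deciding whether the default endpoint is fast enough for a larger dataset. The arithmetic below uses only the two durations printed in the full-run log. The 1,000-record figure is a hypothetical target and assumes time scales linearly with record count, which may not hold under rate limits or concurrency.

```python
# Durations copied from the full-run log above (seconds).
detection_s = 137.9
rewrite_s = 1076.7
records = 25

per_record_s = (detection_s + rewrite_s) / records

# Hypothetical larger dataset, assuming linear scaling (an assumption, not a measurement).
target_records = 1000
estimated_hours = per_record_s * target_records / 3600

print(f"{per_record_s:.1f} s per record")
print(f"about {estimated_hours:.1f} h for {target_records} records")
```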

In [9]:

result.dataframe[["biography_rewritten", "utility_score", "leakage_mass", "needs_human_review"]].head()

Out[9]:

	biography_rewritten	utility_score	needs_human_review
0	Ethan Hawkins, a 40‑year‑old Mexican veterinar...	1.0	False
1	Rafael Kline is a 37‑year‑old astronomer livin...	0.808333	False
2	Tara Kendall, 36, lives at a street address in...	0.877778	False
3	Victor Hawthorne is a 69‑year‑old paramedic wh...	0.909091	False
4	Maya Hawthorne is a 21‑year‑old cashier who li...	0.957692	False

In [10]:

result.trace_dataframe.columns.tolist()

Out[10]:

['biography',
 '_anonymizer_record_id',
 '_raw_detected_entities',
 '_seed_entities',
 '_tag_notation',
 '_seed_validation_candidates',
 '_seed_tagged_text',
 '_validated_entities',
 '_seed_entities_json',
 '_initial_tagged_text',
 '_validated_seed_entities',
 '_augmented_entities',
 '_merged_entities',
 '_merged_tagged_text',
 '_validation_candidates',
 '_detected_entities',
 'biography_with_spans',
 '_latent_entities',
 'final_entities',
 '_entities_by_value',
 '_replacement_map',
 '_domain',
 '_domain_supplement',
 '_domain_supplement_privacy',
 '_sensitivity_disposition',
 '_privacy_qa',
 '_sensitivity_disposition_block',
 '_rewrite_disposition_block',
 '_replacement_map_for_prompt',
 '_full_rewrite',
 'biography_rewritten',
 '_meaning_units',
 '_meaning_units_serialized',
 '_quality_qa',
 '_repair_iterations',
 '_quality_qa_reanswer',
 '_quality_qa_compare',
 '_privacy_qa_reanswer',
 'utility_score',
 'leakage_mass',
 'weighted_leakage_rate',
 'any_high_leaked',
 '_needs_repair',
 '_leaked_privacy_items',
 '_rewritten_text__next',
 'needs_human_review',
 '_judge_evaluation']

🚩 Filter by review flag

Records where automated metrics exceed thresholds are flagged for manual review.
Use this to prioritize human attention on the records that need it most.
See Working with flagged records for guidance on diagnosing and resolving flagged records.

In [11]:

df = result.dataframe
flagged = df[df["needs_human_review"] == True]  # noqa: E712
print(f"{len(flagged)} of {len(df)} records flagged for human review")
flagged.head()

0 of 25 records flagged for human review

Out[11]:

	biography	biography_rewritten	utility_score	leakage_mass	weighted_leakage_rate	any_high_leaked	needs_human_review
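
When no records are flagged, as in this run, it can still be worth spot-checking the rewrites with the lowest utility. The sketch below uses the five `utility_score` values from the `head()` output earlier, held in plain Python dicts so it runs without the pipeline. The `spot_check` helper and its `n` parameter are hypothetical. On the real DataFrame, pandas' `df.nsmallest(n, "utility_score")` gives the same ordering.

```python
# Rows mirror the first five records of the full-run output shown earlier.
rows = [
    {"index": 0, "utility_score": 1.0, "needs_human_review": False},
    {"index": 1, "utility_score": 0.808333, "needs_human_review": False},
    {"index": 2, "utility_score": 0.877778, "needs_human_review": False},
    {"index": 3, "utility_score": 0.909091, "needs_human_review": False},
    {"index": 4, "utility_score": 0.957692, "needs_human_review": False},
]


def spot_check(records, n=2):
    """Return the n records with the lowest utility score (hypothetical helper)."""
    return sorted(records, key=lambda r: r["utility_score"])[:n]


flagged_rows = [r for r in rows if r["needs_human_review"]]
lowest = spot_check(rows)

print(len(flagged_rows), "flagged")
print([r["index"] for r in lowest])
```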

⏭️ Next steps

- ⚖️ Rewriting Legal Documents -- rewrite legal text with custom entity labels and domain-specific privacy goals.
- 🎯 Choosing a Replacement Strategy -- compare Redact, Annotate, Hash, and Substitute if you prefer token-level replacement.
- 🔍 Inspecting Detected Entities -- debug what the detection pipeline found before rewriting.