🕵️ Rewriting Biographies
Instead of replacing entities with tokens, rewrite mode generates a privacy-safe transformation of the entire text. The pipeline:
- Detects entities (same as replace mode, plus latent entity detection)
- Classifies the domain and assigns sensitivity dispositions
- Generates a rewritten version that obscures sensitive entities
- Evaluates quality (utility) and privacy (leakage) with an automated repair loop
- Runs a final LLM judge for informational scores
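The evaluate-and-repair step can be pictured as a bounded retry loop. The sketch below is a toy illustration in plain Python, not Anonymizer's implementation: `Scores`, `evaluate_repair_loop`, the `evaluate` and `repair` callables, and both thresholds are hypothetical stand-ins. Only the general idea (score utility and leakage, repair, retry up to a maximum number of iterations, flag what still fails) is taken from the pipeline description above.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Scores:
    utility: float  # higher is better: how much useful content survived
    leakage: float  # lower is better: how much sensitive content leaked


def evaluate_repair_loop(
    text: str,
    evaluate: Callable[[str], Scores],
    repair: Callable[[str, Scores], str],
    min_utility: float = 0.5,  # hypothetical threshold
    max_leakage: float = 0.0,  # hypothetical threshold
    max_repair_iterations: int = 3,
) -> Tuple[str, Scores, bool]:
    """Return (text, scores, needs_human_review) after bounded repair."""

    def passes(s: Scores) -> bool:
        return s.utility >= min_utility and s.leakage <= max_leakage

    scores = evaluate(text)
    for _ in range(max_repair_iterations):
        if passes(scores):
            break
        text = repair(text, scores)
        scores = evaluate(text)
    # Anything still failing after the last repair is left for a human.
    return text, scores, not passes(scores)


# Toy evaluator and repairer (assumptions, for illustration only).
def toy_evaluate(text: str) -> Scores:
    return Scores(utility=1.0, leakage=1.0 if "Bobby Watford" in text else 0.0)


def toy_repair(text: str, scores: Scores) -> str:
    return text.replace("Bobby Watford", "Ethan Hawkins")


final_text, final_scores, needs_review = evaluate_repair_loop(
    "Bobby Watford is a veterinarian.", toy_evaluate, toy_repair
)
print(final_text, final_scores, needs_review)
```

In the real pipeline the evaluation and repair are model-driven; the point of the sketch is only that the loop is bounded and that records which never pass end up flagged rather than silently accepted.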
📚 What you'll learn
- Configure rewrite mode with PrivacyGoal to specify what to protect and what to preserve
- Set evaluation criteria and risk tolerance for automated quality checks
- Preview rewritten text and inspect utility / leakage scores
- Triage flagged records with needs_human_review
Tip: First time running notebooks? Start with setup instructions.
⚙️ Setup
- Check if your NVIDIA_API_KEY from build.nvidia.com is registered for model access.
  - The default build.nvidia.com (NVIDIA Build) setup is a convenient way to try Anonymizer and iterate on previews. Use of NVIDIA Build is subject to NVIDIA Build's own terms of service and privacy practices, which are separate from and independent of the NeMo Framework library. NVIDIA Build is intended for evaluation and testing purposes only and may not be used in production environments. Do not upload any confidential information or personal data when using NVIDIA Build. Your use of NVIDIA Build is logged for security purposes and to improve NVIDIA products and services.
  - Request and token rate limits on build.nvidia.com vary by account and model access, and lower-volume development access can be slow for full-dataset runs. Start with preview() on a small sample, then move to your own endpoint for production data and usage.
- Import Rewrite and PrivacyGoal.
- Anonymizer() initializes with the default model provider -- no extra config needed.
- configure_logging(LoggingConfig.default()) keeps logs at INFO. Switch to LoggingConfig.debug() when troubleshooting.
In [1]:
import getpass
import os

if not os.getenv("NVIDIA_API_KEY"):
    key = getpass.getpass("Enter NVIDIA_API_KEY from build.nvidia.com: ").strip()
    if not key:
        raise RuntimeError("NVIDIA_API_KEY is required to run these notebooks.")
    os.environ["NVIDIA_API_KEY"] = key
In [ ]:
from anonymizer import (
    Anonymizer,
    AnonymizerConfig,
    AnonymizerInput,
    LoggingConfig,
    PrivacyGoal,
    Rewrite,
    configure_logging,
)

configure_logging(LoggingConfig.default())
In [3]:
anonymizer = Anonymizer()
[16:06:37] [INFO] 🔧 Anonymizer initialized with 3 model configs
[16:06:37] [INFO] |-- 🔎 detector: gliner-pii-detector
[16:06:37] [INFO] |-- ✅ validator: gpt-oss-120b
[16:06:37] [INFO] |-- 🧩 augmenter: gpt-oss-120b
📦 Input data
- Same biographies dataset used in earlier notebooks -- familiar data makes it easy to compare rewrite output against replace output.
In [4]:
input_data = AnonymizerInput(
    source="https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv",
    text_column="biography",
    data_summary="Biographical profiles",
)
🎛️ Configure
- PrivacyGoal spells out what to protect and what to preserve -- this gives the rewriter clear, domain-specific guidance.
- risk_tolerance (default "low") and max_repair_iterations (default 3) control the automated quality gate -- see Risk tolerance for presets.
In [5]:
config = AnonymizerConfig(
    rewrite=Rewrite(
        privacy_goal=PrivacyGoal(
            protect="All direct identifiers and quasi-identifier combinations (names, locations, employers, dates)",
            preserve="Career trajectory, educational background, and professional accomplishments",
        ),
    ),
)
👁️ Preview
- preview() runs on a small sample so you can iterate on privacy goals and evaluation criteria before committing to a full run.
In [6]:
preview = anonymizer.preview(
    config=config,
    data=input_data,
    num_records=3,
)
preview.display_record(0)
[16:06:46] [INFO] 👀 Preview mode: 📂 Loaded 3 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv (column: 'biography')
[16:06:46] [INFO] 🔍 Running entity detection on 3 records
[16:07:58] [INFO] |-- 📋 Detection complete — 78 entities found across 3 records (0 failed) [72.5s]
[16:07:58] [INFO] |-- labels: first_name=22, state=6, organization_name=6, age=5, occupation=5, city=5, political_view=4, last_name=3, race_ethnicity=3, language=3, company_name=3, degree=2, field_of_study=2, education_level=2, street_address=2, place_name=1, date_of_birth=1, project_name=1, employment_status=1, religious_belief=1
[16:07:58] [INFO] ✏️ Running rewrite pipeline
[16:10:14] [INFO] Evaluate-repair loop: all rows pass at iteration 0
[16:10:32] [INFO] |-- 📋 Rewrite complete (0 failed) [154.1s]
[16:10:32] [INFO] 🎉 Pipeline complete — 3 records processed, 0 total failures
In [7]:
preview.display_record(1)
🚀 Full run
- result.dataframe has user-facing columns: rewritten text, scores, and the review flag.
- result.trace_dataframe has every intermediate column for debugging.
In [8]:
result = anonymizer.run(config=config, data=input_data)
result.dataframe.head()
[14:51:45] [INFO] 📂 Loaded 25 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv (column: 'biography')
[14:51:45] [INFO] 🔍 Running entity detection on 25 records
[14:54:03] [INFO] |-- 📋 Detection complete — 645 entities found across 25 records (0 failed) [137.9s]
[14:54:03] [INFO] |-- labels: first_name=152, occupation=47, city=44, company_name=43, organization_name=31, race_ethnicity=30, state=30, education_level=30, last_name=26, age=26, religious_belief=26, political_view=25, street_address=23, university=22, language=21, field_of_study=14, place_name=10, county=10, date_of_birth=9, employment_status=9, degree=7, date=5, project_name=1, full_name=1, country=1, gender=1, postcode=1
[14:54:03] [INFO] ✏️ Running rewrite pipeline
[15:12:06] [INFO] Evaluate-repair loop iteration 0: 7/25 rows need repair
[15:12:53] [INFO] Evaluate-repair loop: all rows pass at iteration 1
[15:13:39] [INFO] |-- 📋 Rewrite complete (0 failed) [1076.7s]
[15:13:39] [INFO] 🎉 Pipeline complete — 25 records processed, 0 total failures
Out[8]:
| | biography | biography_rewritten | utility_score | leakage_mass | weighted_leakage_rate | any_high_leaked | needs_human_review |
|---|---|---|---|---|---|---|---|
| 0 | Bobby Watford, a 40‑year‑old Mexican veterinar... | Ethan Hawkins, a 40‑year‑old Mexican veterinar... | 1.0 | 0.0 | 0.0 | False | False |
| 1 | Idilio Bell is a 37‑year‑old astronomer living... | Rafael Kline is a 37‑year‑old astronomer livin... | 0.808333 | 0.0 | 0.0 | False | False |
| 2 | Jodi Allison, 36, lives at 204 Bluegrass in Cl... | Tara Kendall, 36, lives at a street address in... | 0.877778 | 0.0 | 0.0 | False | False |
| 3 | James Mills is a 69‑year‑old paramedic who liv... | Victor Hawthorne is a 69‑year‑old paramedic wh... | 0.909091 | 0.0 | 0.0 | False | False |
| 4 | Nancy Burton is a 21‑year‑old cashier who live... | Maya Hawthorne is a 21‑year‑old cashier who li... | 0.957692 | 0.0 | 0.0 | False | False |
In [9]:
result.dataframe[["biography_rewritten", "utility_score", "leakage_mass", "needs_human_review"]].head()
Out[9]:
| | biography_rewritten | utility_score | leakage_mass | needs_human_review |
|---|---|---|---|---|
| 0 | Ethan Hawkins, a 40‑year‑old Mexican veterinar... | 1.0 | 0.0 | False |
| 1 | Rafael Kline is a 37‑year‑old astronomer livin... | 0.808333 | 0.0 | False |
| 2 | Tara Kendall, 36, lives at a street address in... | 0.877778 | 0.0 | False |
| 3 | Victor Hawthorne is a 69‑year‑old paramedic wh... | 0.909091 | 0.0 | False |
| 4 | Maya Hawthorne is a 21‑year‑old cashier who li... | 0.957692 | 0.0 | False |
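Beyond eyeballing the first rows, a quick aggregate makes it easier to see how the run did overall. The sketch below uses a small stand-in DataFrame built from the five rows displayed above so that it runs on its own; the `summary` dictionary and its keys are illustrative names, not part of Anonymizer. With a real run you would use `result.dataframe` in place of `df`.

```python
import pandas as pd

# Stand-in for result.dataframe, using the five rows displayed above.
df = pd.DataFrame(
    {
        "utility_score": [1.0, 0.808333, 0.877778, 0.909091, 0.957692],
        "leakage_mass": [0.0, 0.0, 0.0, 0.0, 0.0],
        "needs_human_review": [False, False, False, False, False],
    }
)

# Illustrative summary of the automated scores.
summary = {
    "records": len(df),
    "mean_utility": float(df["utility_score"].mean()),
    "min_utility": float(df["utility_score"].min()),
    "records_with_leakage": int((df["leakage_mass"] > 0).sum()),
    "flagged": int(df["needs_human_review"].sum()),
}
print(summary)
```

The minimum utility is often more informative than the mean: a single low-utility rewrite can hide behind a healthy average.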
In [10]:
result.trace_dataframe.columns.tolist()
Out[10]:
['biography', '_anonymizer_record_id', '_raw_detected_entities', '_seed_entities', '_tag_notation', '_seed_validation_candidates', '_seed_tagged_text', '_validated_entities', '_seed_entities_json', '_initial_tagged_text', '_validated_seed_entities', '_augmented_entities', '_merged_entities', '_merged_tagged_text', '_validation_candidates', '_detected_entities', 'biography_with_spans', '_latent_entities', 'final_entities', '_entities_by_value', '_replacement_map', '_domain', '_domain_supplement', '_domain_supplement_privacy', '_sensitivity_disposition', '_privacy_qa', '_sensitivity_disposition_block', '_rewrite_disposition_block', '_replacement_map_for_prompt', '_full_rewrite', 'biography_rewritten', '_meaning_units', '_meaning_units_serialized', '_quality_qa', '_repair_iterations', '_quality_qa_reanswer', '_quality_qa_compare', '_privacy_qa_reanswer', 'utility_score', 'leakage_mass', 'weighted_leakage_rate', 'any_high_leaked', '_needs_repair', '_leaked_privacy_items', '_rewritten_text__next', 'needs_human_review', '_judge_evaluation']
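To see at a glance which columns exist only in the trace, you can compare the two column lists. The helper below, `trace_only_columns`, is a hypothetical convenience function rather than part of Anonymizer, and the trace list is abbreviated to a handful of names copied from the output above so the example stays short.

```python
def trace_only_columns(trace_columns, user_columns):
    """Return trace columns absent from the user-facing frame, in order."""
    user = set(user_columns)
    return [col for col in trace_columns if col not in user]


# Column names copied from the outputs above; the trace list is abbreviated.
user_columns = [
    "biography",
    "biography_rewritten",
    "utility_score",
    "leakage_mass",
    "weighted_leakage_rate",
    "any_high_leaked",
    "needs_human_review",
]
trace_columns = [
    "biography",
    "_detected_entities",
    "_latent_entities",
    "final_entities",
    "_domain",
    "_sensitivity_disposition",
    "biography_rewritten",
    "_repair_iterations",
    "utility_score",
    "leakage_mass",
    "weighted_leakage_rate",
    "any_high_leaked",
    "needs_human_review",
    "_judge_evaluation",
]

extra = trace_only_columns(trace_columns, user_columns)
print(extra)
```

With a real run, pass `result.trace_dataframe.columns` and `result.dataframe.columns` instead of the hard-coded lists. Note that not every trace-only column starts with an underscore (`final_entities`, for example), so a set difference is more reliable than filtering on the prefix.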
🚩 Filter by review flag
- Records where automated metrics exceed thresholds are flagged for manual review.
- Use this to prioritize human attention on the records that need it most.
- See Working with flagged records for guidance on diagnosing and resolving flagged records.
In [11]:
df = result.dataframe
flagged = df[df["needs_human_review"] == True] # noqa: E712
print(f"{len(flagged)} of {len(df)} records flagged for human review")
flagged.head()
0 of 25 records flagged for human review
Out[11]:
| biography | biography_rewritten | utility_score | leakage_mass | weighted_leakage_rate | any_high_leaked | needs_human_review |
|---|---|---|---|---|---|---|
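When nothing is flagged, it can still be worth spot-checking the weakest rewrites. The sketch below builds a review queue from rows that are either flagged by the pipeline or fall below a utility floor you choose yourself. `review_queue` and `utility_floor` are hypothetical names for illustration, and the floor of 0.85 is an arbitrary example value, not an Anonymizer threshold. The stand-in DataFrame reuses the scores from the five rows shown in the full-run output.

```python
import pandas as pd


def review_queue(df: pd.DataFrame, utility_floor: float) -> pd.DataFrame:
    """Rows flagged by the pipeline or below utility_floor, worst first."""
    mask = df["needs_human_review"] | (df["utility_score"] < utility_floor)
    # Highest leakage first, then lowest utility.
    return df[mask].sort_values(
        ["leakage_mass", "utility_score"], ascending=[False, True]
    )


# Stand-in for result.dataframe, using scores from the full-run output.
df = pd.DataFrame(
    {
        "utility_score": [1.0, 0.808333, 0.877778, 0.909091, 0.957692],
        "leakage_mass": [0.0, 0.0, 0.0, 0.0, 0.0],
        "needs_human_review": [False, False, False, False, False],
    }
)

queue = review_queue(df, utility_floor=0.85)
print(queue)
```

On a real run, call `review_queue(result.dataframe, utility_floor=...)` and hand the resulting rows to reviewers, for example by writing them out with `queue.to_csv(...)`.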
⏭️ Next steps
- ⚖️ Rewriting Legal Documents -- rewrite legal text with custom entity labels and domain-specific privacy goals.
- 🎯 Choosing a Replacement Strategy -- compare Redact, Annotate, Hash, and Substitute if you prefer token-level replacement.
- 🔍 Inspecting Detected Entities -- debug what the detection pipeline found before rewriting.