🕵️ Rewriting Biographies
Instead of replacing entities with tokens, rewrite mode generates a privacy-safe transformation of the entire text. The pipeline:
- Detects entities (same as replace mode, plus latent entity detection)
- Classifies the domain and assigns sensitivity dispositions
- Generates a rewritten version that obscures sensitive entities
- Evaluates quality (utility) and privacy (leakage) with an automated repair loop
- Runs a final LLM judge for informational scores
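The evaluate-and-repair step above can be pictured as a bounded loop. The following is a minimal conceptual sketch, not the library's implementation: `Evaluation`, `evaluate_repair_loop`, the toy callbacks, and the `min_utility` / `max_leakage` thresholds are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Evaluation:
    utility: float  # higher means more of the original meaning survived
    leakage: float  # higher means more sensitive content leaked through


def evaluate_repair_loop(
    text: str,
    evaluate: Callable[[str], Evaluation],
    repair: Callable[[str, Evaluation], str],
    max_iterations: int = 3,   # assumed cap on repair attempts
    min_utility: float = 0.8,  # hypothetical threshold
    max_leakage: float = 0.0,  # hypothetical threshold
) -> Tuple[str, Evaluation, int, bool]:
    """Evaluate a rewrite, repair it while it fails, and flag it if it still fails."""

    def fails(ev: Evaluation) -> bool:
        return ev.utility < min_utility or ev.leakage > max_leakage

    iterations = 0
    evaluation = evaluate(text)
    while fails(evaluation) and iterations < max_iterations:
        text = repair(text, evaluation)
        evaluation = evaluate(text)
        iterations += 1
    # Anything still failing after the last attempt goes to a human.
    return text, evaluation, iterations, fails(evaluation)


# Toy callbacks: "leakage" counts a leaked name, "repair" removes one occurrence.
def toy_evaluate(text: str) -> Evaluation:
    return Evaluation(utility=1.0, leakage=float(text.count("Bobby")))


def toy_repair(text: str, _evaluation: Evaluation) -> str:
    return text.replace("Bobby", "the subject", 1)


final_text, final_eval, used_iterations, needs_review = evaluate_repair_loop(
    "Bobby met Bobby.", toy_evaluate, toy_repair
)
print(final_text, used_iterations, needs_review)
```

In this toy run the text needs two repairs before it passes, so nothing is flagged. Lowering `max_iterations` to 1 would leave one leak in place and set the review flag instead.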
📚 What you'll learn
- Configure rewrite mode with `PrivacyGoal` to specify what to protect and what to preserve
- Set evaluation criteria and risk tolerance for automated quality checks
- Preview rewritten text and inspect utility / leakage scores
- Triage flagged records with `needs_human_review`
Tip: First time running notebooks? Start with setup instructions.
⚙️ Setup
- Check if your `NVIDIA_API_KEY` from build.nvidia.com is registered for model access.
  - Treat the default `build.nvidia.com` setup as a convenient experimentation path. For privacy-sensitive or production data, switch to a secure endpoint you trust and to which you are comfortable sending data.
  - Request/token rate limits on `build.nvidia.com` vary by account and model access, and lower-volume development access can be slow for full runs. Start with `preview()` on a small sample.
- Import `Rewrite` and `PrivacyGoal`.
- `Anonymizer()` initializes with the default model provider -- no extra config needed.
- `Anonymizer.configure_logging()` controls verbosity -- switch to `Anonymizer.configure_logging(LoggingConfig.debug())` when troubleshooting.
In [1]:
import getpass
import os

if not os.getenv("NVIDIA_API_KEY"):
    key = getpass.getpass("Enter NVIDIA_API_KEY from build.nvidia.com: ").strip()
    if not key:
        raise RuntimeError("NVIDIA_API_KEY is required to run these notebooks.")
    os.environ["NVIDIA_API_KEY"] = key
In [2]:
from anonymizer import Anonymizer, AnonymizerConfig, AnonymizerInput, Rewrite, configure_logging, LoggingConfig
from anonymizer.config.rewrite import PrivacyGoal
configure_logging(LoggingConfig.default())
In [3]:
anonymizer = Anonymizer()
[16:15:48] [INFO] 🔧 Anonymizer initialized with 3 model configs
[16:15:48] [INFO] |-- 🔎 detector: gliner-pii-detector
[16:15:48] [INFO] |-- ✅ validator: gpt-oss-120b
[16:15:48] [INFO] |-- 🧩 augmenter: gpt-oss-120b
📦 Input data
- Same biographies dataset used in earlier notebooks -- familiar data makes it easy to compare rewrite output against replace output.
In [4]:
input_data = AnonymizerInput(
    source="https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv",
    text_column="biography",
    data_summary="Biographical profiles",
)
🎛️ Configure
- `PrivacyGoal` spells out what to protect and what to preserve -- this gives the rewriter clear, domain-specific guidance.
- `risk_tolerance` (default `"low"`) and `max_repair_iterations` (default `3`) control the automated quality gate -- see Risk tolerance for presets.
In [5]:
config = AnonymizerConfig(
    rewrite=Rewrite(
        privacy_goal=PrivacyGoal(
            protect="All direct identifiers and quasi-identifier combinations (names, locations, employers, dates)",
            preserve="Career trajectory, educational background, and professional accomplishments",
        ),
    ),
)
👁️ Preview
- `preview()` runs on a small sample so you can iterate on privacy goals and evaluation criteria before committing to a full run.
In [6]:
preview = anonymizer.preview(
    config=config,
    data=input_data,
    num_records=3,
)
preview.display_record(0)
[16:16:07] [INFO] 📂 Loaded 25 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv (column: 'biography')
[16:16:07] [INFO] detection labels in scope: (default: 65 labels; see anonymizer.DEFAULT_ENTITY_LABELS for list)
[16:16:07] [INFO] |-- 👀 Preview mode: processing 3 of 25 records
[16:16:07] [INFO] 🔍 Running entity detection on 3 records
[16:17:53] [INFO] |-- 📋 Detection complete — 80 entities found across 3 records (0 failed) [105.8s]
[16:17:53] [INFO] |-- labels: first_name=23, organization_name=9, age=5, occupation=5, city=4, state=4, degree=4, university=4, field_of_study=4, last_name=3, race_ethnicity=3, political_view=3, language=2, religious_belief=2, street_address=2, place_name=1, date_of_birth=1, employment_status=1
[16:17:53] [INFO] ✏️ Running rewrite pipeline
[16:21:15] [INFO] Evaluate-repair loop: all rows pass at iteration 0
[16:21:41] [INFO] |-- 📋 Rewrite complete (0 failed) [228.2s]
[16:21:41] [INFO] 🎉 Pipeline complete — 3 records processed, 0 total failures
In [7]:
preview.display_record(1)
🚀 Full run
- `result.dataframe` has user-facing columns: rewritten text, scores, and the review flag.
- `result.trace_dataframe` has every intermediate column for debugging.
In [8]:
result = anonymizer.run(config=config, data=input_data)
result.dataframe.head()
[16:22:12] [INFO] 📂 Loaded 25 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv (column: 'biography')
[16:22:12] [INFO] detection labels in scope: (default: 65 labels; see anonymizer.DEFAULT_ENTITY_LABELS for list)
[16:22:12] [INFO] 🔍 Running entity detection on 25 records
[16:29:31] [INFO] |-- 📋 Detection complete — 667 entities found across 25 records (0 failed) [438.9s]
[16:29:31] [INFO] |-- labels: first_name=155, organization_name=64, occupation=47, city=40, university=36, field_of_study=34, race_ethnicity=30, last_name=27, age=26, state=25, degree=25, political_view=25, religious_belief=24, street_address=23, language=20, place_name=17, county=10, date_of_birth=9, employment_status=9, education_level=7, date=5, company_name=4, date_time=1, landmark=1, country=1, gender=1, postcode=1
[16:29:31] [INFO] ✏️ Running rewrite pipeline
[16:50:53] [INFO] Evaluate-repair loop iteration 0: 8/25 rows need repair
[16:52:53] [INFO] Evaluate-repair loop iteration 1: 1/25 rows need repair
[16:54:14] [INFO] |-- 📋 Rewrite complete (0 failed) [1482.5s]
[16:54:14] [INFO] 🎉 Pipeline complete — 25 records processed, 0 total failures
Out[8]:
| | biography | biography_rewritten | utility_score | leakage_mass | weighted_leakage_rate | any_high_leaked | needs_human_review |
|---|---|---|---|---|---|---|---|
| 0 | Bobby Watford, a 40‑year‑old Mexican veterinar... | Ethan Hawthorne, a middle-aged Hispanic veteri... | 0.814286 | 0.0 | 0.0 | False | False |
| 1 | Idilio Bell is a 37‑year‑old astronomer living... | Rafael Kumar is in his late 30s, an astronomer... | 0.935714 | 0.9 | 0.050279 | False | False |
| 2 | Jodi Allison, 36, lives at 204 Bluegrass in Cl... | Sofia Kelley, in their late 30s, lives at 312 ... | 0.936364 | 0.0 | 0.0 | False | False |
| 3 | James Mills is a 69‑year‑old paramedic who liv... | Victor Harper is a paramedic in their late 60s... | 0.909091 | 0.0 | 0.0 | False | False |
| 4 | Nancy Burton is a 21‑year‑old cashier who live... | Sofia Hawthorne is in her early twenties and w... | 0.888889 | 0.6 | 0.055046 | False | False |
In [9]:
result.dataframe[["biography_rewritten", "utility_score", "leakage_mass", "needs_human_review"]].head()
Out[9]:
| | biography_rewritten | utility_score | leakage_mass | needs_human_review |
|---|---|---|---|---|
| 0 | Ethan Hawthorne, a middle-aged Hispanic veteri... | 0.814286 | 0.0 | False |
| 1 | Rafael Kumar is in his late 30s, an astronomer... | 0.935714 | 0.9 | False |
| 2 | Sofia Kelley, in their late 30s, lives at 312 ... | 0.936364 | 0.0 | False |
| 3 | Victor Harper is a paramedic in their late 60s... | 0.909091 | 0.0 | False |
| 4 | Sofia Hawthorne is in her early twenties and w... | 0.888889 | 0.6 | False |
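Because the scores come back as ordinary DataFrame columns, standard pandas operations are enough to rank records by risk. The sketch below rebuilds the five score rows shown above by hand so that it runs without the pipeline; `sample`, `riskiest`, `leaky`, and `lowest_utility_row` are illustrative names, not part of the library.

```python
import pandas as pd

# Scores copied from the first five rows of the output above.
sample = pd.DataFrame(
    {
        "utility_score": [0.814286, 0.935714, 0.936364, 0.909091, 0.888889],
        "leakage_mass": [0.0, 0.9, 0.0, 0.0, 0.6],
        "needs_human_review": [False, False, False, False, False],
    }
)

# Highest leakage first, so the riskiest rewrites are read first.
riskiest = sample.sort_values("leakage_mass", ascending=False)

# Rows with any measured leakage, even though none were flagged.
leaky = sample[sample["leakage_mass"] > 0]

# The rewrite that lost the most utility.
lowest_utility_row = sample["utility_score"].idxmin()

print(riskiest.head(2))
print(f"{len(leaky)} of {len(sample)} rows have non-zero leakage_mass")
print(f"lowest utility_score is in row {lowest_utility_row}")
```

The same expressions should work on `result.dataframe` directly, assuming it carries the column names shown in the output above.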
In [10]:
result.trace_dataframe.columns.tolist()
Out[10]:
['biography', '_anonymizer_record_id', '_raw_detected_entities', '_seed_entities', '_initial_tagged_text', '_seed_entities_json', '_tag_notation', '_merged_tagged_text', '_validation_candidates', '_validated_entities', '_augmented_entities', '_merged_entities', '_detected_entities', 'biography_with_spans', '_latent_entities', 'final_entities', '_entities_by_value', '_entity_examples', '_entities_for_replace', '_entities_for_replace_json', '_replacement_map', '_domain', '_domain_supplement', '_sensitivity_disposition', '_sensitivity_disposition_block', '_privacy_qa', '_rewrite_disposition_block', '_meaning_units', '_replacement_map_for_prompt', '_meaning_units_serialized', '_full_rewrite', '_quality_qa', 'biography_rewritten', '_repair_iterations', '_quality_qa_reanswer', '_privacy_qa_reanswer', '_quality_qa_compare', 'utility_score', 'leakage_mass', 'weighted_leakage_rate', 'any_high_leaked', '_needs_repair', '_leaked_privacy_items', '_rewritten_text__next', '_judge_evaluation', 'needs_human_review']
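In the listing above, most intermediate columns start with an underscore. That is an observation from this output, not a documented guarantee, but it gives a quick way to separate pipeline internals from the rest when browsing the trace. A minimal sketch using a subset of the column names printed above:

```python
# A subset of the trace columns printed above (not the full list).
trace_columns = [
    "biography", "_anonymizer_record_id", "_raw_detected_entities",
    "_latent_entities", "final_entities", "_domain",
    "_sensitivity_disposition", "_full_rewrite", "biography_rewritten",
    "_repair_iterations", "utility_score", "leakage_mass",
    "_judge_evaluation", "needs_human_review",
]

# Assumption: a leading underscore marks an intermediate pipeline column.
internal = [name for name in trace_columns if name.startswith("_")]
public = [name for name in trace_columns if not name.startswith("_")]

print("internal:", internal)
print("public:", public)
```

Applied to a real run, the same comprehension over `result.trace_dataframe.columns` would give a column list to pass to `result.trace_dataframe[...]` when debugging one stage at a time.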
🚩 Filter by review flag
- Records where automated metrics exceed thresholds are flagged for manual review.
- Use this to prioritize human attention on the records that need it most.
- See Working with flagged records for guidance on diagnosing and resolving flagged records.
In [11]:
df = result.dataframe
flagged = df[df["needs_human_review"] == True] # noqa: E712
print(f"{len(flagged)} of {len(df)} records flagged for human review")
flagged.head()
0 of 25 records flagged for human review
Out[11]:
| | biography | biography_rewritten | utility_score | leakage_mass | weighted_leakage_rate | any_high_leaked | needs_human_review |
|---|---|---|---|---|---|---|---|
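When no records are flagged, as in this run, it can still be worth spot-checking borderline rows. The sketch below combines the review flag with two stricter cut-offs. `spot_check` and its `max_leakage_mass` / `min_utility` defaults are hypothetical values picked for illustration; they are not the thresholds the library applies.

```python
import pandas as pd


def spot_check(
    df: pd.DataFrame,
    max_leakage_mass: float = 0.5,  # hypothetical cut-off
    min_utility: float = 0.85,      # hypothetical cut-off
) -> pd.DataFrame:
    """Return flagged rows plus rows that breach the stricter cut-offs."""
    mask = (
        df["needs_human_review"]
        | (df["leakage_mass"] > max_leakage_mass)
        | (df["utility_score"] < min_utility)
    )
    return df[mask]


# Scores copied from the first five rows of the full-run output.
scores = pd.DataFrame(
    {
        "utility_score": [0.814286, 0.935714, 0.936364, 0.909091, 0.888889],
        "leakage_mass": [0.0, 0.9, 0.0, 0.0, 0.6],
        "needs_human_review": [False, False, False, False, False],
    }
)

to_review = spot_check(scores)
print(f"{len(to_review)} of {len(scores)} rows selected for a spot check")
```

With these cut-offs, rows 0, 1, and 4 are selected: one for low utility and two for leakage above 0.5. Tightening or loosening the cut-offs is a judgment call that depends on the data and the privacy goal.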
⏭️ Next steps
- ⚖️ Rewriting Legal Documents -- rewrite legal text with custom entity labels and domain-specific privacy goals.
- 🎯 Choosing a Replacement Strategy -- compare Redact, Annotate, Hash, and Substitute if you prefer token-level replacement.
- 🔍 Inspecting Detected Entities -- debug what the detection pipeline found before rewriting.