Rewrite¶
Instead of replacing individual entities, rewrite mode transforms the entire text to produce a privacy-safe version that reduces both explicit identifiers and implicit, inferable signals that could lead to re-identification. While overall meaning is preserved where possible, rewrite mode intentionally modifies or removes details when needed to eliminate latent entities and break inference pathways.
How it works¶
Detection runs first (same as Replace mode, plus latent entity detection for context-inferable information). This includes identifying signals that may not be explicitly tagged but can be deduced from combinations of details (e.g., location inferred from contextual cues). The text is then classified by domain, and each entity or attribute is assigned a sensitivity disposition based on contextual risk, recognizing that quasi-identifiers can emerge from any aspect of the text.
The text is then rewritten to reduce identifiability, applying targeted transformations that disrupt inference (e.g., weakening or removing linking details) rather than simply rewording content. The rewritten output is evaluated for both quality and privacy leakage using adversarial testing. If thresholds are exceeded, the system automatically refines the rewrite. A final judge provides a qualitative assessment of the rewritten record. Any records that failed to meet standards are tagged for human review.
Basic usage¶
from anonymizer import Anonymizer, AnonymizerConfig, AnonymizerInput, Rewrite
anonymizer = Anonymizer()
config = AnonymizerConfig(rewrite=Rewrite())
data = AnonymizerInput(source="data.csv", text_column="text")
preview = anonymizer.preview(config=config, data=data, num_records=3)
preview.display_record()
Configuration¶
| Field | Default | Description |
|---|---|---|
privacy_goal |
Auto-populated | What to protect and what to preserve. |
instructions |
None |
Additional instructions for the rewrite LLM. |
risk_tolerance |
low |
Preset controlling repair and review thresholds: minimal, low, moderate, high. |
max_repair_iterations |
3 |
Maximum repair rounds. Set to 0 to disable repair. |
Privacy goal¶
PrivacyGoal tells the rewriter what matters. Defaults work for general-purpose data. Override for domain-specific needs:
from anonymizer.config.rewrite import PrivacyGoal
config = AnonymizerConfig(
rewrite=Rewrite(
privacy_goal=PrivacyGoal(
protect="All patient identifiers and clinical facility names",
preserve="Clinical findings, treatment plans, and medical terminology",
)
)
)
Be specific
The more precise the protect and preserve fields, the better the rewriter targets sensitive content while retaining what matters.
Risk tolerance¶
Controls the automated repair loop and human review flagging. Each preset bundles a coherent set of behaviors:
| Preset | Repair threshold | Repair on any high leak | Flag utility below | Flag leakage above |
|---|---|---|---|---|
minimal |
0.6 | Yes | 0.6 | 1.0 |
low |
1.0 | Yes | 0.5 | 2.0 |
moderate |
1.5 | Yes | 0.4 | 2.5 |
high |
2.0 | No | 0.3 | 3.0 |
The repair threshold is the leakage mass above which a record is sent for repair.
Leakage mass is a confidence-weighted sum of leaked entities, where each entity's weight reflects its sensitivity (high=1.0, medium=0.6, low=0.3). A leakage mass of 1.0 roughly equals one high-sensitivity entity leaked at full confidence.
config = AnonymizerConfig(
rewrite=Rewrite(
risk_tolerance="minimal",
max_repair_iterations=3,
)
)
Output columns¶
| Column | Description |
|---|---|
{text_col}_rewritten |
The privacy-safe rewritten text. |
utility_score |
Quality preservation (0.0--1.0). Higher is better. |
leakage_mass |
Weighted privacy leakage. Lower is better. |
weighted_leakage_rate |
Normalized leakage (0.0--1.0) relative to the maximum possible leakage mass. |
any_high_leaked |
Whether any high-sensitivity entity leaked through. |
needs_human_review |
Flag for records that may need manual review. |
Use preview.trace_dataframe for the full pipeline trace (domain, disposition, QA pairs, repair iterations, judge evaluation).
No entities? No rewrite.
Records with no detected entities pass through unchanged with utility_score=1.0 and leakage_mass=0.0.
Working with flagged records¶
Records with needs_human_review=True exceeded automated thresholds for leakage or utility. To investigate and resolve:
Diagnose: Use trace_dataframe to inspect the flagged record's intermediate columns — disposition, leakage breakdown, repair iterations, and judge evaluation.
flagged = result.trace_dataframe[result.trace_dataframe["needs_human_review"] == True]
flagged[["utility_score", "leakage_mass", "any_high_leaked"]].head()
Tune and re-run: Adjust settings and re-run on flagged records:
- Increase
max_repair_iterationsto give the rewriter more attempts. - Refine
privacy_goalwith more specificprotect/preserveinstructions for the domain. - Lower
risk_tolerance(e.g.minimal) to trigger more aggressive repair.
Last resort: Manually edit or exclude records that resist automated repair — some text is inherently difficult to rewrite without losing utility or leaking identifiers, and requires your judgement as the expert.
Model roles¶
Rewrite uses multiple LLM roles. All default to models in the default config:
| Role | Default | Purpose |
|---|---|---|
domain_classifier |
gpt-oss-120b |
Classifies text domain. |
disposition_analyzer |
gpt-oss-120b |
Assigns sensitivity levels. |
meaning_extractor |
gpt-oss-120b |
Extracts meaning units. |
qa_generator |
gpt-oss-120b |
Generates QA pairs for evaluation. |
rewriter |
gpt-oss-120b |
Generates the rewritten text. |
evaluator |
nemotron-30b-thinking |
Evaluates quality and leakage. |
repairer |
gpt-oss-120b |
Repairs high-leakage rewrites. |
judge |
nemotron-30b-thinking |
Final quality/privacy judge. |