Choosing a strategy

This guide walks through the decisions you make when configuring an AnonymizerConfig for a real dataset. Use it to go from "I have data + a goal" to a concrete starting config in a few minutes.

It is also the primary reference for AI agents that drive the Anonymizer skill: every decision below is something the agent has to make on the user's behalf.


Decision flow

1. (Detection) Describe the data?    → AnonymizerInput(data_summary=...)
2. (Detection) Detection knobs?      → Detect(entity_labels=..., gliner_threshold=...)
3. Replace or Rewrite?               → AnonymizerConfig(replace=...) vs AnonymizerConfig(rewrite=...)
4. (Replace) Which strategy?         → Substitute(...) | Redact(...) | Annotate(...) | Hash(...)
5. (Rewrite) Privacy goal?           → Rewrite(privacy_goal=PrivacyGoal(protect=..., preserve=...))
6. (Rewrite) Risk tolerance?         → Rewrite(risk_tolerance="minimal" | "low" | "moderate" | "high")

Steps 1–2 govern the detection stage that runs first in both modes; tuning them is usually the highest-leverage way to improve overall quality on a new dataset. Steps 3–6 shape the mode-specific transformation that follows detection.


1. (Detection) data_summary

AnonymizerInput.data_summary is an optional one-line description that flows into LLM prompts. It is the single cheapest quality lever you have. It improves detection — which runs first in both Replace and Rewrite modes, so it's a precursor to any transformation. In Rewrite mode it additionally provides context on what the data contains, helping the rewriter preserve meaning.

from anonymizer import AnonymizerInput

data = AnonymizerInput(
    source="patient_notes.csv",
    text_column="note",
    data_summary="De-identified inpatient progress notes from a US oncology service",
)

What to include:

  • The domain (clinical, legal, financial, customer support, etc.)
  • The genre (notes, transcripts, opinions, biographies)
  • Anything about the source the engine couldn't infer from a single record (e.g. "transcribed phone calls — expect disfluencies")
  • A soft do-not-tag list, when entity_labels=None. In that mode the augmenter is free to invent labels beyond DEFAULT_ENTITY_LABELS, and data_summary is the only place to tell the LLM what not to tag (e.g. "do not tag generic anatomical terms, medication class names, or job titles as PII").

What to leave out:

  • Lists of entity types you want detected (those go in Detect.entity_labels)
  • Privacy/utility goals (those go in Rewrite.privacy_goal)
  • Substitute behavior instructions (e.g. "names should remain Portuguese", "preserve numeric magnitude") — those go in Substitute(instructions)
  • Generic phrasing ("text data" adds no signal)
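
The include list above can be turned into a small helper that assembles a summary from its parts. This is only a sketch: build_data_summary and its parameters are hypothetical names, not part of the anonymizer package, and the example values are illustrative.

```python
def build_data_summary(domain, genre, source_notes=(), do_not_tag=()):
    """Compose a short data_summary string from domain, genre, and source hints."""
    parts = [f"{domain} {genre}".strip()]
    # Source quirks the engine could not infer from a single record.
    parts.extend(note.strip() for note in source_notes if note.strip())
    # Optional soft do-not-tag list (relevant when entity_labels=None).
    if do_not_tag:
        parts.append("do not tag " + ", ".join(do_not_tag) + " as PII")
    return "; ".join(parts)


summary = build_data_summary(
    domain="US oncology inpatient",
    genre="progress notes",
    source_notes=["transcribed dictation, expect abbreviations"],
    do_not_tag=["generic anatomical terms", "medication class names"],
)
print(summary)
```

The resulting string would then be passed as AnonymizerInput(data_summary=summary).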

2. (Detection) Detection knobs

For most datasets the detection defaults work. The main reason to adjust entity_labels is when your data has domain-specific entities that can be described in plain English — GLiNER is zero-shot, so any concept you can name (e.g. "clinical_facility", "internal_project_codename") becomes an entity it can find. Match the snake_case convention of DEFAULT_ENTITY_LABELS. If the entities you care about aren't in the default list, write them down and add them. Adjust gliner_threshold only when you see a specific recall or precision problem in preview.

entity_labels

Setting | Behavior | Use when
None (default) | Detect all DEFAULT_ENTITY_LABELS; the augmenter LLM can also infer new labels not in the default set | General-purpose; almost always the right starting point
Explicit list | Strict mode: only the labels you list are detected, and the augmenter cannot invent new ones | You have a domain-specific entity that the defaults don't cover, or you want to narrow detection to a known short list

Common ways to extend the default list:

  • Healthcare: clinical_facility, diagnosis_code, medication_name, lab_test_code
  • Legal: case_number, docket_number, statute_citation, judge_name
  • Customer support: ticket_id, internal_user_id, transaction_id
  • Internal: cost_center, internal_project_codename, experiment_id

from anonymizer import DEFAULT_ENTITY_LABELS, Detect

detect = Detect(entity_labels=[*DEFAULT_ENTITY_LABELS, "clinical_facility", "diagnosis_code", "medication_name"])
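
Before handing an extended list to Detect, it can help to check the snake_case convention and drop duplicates. The sketch below is a standalone helper, not library behavior: extend_labels is a hypothetical name, and stand_in_defaults is a short placeholder for DEFAULT_ENTITY_LABELS, whose real contents are not reproduced here.

```python
import re

# Lowercase words joined by single underscores, e.g. "clinical_facility".
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def extend_labels(defaults, extra):
    """Return defaults plus extra labels, de-duplicated, rejecting non-snake_case names."""
    bad = [label for label in extra if not SNAKE_CASE.fullmatch(label)]
    if bad:
        raise ValueError(f"labels must be snake_case: {bad}")
    # dict.fromkeys keeps first-seen order while dropping duplicates.
    return list(dict.fromkeys([*defaults, *extra]))


# Placeholder for anonymizer.DEFAULT_ENTITY_LABELS (assumed values, for illustration only).
stand_in_defaults = ["first_name", "last_name", "email"]
labels = extend_labels(stand_in_defaults, ["clinical_facility", "diagnosis_code", "email"])
print(labels)
```

With the real defaults, the result would be passed as Detect(entity_labels=labels).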

gliner_threshold

Default 0.3. The validator catches false positives downstream, so erring low is safe.

Symptom | Move | Try
Entities are being missed | Lower | 0.2 or even 0.15
Validator is slow or expensive because it's being handed a huge candidate list | Raise | 0.4–0.5

The trade-off is asymmetric. Lowering the threshold doesn't hurt accuracy: the validator runs in batches of validation_max_entities_per_call (default 100, tunable on Detect), so a long candidate list becomes more validator calls but not a worse validator. The cost of gliner_threshold=0.2 is latency and tokens, not precision. Raising the threshold trades that cost for recall risk: GLiNER stops surfacing borderline candidates and you're relying on the augmenter LLM alone to fill the gap. The default of 0.3 errs low; raise it only when validator cost is hurting you, and verify with an Annotate preview before trusting a high-threshold setup.
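
The batching cost described above is simple arithmetic: the number of validator calls grows with the candidate count divided by the batch size. The sketch below assumes candidates are split into full batches of validation_max_entities_per_call; validator_calls is a hypothetical helper and the candidate counts are made up for illustration.

```python
import math


def validator_calls(num_candidates, max_entities_per_call=100):
    """Number of validator calls needed to cover a candidate list in fixed-size batches."""
    return math.ceil(num_candidates / max_entities_per_call)


# A lower threshold surfaces more candidates, which means more calls, not a worse validator.
print(validator_calls(80))    # fits in one batch
print(validator_calls(250))   # spills into a third batch
```

Under this assumption, going from 80 to 250 candidates takes the validator from one call to three.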


3. Replace vs Rewrite

Both modes start from the same detection pipeline. The difference is what happens after entities are detected.

Question | Replace | Rewrite
Is the goal "scrub the entities and keep everything else"? | ✅ |
Is the goal "produce a privacy-safe version of this text that downstream models can train on"? | | ✅
Are there inferable / latent identifiers that aren't explicitly stated (e.g. "during her third round of chemo" → cancer treatment)? | ❌ leaves them | ✅ removes them
Additional LLM calls (beyond shared detection) | ~1 (Substitute) or 0 (Redact/Annotate/Hash) | Many (domain → disposition → QA → rewrite → evaluate → repair → judge)
Output text length | ≈ same as input | Often shorter / restructured
Best for | Structured records, log scrubbing, known-list redaction | Free-text data with implicit identifiers (clinical notes, biographies, depositions, support transcripts)

Picking between them. If your data has inferable identifiers that survive entity-only scrubbing (clinical notes, biographies, depositions), Rewrite is the right fit. For structured records, logs, or single-cell PII, Replace is faster and preserves shape. If you're unsure, walk through a few sample rows before deciding.


4. (Replace) Which strategy

The four strategies are summarised in Replace. The decision rule:

You want… | Use | Why
Realistic-looking text safe for sharing or training | Substitute | LLM-generated synthetic values preserve readability
Clear visual marking that an entity was removed | Redact | [REDACTED_FIRST_NAME] is unambiguous
To inspect what was detected without losing the original | Annotate | Original text is preserved next to the label, so the output is not privacy-safe on its own
Deterministic linking across documents (same person → same token) | Hash | Same input always produces the same hash digest

If you're not sure which to pick, use Substitute. It's the most general-purpose choice and matches the bulk of production usage.
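
The decision rule above can be written as a small lookup. This is a sketch only: choose_strategy and its flags are hypothetical, and the priority order when several flags are set is an assumption of this example rather than guidance from the library.

```python
def choose_strategy(inspect_only=False, needs_cross_row_linking=False, wants_visible_markers=False):
    """Map the needs in the table above to a Replace strategy name."""
    if inspect_only:
        return "Annotate"      # preview aid; keeps the original text
    if needs_cross_row_linking:
        return "Hash"          # same input, same token
    if wants_visible_markers:
        return "Redact"        # explicit [REDACTED_...] placeholders
    return "Substitute"        # general-purpose default


print(choose_strategy())
print(choose_strategy(needs_cross_row_linking=True))
```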

Writing Substitute.instructions

Substitute accepts free-form instructions that are passed to the replacement-generator LLM. Use them when the default behavior produces values that don't match your domain or downstream constraints.

Pattern | When to use | Example
Format constraint | The original has a structural shape that must be preserved | "Replacement IDs must keep the same prefix as the original (e.g. ACME-12345 → ACME-XXXXX)."
Domain hint | Entities are domain-specific and need plausible domain values | "Replacement names should be plausible Brazilian Portuguese names."
Negative constraint | Avoid certain values | "Do not use any name that appears in the original text."

Keep instructions short (one or two sentences). Long instructions compete with the per-entity context and degrade quality.

Substitute is per-row, not per-dataset

Within a single row, repeated mentions of the same value get one consistent replacement (entities are grouped by value before the LLM call). Across rows the LLM has no shared memory — each row is an independent call, so "Alice" in row 1 and "Alice" in row 47 will likely get different replacements. If you need stable cross-row mappings (e.g. to re-join records by an identifier), use Hash instead, or post-process result.trace_dataframe["_replacement_map"].
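
To see why a hash gives stable cross-row mappings, the sketch below derives a token from the value alone, so the same value maps to the same token in every row. It uses a plain truncated SHA-256 from the standard library; this illustrates the idea and is not a claim about how the library's Hash strategy is implemented. stable_token is a hypothetical helper.

```python
import hashlib


def stable_token(value, digest_length=16):
    """Same input string always maps to the same truncated SHA-256 hex token."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:digest_length]


rows = ["Alice called support.", "Ticket closed by Alice."]
# Each row is processed independently, yet "Alice" gets the same token in both.
pseudonymized = [row.replace("Alice", stable_token("Alice")) for row in rows]
for row in pseudonymized:
    print(row)
```

Because the token depends only on the input, records can be re-joined on it. Note that unsalted hashes of low-entropy values such as common first names can be guessed by hashing candidate values and comparing.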


5. (Rewrite) Privacy goal

Rewrite ships with sensible defaults for protect and preserve (auto-populated when you pass Rewrite() with no arguments). Override them when you can be more specific than the generic defaults.

How to write protect

protect answers: "What should not appear in the output, even by inference?"

Pattern | Example
Direct identifiers + quasi-identifiers | "All patient names, medical record numbers, dates of birth, and any combinations of attributes that could re-identify an individual"
Explicit category list | "Names, addresses, phone numbers, employer names, and any references to specific institutions"
Inferable signals to suppress | "Direct identifiers and any contextual phrases that could imply a specific medical condition or diagnosis"
Domain-specific identifiers | "Case numbers, court names, judge names, and any geographic identifiers below the state level"

How to write preserve

preserve answers: "What does the rewritten text need to keep so it's still useful?"

Pattern | Example
Domain content | "Clinical findings, treatment plans, and medical terminology"
Structural properties | "The narrative flow, approximate timeline, and emotional tone of the conversation"
Statistical properties | "The age range and approximate location at country level so downstream demographics analysis remains valid"
Task-relevant signals | "Argument structure, citations to legal precedent, and the procedural posture of the case"

Be specific, but stay short

Both fields must be 10–1000 characters and at least 3 words. Aim for 1–3 sentences. The more concrete you are, the more reliably the rewriter targets the right things.
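
The length and word-count limits above are easy to check before building a config. The sketch below is a standalone pre-check, not the library's own validator: goal_text_problems is a hypothetical name, and stripping surrounding whitespace before counting is an assumption.

```python
def goal_text_problems(text):
    """Return a list of reasons a protect/preserve string would break the stated limits."""
    stripped = text.strip()
    problems = []
    if not 10 <= len(stripped) <= 1000:
        problems.append(f"length {len(stripped)} is outside 10-1000 characters")
    if len(stripped.split()) < 3:
        problems.append("fewer than 3 words")
    return problems


print(goal_text_problems("Clinical findings, treatment plans, and medical terminology"))
print(goal_text_problems("patient names"))
```

An empty list means the text satisfies both constraints; it says nothing about whether the text is specific enough to be useful.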

When to set strict_entity_protection=True

Default: False. Only set True when explicitly required by compliance or audit policy — not just because the data is medical, legal, or financial.

By default, low-risk quasi-identifiers may be left unchanged when the engine judges them safe in context. Set strict_entity_protection=True to force every detected entity into an active protection method.

Use it when:

  • A documented compliance or audit policy mandates that every detected entity be actively protected (e.g. HIPAA Safe Harbor with strict interpretation, internal "zero unchanged identifiers" rule)
  • You're producing data for external sharing where any unchanged identifier is a compliance risk
  • Audit requires "every entity was actively protected"

Being in a regulated domain (medical / legal / financial) is not by itself a reason to set this to True — most regulated-domain processing tolerates the default behavior. Don't use it when utility matters more than blanket protection — it tends to increase modifications and can lower utility_score.


6. (Rewrite) Risk tolerance

risk_tolerance is a rewrite-only knob — it selects a coherent bundle of repair and review thresholds. The full table is in Rewrite > Risk tolerance; the choice rule is below.

Goal | Pick
"Medical / legal / financial / external release" | minimal
Default for most privacy-sensitive data | low
"I want utility prioritized, this is internal-only" | moderate
"I just want to see the system run, will fix things by hand" | high

Notes:

  • minimal and low differ mostly in how aggressively repair triggers. Both auto-repair on any high-sensitivity leak.
  • high does not auto-repair single high-sensitivity leaks. Use only when you have downstream review.
  • max_repair_iterations (default 3) caps cost. Set it to 0 to skip repair entirely while still computing leakage / utility metrics, which is useful for audits.

Goal → starting config cheat sheet

A starting point for common scenarios. Always run preview and iterate from here.

Goal | Mode | Strategy / config
"Scrub PII from logs for retention" | Replace | Redact()
"De-identify clinical notes for research sharing" | Rewrite | Rewrite(privacy_goal=PrivacyGoal(protect="all PHI and any context that could imply a specific patient or facility", preserve="clinical findings, treatments, and outcomes"), risk_tolerance="minimal", strict_entity_protection=True)
"Produce realistic-looking biographies for demos" | Replace | Substitute(instructions="Names and locations should remain plausible for the original cultural context.")
"Anonymize survey responses before sharing the dataset" | Replace | Substitute()
"Anonymize customer support transcripts for fine-tuning a model" | Replace | Substitute(instructions="Preserve domain-specific terminology and locale.")
"Anonymize legal opinions for an SFT dataset" | Rewrite | Rewrite(privacy_goal=PrivacyGoal(protect="party names, case numbers, judge names, and locations below the state level", preserve="argument structure and procedural posture"), risk_tolerance="low")
"Allow re-joining records by identifier without keeping the identifier" | Replace | Hash(algorithm="sha256", digest_length=16)
"I just want to see what the detector finds" | Replace | Annotate() (preview only; never ship Annotate output as anonymized data)

Once you have a starting config, run anonymizer validate <config>, then anonymizer preview --num-records 5 <config>, then iterate. See Troubleshooting for what to change when preview shows a problem.