🕵️ Choosing a Replacement Strategy
This notebook compares the four replace-mode strategies side by side on the same data.
| Strategy | What it does |
|---|---|
| Substitute | LLM-generated contextual replacements |
| Redact | Label-based markers ([REDACTED_FIRST_NAME]) |
| Annotate | Tags entities but keeps original text |
| Hash | Deterministic hash digest |
📚 What you'll learn
- Compare Redact, Annotate, Hash, and Substitute on the same input
- Customize output formats with `format_template`
- Understand which strategy fits your use case (readability, determinism, privacy)
Tip: First time running notebooks? Start with setup instructions.
⚙️ Setup
- Check if your `NVIDIA_API_KEY` from build.nvidia.com is registered for model access.
  - Treat the default `build.nvidia.com` setup as a convenient experimentation path. For privacy-sensitive or production data, switch to a secure endpoint you trust and to which you are comfortable sending data.
  - Request/token rate limits on `build.nvidia.com` vary by account and model access, and lower-volume development access can be slow for full runs. Start with `preview()` on a small sample.
- Import all four strategy classes: `Redact`, `Annotate`, `Hash`, `Substitute`.
- `Anonymizer()` initializes with the default model provider -- no extra config needed.
- `Anonymizer.configure_logging()` controls verbosity -- switch to `Anonymizer.configure_logging(LoggingConfig.debug())` when troubleshooting.
In [1]:
import getpass
import os

# Prompt for the API key only when it is not already set in the environment.
if not os.getenv("NVIDIA_API_KEY"):
    key = getpass.getpass("Enter NVIDIA_API_KEY from build.nvidia.com: ").strip()
    if not key:
        raise RuntimeError("NVIDIA_API_KEY is required to run these notebooks.")
    os.environ["NVIDIA_API_KEY"] = key
In [2]:
from anonymizer import Annotate, Anonymizer, AnonymizerConfig, AnonymizerInput, Hash, Redact, Substitute
In [3]:
anonymizer = Anonymizer()
[13:40:20] [INFO] 🔧 Anonymizer initialized with 3 model configs
[13:40:20] [INFO] |-- 🔎 detector: gliner-pii-detector
[13:40:20] [INFO] |-- ✅ validator: gpt-oss-120b
[13:40:20] [INFO] |-- 🧩 augmenter: gpt-oss-120b
📦 Input data
- We use the same biographies dataset throughout so each strategy is compared on identical input.
In [4]:
input_data = AnonymizerInput(
    source="https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv",
    text_column="biography",
    data_summary="Biographical profiles",
)
🔄 Substitute
- Uses an LLM to generate contextually appropriate synthetic replacements.
- The LLM considers the full document context, so replacements stay consistent with one another: names match emails, cities match states, and so on.
- Customize with `instructions` to steer the LLM's replacement choices.
In [5]:
substitute_config = AnonymizerConfig(replace=Substitute())
# Preview on 3 records before committing to a full run.
substitute_preview = anonymizer.preview(
    config=substitute_config,
    data=input_data,
    num_records=3,
)
[13:40:20] [INFO] 📂 Loaded 25 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv (column: 'biography')
[13:40:20] [INFO] detection labels in scope: (default: 65 labels; see anonymizer.DEFAULT_ENTITY_LABELS for list)
[13:40:20] [INFO] |-- 👀 Preview mode: processing 3 of 25 records
[13:40:20] [INFO] 🔍 Running entity detection on 3 records
[13:41:02] [INFO] |-- 📋 Detection complete — 79 entities found across 3 records (0 failed) [41.5s]
[13:41:02] [INFO] |-- labels: first_name=22, organization_name=8, age=5, occupation=5, city=4, state=4, degree=4, university=4, field_of_study=4, political_view=4, last_name=3, race_ethnicity=3, language=2, street_address=2, place_name=1, date_of_birth=1, device_identifier=1, employment_status=1, religious_belief=1
[13:41:02] [INFO] 🔄 Running Substitute replacement
[13:41:18] [INFO] |-- 📋 Replacement complete (0 failed) [16.0s]
[13:41:18] [INFO] 🎉 Pipeline complete — 3 records processed, 0 total failures
In [6]:
substitute_preview.display_record(0)
Custom instructions
- Pass `instructions` to guide the LLM -- e.g. keep replacements within a specific region, culture, or naming convention.
In [7]:
substitute_custom_config = AnonymizerConfig(
    replace=Substitute(instructions="Use only Japanese names and locations for all replacements.")
)
substitute_custom_preview = anonymizer.preview(
    config=substitute_custom_config,
    data=input_data,
    num_records=3,
)
substitute_custom_preview.display_record(0)
[13:41:18] [INFO] 📂 Loaded 25 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv (column: 'biography')
[13:41:18] [INFO] detection labels in scope: (default: 65 labels; see anonymizer.DEFAULT_ENTITY_LABELS for list)
[13:41:18] [INFO] |-- 👀 Preview mode: processing 3 of 25 records
[13:41:18] [INFO] 🔍 Running entity detection on 3 records
[13:41:53] [INFO] |-- 📋 Detection complete — 77 entities found across 3 records (0 failed) [34.7s]
[13:41:53] [INFO] |-- labels: first_name=22, organization_name=7, age=5, occupation=5, city=4, state=4, degree=4, university=4, field_of_study=4, last_name=3, race_ethnicity=3, political_view=3, language=2, religious_belief=2, street_address=2, place_name=1, date_of_birth=1, employment_status=1
[13:41:53] [INFO] 🔄 Running Substitute replacement
[13:42:12] [INFO] |-- 📋 Replacement complete (0 failed) [19.5s]
[13:42:12] [INFO] 🎉 Pipeline complete — 3 records processed, 0 total failures
🚫 Redact
- Replaces each entity with a label-based marker. Default: `[REDACTED_FIRST_NAME]`.
- Customize with `Redact(format_template=...)`.
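To make the idea of a label-based marker concrete, here is a minimal sketch in plain Python. It is not the library's implementation: the `Entity` tuple, the `redact_text` helper, the character offsets, and the `{label}` placeholder in the template are all assumptions made for illustration, so check the library documentation for the placeholders `Redact` actually accepts.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    """Hypothetical detected span: character offsets plus an entity label."""
    start: int
    end: int
    label: str


def redact_text(text, entities, format_template="[REDACTED_{label}]"):
    """Replace each span with a marker built from its label (illustrative only)."""
    out = text
    # Work from the last span to the first so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e.start, reverse=True):
        marker = format_template.format(label=ent.label.upper())
        out = out[:ent.start] + marker + out[ent.end:]
    return out


text = "Alice lives in Austin."
entities = [Entity(0, 5, "first_name"), Entity(15, 21, "city")]

print(redact_text(text, entities))         # [REDACTED_FIRST_NAME] lives in [REDACTED_CITY].
print(redact_text(text, entities, "***"))  # *** lives in ***.
```

A template with no placeholder, such as `"***"`, collapses every entity to the same constant, which hides the entity type as well as its value.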
In [8]:
redact_config = AnonymizerConfig(replace=Redact())
redact_preview = anonymizer.preview(
    config=redact_config,
    data=input_data,
    num_records=3,
)
redact_preview.display_record(0)
[13:42:12] [INFO] 📂 Loaded 25 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv (column: 'biography')
[13:42:12] [INFO] detection labels in scope: (default: 65 labels; see anonymizer.DEFAULT_ENTITY_LABELS for list)
[13:42:12] [INFO] |-- 👀 Preview mode: processing 3 of 25 records
[13:42:12] [INFO] 🔍 Running entity detection on 3 records
[13:43:00] [INFO] |-- 📋 Detection complete — 78 entities found across 3 records (0 failed) [47.6s]
[13:43:00] [INFO] |-- labels: first_name=23, organization_name=7, age=5, occupation=5, city=4, state=4, degree=4, university=4, field_of_study=4, last_name=3, race_ethnicity=3, political_view=3, language=2, religious_belief=2, street_address=2, place_name=1, date_of_birth=1, employment_status=1
[13:43:00] [INFO] 🔄 Running Redact replacement
[13:43:00] [INFO] |-- 📋 Replacement complete (0 failed) [0.0s]
[13:43:00] [INFO] 🎉 Pipeline complete — 3 records processed, 0 total failures
Custom template
- `format_template="***"` replaces every entity with the same constant.
In [9]:
custom_config = AnonymizerConfig(replace=Redact(format_template="***"))
custom_preview = anonymizer.preview(
    config=custom_config,
    data=input_data,
    num_records=3,
)
custom_preview.display_record(0)
[13:43:00] [INFO] 📂 Loaded 25 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv (column: 'biography')
[13:43:00] [INFO] detection labels in scope: (default: 65 labels; see anonymizer.DEFAULT_ENTITY_LABELS for list)
[13:43:00] [INFO] |-- 👀 Preview mode: processing 3 of 25 records
[13:43:00] [INFO] 🔍 Running entity detection on 3 records
[13:43:41] [INFO] |-- 📋 Detection complete — 77 entities found across 3 records (0 failed) [41.7s]
[13:43:41] [INFO] |-- labels: first_name=22, organization_name=8, age=5, occupation=5, city=4, state=4, degree=4, university=4, field_of_study=4, last_name=3, race_ethnicity=3, language=2, political_view=2, religious_belief=2, street_address=2, place_name=1, date_of_birth=1, employment_status=1
[13:43:41] [INFO] 🔄 Running Redact replacement
[13:43:41] [INFO] |-- 📋 Replacement complete (0 failed) [0.0s]
[13:43:41] [INFO] 🎉 Pipeline complete — 3 records processed, 0 total failures
🏷️ Annotate
- Tags each entity with its label but keeps the original text visible. Default: `<Alice, first_name>`.
- Customize with `format_template` -- must include `{text}` and `{label}`, e.g. `Annotate(format_template="<{text}-|-{label}>")`.
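The template rule above can be sketched with ordinary string formatting. This is an illustration only, not the library's code: the `annotate` helper is hypothetical, and the default template `"<{text}, {label}>"` is inferred from the default output shown above rather than taken from the library.

```python
def annotate(text, label, format_template="<{text}, {label}>"):
    """Wrap an entity in a tag that keeps the original text (illustrative only)."""
    # Both placeholders are required so that neither the text nor the label is lost.
    for placeholder in ("{text}", "{label}"):
        if placeholder not in format_template:
            raise ValueError(f"format_template must include {placeholder}")
    return format_template.format(text=text, label=label)


print(annotate("Alice", "first_name"))                        # <Alice, first_name>
print(annotate("Alice", "first_name", "<{text}-|-{label}>"))  # <Alice-|-first_name>
```

Because the original text stays in the output, this strategy is useful for reviewing detection quality and offers no privacy protection by itself.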
In [10]:
annotate_config = AnonymizerConfig(replace=Annotate())
annotate_preview = anonymizer.preview(
    config=annotate_config,
    data=input_data,
    num_records=3,
)
annotate_preview.display_record(0)
[13:43:41] [INFO] 📂 Loaded 25 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv (column: 'biography')
[13:43:41] [INFO] detection labels in scope: (default: 65 labels; see anonymizer.DEFAULT_ENTITY_LABELS for list)
[13:43:41] [INFO] |-- 👀 Preview mode: processing 3 of 25 records
[13:43:41] [INFO] 🔍 Running entity detection on 3 records
[13:44:26] [INFO] |-- 📋 Detection complete — 79 entities found across 3 records (0 failed) [44.4s]
[13:44:26] [INFO] |-- labels: first_name=22, organization_name=7, age=5, occupation=5, city=4, state=4, degree=4, university=4, field_of_study=4, last_name=3, race_ethnicity=3, political_view=3, language=2, religious_belief=2, street_address=2, place_name=1, date_of_birth=1, project_name=1, employment_status=1, company_name=1
[13:44:26] [INFO] 🔄 Running Annotate replacement
[13:44:26] [INFO] |-- 📋 Replacement complete (0 failed) [0.0s]
[13:44:26] [INFO] 🎉 Pipeline complete — 3 records processed, 0 total failures
Custom template
- Override the default format with any string containing `{text}` and `{label}`.
In [11]:
annotate_custom_config = AnonymizerConfig(replace=Annotate(format_template="<{text}-|-{label}>"))
annotate_custom_preview = anonymizer.preview(
    config=annotate_custom_config,
    data=input_data,
    num_records=3,
)
annotate_custom_preview.display_record(0)
[13:44:26] [INFO] 📂 Loaded 25 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv (column: 'biography')
[13:44:26] [INFO] detection labels in scope: (default: 65 labels; see anonymizer.DEFAULT_ENTITY_LABELS for list)
[13:44:26] [INFO] |-- 👀 Preview mode: processing 3 of 25 records
[13:44:26] [INFO] 🔍 Running entity detection on 3 records
[13:45:13] [INFO] |-- 📋 Detection complete — 78 entities found across 3 records (0 failed) [46.6s]
[13:45:13] [INFO] |-- labels: first_name=22, organization_name=7, age=5, occupation=5, city=4, state=4, degree=4, university=4, field_of_study=4, last_name=3, race_ethnicity=3, political_view=3, language=2, religious_belief=2, street_address=2, place_name=1, date_of_birth=1, telescope_name=1, employment_status=1
[13:45:13] [INFO] 🔄 Running Annotate replacement
[13:45:13] [INFO] |-- 📋 Replacement complete (0 failed) [0.0s]
[13:45:13] [INFO] 🎉 Pipeline complete — 3 records processed, 0 total failures
#️⃣ Hash
- Deterministic -- same input always produces the same hash.
- Customize with `format_template` (must include `{digest}`), `algorithm` (sha256/sha1/md5), and `digest_length` (6-64 characters).
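The three options can be pictured with Python's standard `hashlib`. This is a rough sketch under stated assumptions, not the library's implementation: the `hash_entity` helper, its default values, and the choice to truncate the hex digest to `digest_length` characters are all hypothetical.

```python
import hashlib

ALLOWED_ALGORITHMS = {"sha256", "sha1", "md5"}


def hash_entity(text, algorithm="sha256", digest_length=12, format_template="<HASH_{digest}>"):
    """Replace an entity with a truncated hex digest (illustrative only)."""
    if algorithm not in ALLOWED_ALGORITHMS:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    if not 6 <= digest_length <= 64:
        raise ValueError("digest_length must be between 6 and 64")
    if "{digest}" not in format_template:
        raise ValueError("format_template must include {digest}")
    # Assumption: the digest is the hex string cut to digest_length characters.
    digest = hashlib.new(algorithm, text.encode("utf-8")).hexdigest()[:digest_length]
    return format_template.format(digest=digest)


# Same input, same output: the replacement is deterministic.
print(hash_entity("Alice") == hash_entity("Alice"))  # True
# Custom algorithm, length, and format, mirroring the options listed above.
print(hash_entity("Alice", algorithm="md5", digest_length=8, format_template="[{digest}]"))
```

Since the digest depends only on the input text, the same value maps to the same token in every record, which keeps records linkable after replacement.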
In [12]:
hash_config = AnonymizerConfig(replace=Hash())
hash_preview = anonymizer.preview(
    config=hash_config,
    data=input_data,
    num_records=3,
)
hash_preview.display_record(0)
[13:45:13] [INFO] 📂 Loaded 25 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv (column: 'biography')
[13:45:13] [INFO] detection labels in scope: (default: 65 labels; see anonymizer.DEFAULT_ENTITY_LABELS for list)
[13:45:13] [INFO] |-- 👀 Preview mode: processing 3 of 25 records
[13:45:13] [INFO] 🔍 Running entity detection on 3 records
[13:45:56] [INFO] |-- 📋 Detection complete — 77 entities found across 3 records (0 failed) [43.3s]
[13:45:56] [INFO] |-- labels: first_name=21, organization_name=8, age=5, occupation=5, city=4, state=4, degree=4, university=4, field_of_study=4, political_view=3, last_name=2, race_ethnicity=2, language=2, religious_belief=2, street_address=2, place_name=1, full_name=1, date_of_birth=1, nationality=1, employment_status=1
[13:45:56] [INFO] 🔄 Running Hash replacement
[13:45:56] [INFO] |-- 📋 Replacement complete (0 failed) [0.0s]
[13:45:56] [INFO] 🎉 Pipeline complete — 3 records processed, 0 total failures
Custom template
- Override the algorithm, digest length, and output format.
In [13]:
hash_custom_config = AnonymizerConfig(replace=Hash(algorithm="md5", digest_length=8, format_template="[{digest}]"))
hash_custom_preview = anonymizer.preview(
    config=hash_custom_config,
    data=input_data,
    num_records=3,
)
hash_custom_preview.display_record(0)
[13:45:56] [INFO] 📂 Loaded 25 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv (column: 'biography')
[13:45:56] [INFO] detection labels in scope: (default: 65 labels; see anonymizer.DEFAULT_ENTITY_LABELS for list)
[13:45:56] [INFO] |-- 👀 Preview mode: processing 3 of 25 records
[13:45:56] [INFO] 🔍 Running entity detection on 3 records
[13:46:37] [INFO] |-- 📋 Detection complete — 77 entities found across 3 records (0 failed) [41.4s]
[13:46:37] [INFO] |-- labels: first_name=23, organization_name=7, age=5, occupation=5, city=4, state=4, degree=4, university=4, field_of_study=4, last_name=3, race_ethnicity=3, political_view=3, language=2, street_address=2, place_name=1, date_of_birth=1, employment_status=1, religious_belief=1
[13:46:37] [INFO] 🔄 Running Hash replacement
[13:46:37] [INFO] |-- 📋 Replacement complete (0 failed) [0.0s]
[13:46:37] [INFO] 🎉 Pipeline complete — 3 records processed, 0 total failures
⏭️ Next steps
- 🕵️ Inspecting Detected Entities -- dig into what the detection pipeline found and debug quality.
- ✏️ Rewriting Biographies -- generate privacy-safe paraphrases instead of token-level replacements.
- ⚖️ Rewriting Legal Documents -- rewrite legal text with domain-specific privacy goals.