🕵️ Choosing a Replacement Strategy
Four replace mode strategies compared side-by-side on the same data.
| Strategy | What it does |
|---|---|
| Substitute | LLM-generated contextual replacements |
| Redact | Label-based markers ([REDACTED_FIRST_NAME]) |
| Annotate | Tags entities but keeps original text |
| Hash | Deterministic hash digest |
📚 What you'll learn
- Compare Redact, Annotate, Hash, and Substitute on the same input
- Customize output formats with `format_template`
- Understand which strategy fits your use case (readability, determinism, privacy)
Tip: First time running notebooks? Start with setup instructions.
⚙️ Setup
- Check if your `NVIDIA_API_KEY` from build.nvidia.com is registered for model access.
  - The default `build.nvidia.com` (NVIDIA Build) setup is a convenient way to try Anonymizer and iterate on previews. Use of NVIDIA Build is subject to NVIDIA Build's own terms of service and privacy practices, which are separate from and independent of the NeMo Framework library. NVIDIA Build is intended for evaluation and testing purposes only and may not be used in production environments. Do not upload any confidential information or personal data when using NVIDIA Build. Your use of NVIDIA Build is logged for security purposes and to improve NVIDIA products and services.
  - Request and token rate limits on `build.nvidia.com` vary by account and model access, and lower-volume development access can be slow for full-dataset runs. Start with `preview()` on a small sample, then move to your own endpoint for production data and usage.
- Import all four strategy classes: `Redact`, `Annotate`, `Hash`, `Substitute`.
- `Anonymizer()` initializes with the default model provider -- no extra config needed.
- `configure_logging(LoggingConfig.default())` keeps logs at INFO. Switch to `LoggingConfig.debug()` when troubleshooting.
In [2]:
import getpass
import os

# Prompt for the key only when it is not already set in the environment.
if not os.getenv("NVIDIA_API_KEY"):
    key = getpass.getpass("Enter NVIDIA_API_KEY from build.nvidia.com: ").strip()
    if not key:
        raise RuntimeError("NVIDIA_API_KEY is required to run these notebooks.")
    os.environ["NVIDIA_API_KEY"] = key
In [ ]:
from anonymizer import (
    Annotate,
    Anonymizer,
    AnonymizerConfig,
    AnonymizerInput,
    Hash,
    LoggingConfig,
    Redact,
    Substitute,
    configure_logging,
)

configure_logging(LoggingConfig.default())
In [4]:
anonymizer = Anonymizer()
[13:17:15] [INFO] 🔧 Anonymizer initialized with 3 model configs
[13:17:15] [INFO] |-- 🔎 detector: gliner-pii-detector
[13:17:15] [INFO] |-- ✅ validator: gpt-oss-120b
[13:17:15] [INFO] |-- 🧩 augmenter: gpt-oss-120b
📦 Input data
- We use the same biographies dataset throughout so each strategy is compared on identical input.
In [5]:
input_data = AnonymizerInput(
    source="https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv",
    text_column="biography",
    data_summary="Biographical profiles",
)
🔄 Substitute
- Uses an LLM to generate contextually appropriate synthetic replacements.
- The LLM considers the full document context, matching names with emails, cities with states, etc.
- Customize with `instructions` to steer the LLM's replacement choices.
In [6]:
substitute_config = AnonymizerConfig(replace=Substitute())
substitute_preview = anonymizer.preview(
    config=substitute_config,
    data=input_data,
    num_records=3,
)
[13:17:15] [INFO] 👀 Preview mode: 📂 Loaded 3 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv (column: 'biography')
[13:17:15] [INFO] 🔍 Running entity detection on 3 records
[13:17:15] [INFO] detection labels in scope: (default: 65 labels; see anonymizer.DEFAULT_ENTITY_LABELS for list)
[13:17:36] [INFO] |-- 📋 Detection complete — 76 entities found across 3 records (0 failed) [20.6s]
[13:17:36] [INFO] |-- labels: first_name=22, state=6, age=5, occupation=5, city=5, company_name=4, last_name=3, race_ethnicity=3, organization_name=3, language=3, political_view=3, education_level=3, field_of_study=2, street_address=2, degree=1, university=1, place_name=1, date_of_birth=1, project_name=1, employment_status=1, religious_belief=1
[13:17:36] [INFO] 🔄 Running Substitute replacement
[13:17:50] [INFO] |-- 📋 Replacement complete (0 failed) [14.6s]
[13:17:50] [INFO] 🎉 Pipeline complete — 3 records processed, 0 total failures
In [7]:
substitute_preview.display_record(0)
Custom instructions
- Pass `instructions` to guide the LLM -- e.g. keep replacements within a specific region, culture, or naming convention.
In [8]:
substitute_custom_config = AnonymizerConfig(
    replace=Substitute(instructions="Use only Japanese names and locations for all replacements.")
)
substitute_custom_preview = anonymizer.preview(
    config=substitute_custom_config,
    data=input_data,
    num_records=3,
)
substitute_custom_preview.display_record(0)
[13:17:51] [INFO] 👀 Preview mode: 📂 Loaded 3 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv (column: 'biography')
[13:17:51] [INFO] 🔍 Running entity detection on 3 records
[13:17:51] [INFO] detection labels in scope: (default: 65 labels; see anonymizer.DEFAULT_ENTITY_LABELS for list)
[13:18:18] [INFO] |-- 📋 Detection complete — 78 entities found across 3 records (0 failed) [27.1s]
[13:18:18] [INFO] |-- labels: first_name=22, state=6, age=5, occupation=5, city=5, organization_name=5, company_name=4, last_name=3, race_ethnicity=3, language=3, political_view=3, degree=2, field_of_study=2, education_level=2, religious_belief=2, street_address=2, university=1, date_of_birth=1, telescope_array=1, employment_status=1
[13:18:18] [INFO] 🔄 Running Substitute replacement
[13:18:28] [INFO] |-- 📋 Replacement complete (0 failed) [10.2s]
[13:18:28] [INFO] 🎉 Pipeline complete — 3 records processed, 0 total failures
🚫 Redact
- Replaces each entity with a label-based marker. Default: `[REDACTED_FIRST_NAME]`.
- Customize with `Redact(format_template=...)`.
In [9]:
redact_config = AnonymizerConfig(replace=Redact())
redact_preview = anonymizer.preview(
    config=redact_config,
    data=input_data,
    num_records=3,
)
redact_preview.display_record(0)
[13:18:28] [INFO] 👀 Preview mode: 📂 Loaded 3 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv (column: 'biography')
[13:18:28] [INFO] 🔍 Running entity detection on 3 records
[13:18:28] [INFO] detection labels in scope: (default: 65 labels; see anonymizer.DEFAULT_ENTITY_LABELS for list)
[13:18:54] [INFO] |-- 📋 Detection complete — 75 entities found across 3 records (0 failed) [25.6s]
[13:18:54] [INFO] |-- labels: first_name=22, state=6, age=5, occupation=5, city=5, organization_name=4, education_level=4, last_name=3, race_ethnicity=3, language=3, company_name=3, political_view=3, religious_belief=2, street_address=2, university=1, place_name=1, date_of_birth=1, field_of_study=1, employment_status=1
[13:18:54] [INFO] 🔄 Running Redact replacement
[13:18:54] [INFO] |-- 📋 Replacement complete (0 failed) [0.0s]
[13:18:54] [INFO] 🎉 Pipeline complete — 3 records processed, 0 total failures
Custom template
- `format_template="***"` replaces every entity with the same constant string.
In [10]:
custom_config = AnonymizerConfig(replace=Redact(format_template="***"))
custom_preview = anonymizer.preview(
    config=custom_config,
    data=input_data,
    num_records=3,
)
custom_preview.display_record(0)
[13:18:54] [INFO] 👀 Preview mode: 📂 Loaded 3 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv (column: 'biography')
[13:18:54] [INFO] 🔍 Running entity detection on 3 records
[13:18:54] [INFO] detection labels in scope: (default: 65 labels; see anonymizer.DEFAULT_ENTITY_LABELS for list)
[13:19:21] [INFO] |-- 📋 Detection complete — 75 entities found across 3 records (0 failed) [26.5s]
[13:19:21] [INFO] |-- labels: first_name=22, state=6, age=5, occupation=5, city=5, organization_name=4, company_name=4, last_name=3, race_ethnicity=3, language=3, political_view=3, degree=2, field_of_study=2, education_level=2, street_address=2, place_name=1, date_of_birth=1, employment_status=1, religious_belief=1
[13:19:21] [INFO] 🔄 Running Redact replacement
[13:19:21] [INFO] |-- 📋 Replacement complete (0 failed) [0.0s]
[13:19:21] [INFO] 🎉 Pipeline complete — 3 records processed, 0 total failures
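To make the two Redact outputs above concrete, here is a minimal pure-Python sketch of label-based redaction. It is an illustration only, not Anonymizer's implementation: the `Entity` tuple, the `redact` helper, and the `{label}` placeholder name are all assumptions made for this example.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    """A detected span (hypothetical structure): character offsets plus a label."""

    start: int
    end: int
    label: str


def redact(text: str, entities: list[Entity], template: str = "[REDACTED_{label}]") -> str:
    """Replace each entity span with the template.

    The {label} placeholder is an assumed name for this sketch.
    """
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e.start, reverse=True):
        marker = template.format(label=ent.label.upper())
        text = text[: ent.start] + marker + text[ent.end :]
    return text


sample = "Alice lives in Austin."
spans = [Entity(0, 5, "first_name"), Entity(15, 21, "city")]

print(redact(sample, spans))                  # [REDACTED_FIRST_NAME] lives in [REDACTED_CITY].
print(redact(sample, spans, template="***"))  # *** lives in ***.
```

The constant template hides the entity type as well as the value, which trades readability for a little extra privacy.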
🏷️ Annotate
- Tags each entity with its label but keeps the original text visible. Default: `<Alice, first_name>`.
- Customize with `format_template` -- must include `{text}` and `{label}`, e.g. `Annotate(format_template="<{text}-|-{label}>")`.
In [11]:
annotate_config = AnonymizerConfig(replace=Annotate())
annotate_preview = anonymizer.preview(
    config=annotate_config,
    data=input_data,
    num_records=3,
)
annotate_preview.display_record(0)
[13:19:21] [INFO] 👀 Preview mode: 📂 Loaded 3 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv (column: 'biography')
[13:19:21] [INFO] 🔍 Running entity detection on 3 records
[13:19:21] [INFO] detection labels in scope: (default: 65 labels; see anonymizer.DEFAULT_ENTITY_LABELS for list)
[13:19:49] [INFO] |-- 📋 Detection complete — 77 entities found across 3 records (0 failed) [27.8s]
[13:19:49] [INFO] |-- labels: first_name=22, state=6, age=5, occupation=5, city=5, organization_name=5, company_name=4, last_name=3, race_ethnicity=3, language=3, political_view=3, degree=2, field_of_study=2, education_level=2, street_address=2, university=1, place_name=1, date_of_birth=1, employment_status=1, religious_belief=1
[13:19:49] [INFO] 🔄 Running Annotate replacement
[13:19:49] [INFO] |-- 📋 Replacement complete (0 failed) [0.0s]
[13:19:49] [INFO] 🎉 Pipeline complete — 3 records processed, 0 total failures
Custom template
- Override the default format with any string containing `{text}` and `{label}`.
In [12]:
annotate_custom_config = AnonymizerConfig(replace=Annotate(format_template="<{text}-|-{label}>"))
annotate_custom_preview = anonymizer.preview(
    config=annotate_custom_config,
    data=input_data,
    num_records=3,
)
annotate_custom_preview.display_record(0)
[13:19:49] [INFO] 👀 Preview mode: 📂 Loaded 3 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv (column: 'biography')
[13:19:49] [INFO] 🔍 Running entity detection on 3 records
[13:19:49] [INFO] detection labels in scope: (default: 65 labels; see anonymizer.DEFAULT_ENTITY_LABELS for list)
[13:20:16] [INFO] |-- 📋 Detection complete — 77 entities found across 3 records (0 failed) [26.8s]
[13:20:16] [INFO] |-- labels: first_name=22, state=6, organization_name=6, age=5, occupation=5, city=5, last_name=3, race_ethnicity=3, language=3, company_name=3, political_view=3, education_level=3, religious_belief=2, street_address=2, degree=1, university=1, place_name=1, date_of_birth=1, field_of_study=1, employment_status=1
[13:20:16] [INFO] 🔄 Running Annotate replacement
[13:20:16] [INFO] |-- 📋 Replacement complete (0 failed) [0.0s]
[13:20:16] [INFO] 🎉 Pipeline complete — 3 records processed, 0 total failures
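The effect of an Annotate template can be sketched in plain Python with `str.format`. This is a hedged illustration rather than the library's code: the `annotate` helper and the `(start, end, label)` span tuples are hypothetical, and the default template string is inferred from the `<Alice, first_name>` example above.

```python
def annotate(text, spans, template="<{text}, {label}>"):
    """Wrap each (start, end, label) span using a template with {text} and {label}."""
    # Mirror the documented requirement that both placeholders are present.
    for placeholder in ("{text}", "{label}"):
        if placeholder not in template:
            raise ValueError(f"template must include {placeholder}")
    # Replace from the last span to the first so offsets remain valid.
    for start, end, label in sorted(spans, reverse=True):
        tagged = template.format(text=text[start:end], label=label)
        text = text[:start] + tagged + text[end:]
    return text


sample = "Alice lives in Austin."
spans = [(0, 5, "first_name"), (15, 21, "city")]

print(annotate(sample, spans))
# <Alice, first_name> lives in <Austin, city>.
print(annotate(sample, spans, template="<{text}-|-{label}>"))
# <Alice-|-first_name> lives in <Austin-|-city>.
```

Because the original text stays visible, this kind of output is useful for reviewing detection quality but does not protect privacy on its own.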
#️⃣ Hash
- Deterministic -- same input always produces the same hash.
- Customize with `format_template` (must include `{digest}`), `algorithm` (`sha256`/`sha1`/`md5`), and `digest_length` (6-64 characters).
In [13]:
hash_config = AnonymizerConfig(replace=Hash())
hash_preview = anonymizer.preview(
    config=hash_config,
    data=input_data,
    num_records=3,
)
hash_preview.display_record(0)
[13:20:31] [INFO] 👀 Preview mode: 📂 Loaded 3 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv (column: 'biography')
[13:20:31] [INFO] 🔍 Running entity detection on 3 records
[13:20:31] [INFO] detection labels in scope: (default: 65 labels; see anonymizer.DEFAULT_ENTITY_LABELS for list)
[13:21:42] [INFO] |-- 📋 Detection complete — 78 entities found across 3 records (0 failed) [71.4s]
[13:21:42] [INFO] |-- labels: first_name=23, state=6, age=5, occupation=5, city=5, organization_name=4, last_name=3, race_ethnicity=3, language=3, company_name=3, political_view=3, education_level=3, religious_belief=2, street_address=2, school_name=1, degree=1, university=1, clinic_name=1, place_name=1, date_of_birth=1, field_of_study=1, employment_status=1
[13:21:42] [INFO] 🔄 Running Hash replacement
[13:21:42] [INFO] |-- 📋 Replacement complete (0 failed) [0.0s]
[13:21:42] [INFO] 🎉 Pipeline complete — 3 records processed, 0 total failures
Custom template
- Override the algorithm, digest length, and output format.
In [14]:
hash_custom_config = AnonymizerConfig(replace=Hash(algorithm="md5", digest_length=8, format_template="[{digest}]"))
hash_custom_preview = anonymizer.preview(
    config=hash_custom_config,
    data=input_data,
    num_records=3,
)
hash_custom_preview.display_record(0)
[13:21:43] [INFO] 👀 Preview mode: 📂 Loaded 3 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv (column: 'biography')
[13:21:43] [INFO] 🔍 Running entity detection on 3 records
[13:21:43] [INFO] detection labels in scope: (default: 65 labels; see anonymizer.DEFAULT_ENTITY_LABELS for list)
[13:22:18] [INFO] |-- 📋 Detection complete — 76 entities found across 3 records (0 failed) [34.9s]
[13:22:18] [INFO] |-- labels: first_name=22, state=6, age=5, occupation=5, city=5, organization_name=4, company_name=4, last_name=3, race_ethnicity=3, language=3, political_view=3, degree=2, field_of_study=2, education_level=2, street_address=2, university=1, place_name=1, date_of_birth=1, employment_status=1, religious_belief=1
[13:22:18] [INFO] 🔄 Running Hash replacement
[13:22:18] [INFO] |-- 📋 Replacement complete (0 failed) [0.0s]
[13:22:18] [INFO] 🎉 Pipeline complete — 3 records processed, 0 total failures
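The determinism property and the three Hash options can be illustrated with Python's standard `hashlib`. This is a sketch under stated assumptions, not Anonymizer's implementation: the `hash_entity` helper and its default values are hypothetical, and it assumes the digest is the truncated hex digest of the UTF-8 entity text, with no salting or normalization.

```python
import hashlib

ALLOWED_ALGORITHMS = {"sha256", "sha1", "md5"}


def hash_entity(value, algorithm="sha256", digest_length=12, template="{digest}"):
    """Return a truncated hex digest of `value`, rendered through `template`.

    Defaults here are arbitrary choices for the sketch, not the library's defaults.
    """
    if algorithm not in ALLOWED_ALGORITHMS:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    if not 6 <= digest_length <= 64:
        raise ValueError("digest_length must be between 6 and 64")
    if "{digest}" not in template:
        raise ValueError("template must include {digest}")
    # Slicing never fails: shorter digests (md5 has 32 hex characters) are returned whole.
    digest = hashlib.new(algorithm, value.encode("utf-8")).hexdigest()[:digest_length]
    return template.format(digest=digest)


first = hash_entity("Alice")
second = hash_entity("Alice")
short = hash_entity("Alice", algorithm="md5", digest_length=8, template="[{digest}]")

print(first == second)  # True: the same input always yields the same digest
print(short)            # eight hex characters wrapped in brackets
```

Determinism lets the same entity be linked across records, which is convenient for joins. Shorter digests and unsalted hashes of low-entropy values such as first names are easier to guess or collide, so they offer weaker privacy.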
⏭️ Next steps
- 🕵️ Inspecting Detected Entities -- dig into what the detection pipeline found and debug quality.
- ✏️ Rewriting Biographies -- generate privacy-safe paraphrases instead of token-level replacements.
- ⚖️ Rewriting Legal Documents -- rewrite legal text with domain-specific privacy goals.