🕵️ NeMo Anonymizer
NeMo Anonymizer detects and protects personally identifiable information (PII) through context-aware replacement and rewriting. It combines high-quality, user-guided entity detection with modification options that preserve context while protecting privacy. You can review what sensitive information was found, adjust your masking strategy, and generate anonymized text.
Pick a strategy:
Original: Alice met with Bob and his daughter to review kindergarten application #A9349.
Substitute: Maya met with Daniel and his daughter to review kindergarten application #B5821.
Redact: [REDACTED_FIRST_NAME] met with [REDACTED_FIRST_NAME] and his daughter to review kindergarten application #[REDACTED_ID].
Annotate: <Alice, first_name> met with <Bob, first_name> and his daughter to review kindergarten application #<A9349, id>.
Hash: <HASH_FIRST_NAME_3bc51062973c> met with <HASH_FIRST_NAME_cd9fb1e148cc> and his daughter to review kindergarten application #<HASH_ID_f2a5f83e2a4c>.
Rewrite: The family met with the admissions counselor to review their school application.
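To make the difference between the token-style outputs concrete, here is a minimal standalone sketch of redact-, annotate-, and hash-style replacement. It does not use the NeMo Anonymizer API: the entity list is hand-written, the helper names (`redact`, `annotate`, `hash_token`, `apply_strategy`) are hypothetical, and the hash function (SHA-256 truncated to 12 hex characters) is an assumption, so its digests will not match the ones shown above.

```python
import hashlib
import re

# Hypothetical, hand-written entities as (surface text, label) pairs.
# In the real tool these come from the detection step.
ENTITIES = [("Alice", "first_name"), ("Bob", "first_name"), ("A9349", "id")]

TEXT = (
    "Alice met with Bob and his daughter to review "
    "kindergarten application #A9349."
)


def redact(value: str, label: str) -> str:
    """Replace the entity with a label-only placeholder."""
    return f"[REDACTED_{label.upper()}]"


def annotate(value: str, label: str) -> str:
    """Keep the original value and tag it with its label."""
    return f"<{value}, {label}>"


def hash_token(value: str, label: str) -> str:
    """Replace the entity with a deterministic digest.

    Assumption: SHA-256 truncated to 12 hex characters. The hashing
    scheme used by the tool itself is not stated in this document.
    """
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]
    return f"<HASH_{label.upper()}_{digest}>"


def apply_strategy(text, entities, strategy):
    """Apply one strategy to every entity in a single regex pass."""
    labels = dict(entities)
    # Longest values first so a longer entity wins over a shorter prefix.
    values = sorted(labels, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(v) for v in values))
    return pattern.sub(
        lambda m: strategy(m.group(0), labels[m.group(0)]), text
    )


if __name__ == "__main__":
    for strategy in (redact, annotate, hash_token):
        print(apply_strategy(TEXT, ENTITIES, strategy))
```

The substitute and rewrite outputs are not sketched because they depend on generated content rather than a fixed template. Note that a deterministic hash maps the same value to the same token everywhere, which keeps repeated mentions linkable across records.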
Get Started
Install
pip install nemo-anonymizer
Setup
# Get an API key from build.nvidia.com
export NVIDIA_API_KEY="your-nvidia-api-key"
By default, Anonymizer uses NVIDIA-hosted models for detection and LLM-based anonymization. You can also bring your own models.
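Because the default setup relies on `NVIDIA_API_KEY` being exported, a small pre-flight check in Python can catch a missing key before any model call is made. This is a sketch using only the standard library; `require_api_key` is a hypothetical helper, not part of the package.

```python
import os


def require_api_key(name: str = "NVIDIA_API_KEY") -> str:
    """Return the API key from the environment, or fail with a clear message."""
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set; export it before running Anonymizer."
        )
    return key
```

Calling `require_api_key()` at the top of a script fails fast with a readable error instead of surfacing an authentication problem partway through a run.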
Record length
Records of up to 2,000 tokens each work with the default model configs. Longer records require adjusting the model providers and model configs.
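To spot records that may exceed this limit before a run, a rough length check can help. The sketch below assumes about four characters per token, a common rule of thumb for English text; the actual count depends on the tokenizer of the configured model, which this document does not specify. `estimate_tokens` and `find_long_records` are hypothetical helpers.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; not a real tokenizer."""
    return math.ceil(len(text) / chars_per_token)


def find_long_records(texts, limit: int = 2000, chars_per_token: float = 4.0):
    """Return the indices of records whose estimated length exceeds `limit`."""
    return [
        i
        for i, text in enumerate(texts)
        if estimate_tokens(text, chars_per_token) > limit
    ]
```

Treat the result as a hint only: records near the threshold should be checked with the model's own tokenizer.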
Anonymize
from anonymizer import Anonymizer, AnonymizerConfig, AnonymizerInput, Substitute

# Describe the input: a CSV file and the column that holds the text.
input_data = AnonymizerInput(
    source="patient_data.csv",
    text_column="notes",
    data_summary="Records containing detailed notes on patient encounters",
)

anonymizer = Anonymizer()
config = AnonymizerConfig(replace=Substitute())
anonymizer.validate_config(config)

# Preview a few records first.
preview = anonymizer.preview(
    config=config,
    data=input_data,
    num_records=5,
)
preview.display_record()

# Inspect the preview, adjust parameters if needed, then run on the full dataset.
output = anonymizer.run(
    config=config,
    data=input_data,
)
The equivalent steps from the command line:
# Preview a few rows first
anonymizer preview \
    --source patient_data.csv \
    --text-column notes \
    --data-summary "Records containing detailed notes on patient encounters" \
    --replace substitute \
    --num-records 5

# Then run the full dataset
anonymizer run \
    --source patient_data.csv \
    --text-column notes \
    --data-summary "Records containing detailed notes on patient encounters" \
    --replace substitute \
    --output patient_data_anonymized.csv
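The examples above read `patient_data.csv` with a `notes` column. To try them without real data, a small synthetic file can be generated with the standard library. The rows below are invented placeholders (the first is the sample sentence from this page), and `write_sample_csv` is a hypothetical helper, not part of the package.

```python
import csv
from pathlib import Path

# Synthetic notes for experimentation only; none describe real people.
SAMPLE_NOTES = [
    "Alice met with Bob and his daughter to review kindergarten application #A9349.",
    "Patient Jane Roe, seen on 2024-01-15, reported mild headaches.",
    "Follow-up call with John Doe scheduled; callback number 555-0100.",
]


def write_sample_csv(path="patient_data.csv", text_column="notes"):
    """Write the synthetic notes to a one-column CSV and return its path."""
    path = Path(path)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=[text_column])
        writer.writeheader()
        for note in SAMPLE_NOTES:
            writer.writerow({text_column: note})
    return path
```

Calling `write_sample_csv()` in the working directory creates a file that matches the `source` and `text_column` values used above.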
data_summary improves detection
data_summary is optional but recommended for domain-specific data. It helps the LLM find more entities and reduce false drops.
Inspect
View an interactive visualization with entity highlights.
preview.display_record()
Access the main results: original text, entities, and transformed text.
preview.dataframe
Access the full pipeline trace with all internal columns.
preview.trace_dataframe