🕵️ NeMo Anonymizer¶

NeMo Anonymizer detects and protects PII through context-aware replacement and rewriting. It offers user-guided entity detection, followed by modification options that preserve context while protecting privacy. You can review what sensitive information was found, adjust your masking strategy, and generate anonymized text.

Pick a strategy. Each one transforms the same input differently.

Input

Alice met with Bob and his daughter to review kindergarten application #A9349.

Substitute

Maya met with Daniel and his daughter to review kindergarten application #B5821.

Redact

[REDACTED_FIRST_NAME] met with [REDACTED_FIRST_NAME] and his daughter to review kindergarten application #[REDACTED_ID].

Annotate

<Alice, first_name> met with <Bob, first_name> and his daughter to review kindergarten application #<A9349, id>.

Hash

<HASH_FIRST_NAME_3bc51062973c> met with <HASH_FIRST_NAME_cd9fb1e148cc> and his daughter to review kindergarten application #<HASH_ID_f2a5f83e2a4c>.

Rewrite

The family met with the admissions counselor to review their school application.
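
To make the Redact, Annotate, and Hash outputs above concrete, here is a minimal standalone sketch in plain Python. It does not use or reproduce Anonymizer's internals: the helper names (`find_spans`, `apply_strategy`, `redact`, `annotate`, `hash_entity`) are hypothetical, the entity list is hand-written rather than detected, and the hash uses a truncated SHA-256 as an assumption, so its digests will not match the ones shown above.

```python
import hashlib

TEXT = "Alice met with Bob and his daughter to review kindergarten application #A9349."

# Hand-written (value, label) pairs standing in for detector output (assumption).
DETECTED = [("Alice", "first_name"), ("Bob", "first_name"), ("A9349", "id")]


def find_spans(text, detected):
    """Locate the first occurrence of each value and return (start, end, label)."""
    spans = []
    for value, label in detected:
        start = text.index(value)
        spans.append((start, start + len(value), label))
    return spans


def redact(value, label):
    """Replace the entity with a label-only placeholder."""
    return f"[REDACTED_{label.upper()}]"


def annotate(value, label):
    """Keep the original value and tag it with its label."""
    return f"<{value}, {label}>"


def hash_entity(value, label, digest_len=12):
    """Replace the entity with a deterministic digest (truncated SHA-256, an assumption)."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:digest_len]
    return f"<HASH_{label.upper()}_{digest}>"


def apply_strategy(text, spans, strategy):
    """Apply a strategy to every span, right to left so earlier offsets stay valid."""
    out = text
    for start, end, label in sorted(spans, reverse=True):
        out = out[:start] + strategy(out[start:end], label) + out[end:]
    return out


spans = find_spans(TEXT, DETECTED)
for strategy in (redact, annotate, hash_entity):
    print(apply_strategy(TEXT, spans, strategy))
```

Substitute and Rewrite are not sketched here because they rely on an LLM to generate realistic replacements or a paraphrase.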

Get Started¶

Install¶

pip install nemo-anonymizer

Setup¶

# Get an API key from build.nvidia.com
export NVIDIA_API_KEY="your-nvidia-api-key"

By default, Anonymizer uses NVIDIA-hosted models for detection and LLM-based anonymization. You can also bring your own models.
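
Because the default setup depends on `NVIDIA_API_KEY`, it can help to fail fast when the variable is missing. The following is a small sketch using only the standard library; `require_api_key` is a hypothetical helper, not part of Anonymizer, and treating the placeholder string from the export line above as "unset" is an assumption.

```python
import os


def require_api_key(env=None, name="NVIDIA_API_KEY"):
    """Return the API key from the environment, or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    # Treat the documentation placeholder as unset (assumption).
    if not key or key == "your-nvidia-api-key":
        raise RuntimeError(f"{name} is not set; export it before running Anonymizer.")
    return key


if __name__ == "__main__":
    try:
        require_api_key()
        print("NVIDIA_API_KEY found.")
    except RuntimeError as err:
        print(err)
```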

Record length

Records of up to 2,000 tokens each work with the default model configs. Longer records require adjusting the model providers and model configs.
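
One way to spot records that may exceed this limit before running is a rough length check. The sketch below is an approximation under stated assumptions: it estimates tokens at about four characters each, which is a common rule of thumb and not the tokenizer used by the default models, and `estimate_tokens` and `find_long_records` are hypothetical helpers.

```python
import csv
import io

CHARS_PER_TOKEN = 4  # assumption: rough heuristic, not the actual tokenizer
TOKEN_LIMIT = 2000   # limit stated for the default model configs


def estimate_tokens(text):
    """Estimate token count as ceil(len(text) / CHARS_PER_TOKEN)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def find_long_records(rows, text_column, limit=TOKEN_LIMIT):
    """Return the indices of rows whose text is estimated to exceed the limit."""
    return [
        i for i, row in enumerate(rows)
        if estimate_tokens(row[text_column] or "") > limit
    ]


# In-memory CSV standing in for a real file such as patient_data.csv.
sample = io.StringIO('notes\n"short note"\n"' + "x" * 9000 + '"\n')
rows = list(csv.DictReader(sample))
print(find_long_records(rows, "notes"))  # prints [1]
```

Records flagged this way are only candidates; an exact count would require the tokenizer of whichever model is configured.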

Anonymize¶

The Python example comes first, followed by the same preview-then-run workflow from the CLI.

from anonymizer import Anonymizer, AnonymizerConfig, AnonymizerInput, Substitute

input_data = AnonymizerInput(
    source="patient_data.csv",
    text_column="notes",
    data_summary="Records containing detailed notes on patient encounters",
)
anonymizer = Anonymizer()
config = AnonymizerConfig(replace=Substitute())

anonymizer.validate_config(config)

preview = anonymizer.preview(
    config=config,
    data=input_data,
    num_records=5,
)
preview.display_record()

# Inspect the preview, adjust parameters if needed, then run on the full dataset.
output = anonymizer.run(
    config=config,
    data=input_data,
)

# Preview a few rows first
anonymizer preview \
  --source patient_data.csv \
  --text-column notes \
  --data-summary "Records containing detailed notes on patient encounters" \
  --replace substitute \
  --num-records 5

# Then run the full dataset
anonymizer run \
  --source patient_data.csv \
  --text-column notes \
  --data-summary "Records containing detailed notes on patient encounters" \
  --replace substitute \
  --output patient_data_anonymized.csv
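
If the CLI is driven from a script, the commands above can be assembled as argument lists rather than shell strings, which avoids quoting problems in `--data-summary`. This is a sketch only: `build_anonymizer_command` is a hypothetical helper, and it uses just the subcommands and flags that appear in the commands above.

```python
import shlex


def build_anonymizer_command(subcommand, source, text_column, replace,
                             data_summary=None, num_records=None, output=None):
    """Build an argument list for `anonymizer preview` or `anonymizer run`."""
    args = ["anonymizer", subcommand, "--source", source, "--text-column", text_column]
    if data_summary:
        args += ["--data-summary", data_summary]
    args += ["--replace", replace]
    if num_records is not None:
        args += ["--num-records", str(num_records)]
    if output:
        args += ["--output", output]
    return args


preview_cmd = build_anonymizer_command(
    "preview",
    source="patient_data.csv",
    text_column="notes",
    replace="substitute",
    data_summary="Records containing detailed notes on patient encounters",
    num_records=5,
)
# Print a shell-safe rendering of the command for review.
print(shlex.join(preview_cmd))
```

The resulting list can be passed to `subprocess.run(preview_cmd, check=True)` once the preview command looks right.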

data_summary improves detection

data_summary is optional but recommended for domain-specific data. It helps the LLM find more entities and reduce false drops.

Inspect¶

View an interactive visualization with entity highlights.

preview.display_record()

Access the main results -- original text, entities, and transformed text.

preview.dataframe

Access the full pipeline trace with all internal columns.

preview.trace_dataframe

Next up¶

Detect

Refine how to search for entities.

Replace

Customize how to replace entities -- substitute, redact, annotate, or hash.

Rewrite

Generate a privacy-safe paraphrase of the entire text.

Tutorials

End-to-end notebooks for replace and rewrite.