🕵️ NeMo Anonymizer

Python 3.11+

NeMo Anonymizer detects and protects PII through context-aware replacement and rewriting. It offers high-quality, user-guided entity detection, followed by modification options that preserve context while protecting privacy. You can review what sensitive information was found, adjust your masking strategy, and generate anonymized text.

Pick a strategy:

Original:
Alice met with Bob and his daughter to review kindergarten application #A9349.

Substitute:
Maya met with Daniel and his daughter to review kindergarten application #B5821.

Redact:
[REDACTED_FIRST_NAME] met with [REDACTED_FIRST_NAME] and his daughter to review kindergarten application #[REDACTED_ID].

Annotate:
<Alice, first_name> met with <Bob, first_name> and his daughter to review kindergarten application #<A9349, id>.

Hash:
<HASH_FIRST_NAME_3bc51062973c> met with <HASH_FIRST_NAME_cd9fb1e148cc> and his daughter to review kindergarten application #<HASH_ID_f2a5f83e2a4c>.

Rewrite:
The family met with the admissions counselor to review their school application.
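
To make the token-style outputs above more concrete, here is a minimal Python sketch that applies redact, annotate, and hash formatting to entity spans that have already been detected. It is not Anonymizer's implementation: the `Entity` type, the `span`, `render`, and `apply_strategy` helpers, and the use of a truncated SHA-256 digest for hash tokens are all assumptions made for illustration, so the hash values it produces will not match the ones shown above.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A detected span: [start, end) offsets plus a label (hypothetical type)."""
    start: int
    end: int
    label: str


def span(text: str, value: str, label: str) -> Entity:
    """Build an Entity for the first occurrence of value in text."""
    start = text.index(value)
    return Entity(start, start + len(value), label)


def render(value: str, label: str, strategy: str) -> str:
    """Format one replacement token in the style of the examples above."""
    if strategy == "redact":
        return f"[REDACTED_{label.upper()}]"
    if strategy == "annotate":
        return f"<{value}, {label}>"
    if strategy == "hash":
        # Assumption: truncated SHA-256; the real hashing scheme is not documented here.
        digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]
        return f"<HASH_{label.upper()}_{digest}>"
    raise ValueError(f"unknown strategy: {strategy}")


def apply_strategy(text: str, entities: list[Entity], strategy: str) -> str:
    # Replace from right to left so earlier offsets remain valid.
    for ent in sorted(entities, key=lambda e: e.start, reverse=True):
        replacement = render(text[ent.start:ent.end], ent.label, strategy)
        text = text[:ent.start] + replacement + text[ent.end:]
    return text


text = "Alice met with Bob to review application #A9349."
entities = [
    span(text, "Alice", "first_name"),
    span(text, "Bob", "first_name"),
    span(text, "A9349", "id"),
]

for strategy in ("redact", "annotate", "hash"):
    print(strategy, "->", apply_strategy(text, entities, strategy))
```

Substitute and rewrite are left out of the sketch because they rely on an LLM to generate realistic replacements or a paraphrase, which simple string formatting cannot reproduce.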


Get Started

Install

pip install nemo-anonymizer

Setup

# Get an API key from build.nvidia.com
export NVIDIA_API_KEY="your-nvidia-api-key"

By default, Anonymizer uses NVIDIA-hosted models for detection and LLM-based anonymization. You can also bring your own models.
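
Since the default setup depends on the `NVIDIA_API_KEY` environment variable, a quick check at the top of a script can fail fast with a clearer message than a rejected request. This is a minimal sketch; `require_api_key` is a hypothetical helper and not part of Anonymizer.

```python
import os


def require_api_key(name: str = "NVIDIA_API_KEY") -> str:
    """Return the key from the environment, or raise a descriptive error."""
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Export it before using the default hosted models."
        )
    return key
```

Calling `require_api_key()` once before running anything else is enough. If you bring your own models, the check may not apply.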

Default hosted models are best for experimentation

The default build.nvidia.com setup is a convenient way to try Anonymizer and iterate on previews. For privacy-sensitive or production data, configure Anonymizer to use a secure endpoint you trust and to which you are comfortable sending data.

Request and token rate limits on build.nvidia.com vary by account and model access, and lower-volume development access can be slow for full-dataset runs. Start with preview() on a small sample, then move to your own endpoint if you need stronger privacy guarantees or higher throughput.

Record length

Records of up to 2,000 tokens each work with the default model configs. For longer text, adjust your model providers and model configs.
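
To spot records that may exceed this limit before a run, a rough pre-check can help. The sketch below assumes about four characters per token, which is only a heuristic for English text; the real count depends on the tokenizer of the model you configure. `estimate_tokens` and `flag_long_records` are hypothetical helpers, not part of Anonymizer.

```python
import math

MAX_TOKENS = 2000      # limit quoted above for the default model configs
CHARS_PER_TOKEN = 4    # assumption: rough heuristic, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Estimate the token count from the character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def flag_long_records(records: list[str], max_tokens: int = MAX_TOKENS) -> list[int]:
    """Return the indexes of records whose estimated length exceeds max_tokens."""
    return [i for i, text in enumerate(records) if estimate_tokens(text) > max_tokens]


records = ["short note", "x" * 10_000]
print(flag_long_records(records))  # [1]
```

Because the estimate is approximate, records that land near the limit are worth checking with the actual tokenizer.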

Anonymize

from anonymizer import Anonymizer, AnonymizerConfig, AnonymizerInput, Substitute

input_data = AnonymizerInput(
    source="patient_data.csv",
    text_column="notes",
    data_summary="Records containing detailed notes on patient encounters",
)
anonymizer = Anonymizer()
config = AnonymizerConfig(replace=Substitute())

anonymizer.validate_config(config)

preview = anonymizer.preview(
    config=config,
    data=input_data,
    num_records=5,
)
preview.display_record()

# Inspect the preview, adjust parameters if needed, then run full data.
output = anonymizer.run(
    config=config,
    data=input_data,
)

The same workflow from the command line:

# Preview a few rows first
anonymizer preview \
  --source patient_data.csv \
  --text-column notes \
  --data-summary "Records containing detailed notes on patient encounters" \
  --replace substitute \
  --num-records 5

# Then run the full dataset
anonymizer run \
  --source patient_data.csv \
  --text-column notes \
  --data-summary "Records containing detailed notes on patient encounters" \
  --replace substitute \
  --output patient_data_anonymized.csv

data_summary improves detection

data_summary is optional but recommended for domain-specific data. It helps the LLM find more entities and reduces false drops.

Language And Regional Coverage

Anonymizer has been tested most extensively on English-language data. Multilingual quality has not yet been evaluated systematically across languages, domains, and models.

Although testing so far has been primarily in English, the supported entity set is not limited to U.S.-specific identifiers. Detection and anonymization can also apply to international formats such as non-U.S. phone numbers, addresses, legal references, and national or regional identification numbers, though coverage will vary by language, region, and model configuration.

If you are working with another language, we encourage you to experiment on a small sample first with preview(), validate detected entities and transformed output carefully, and adjust your model providers and model configs as needed.
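
One simple check when validating transformed output is to confirm that none of the original entity values survive verbatim. The sketch below works on plain strings. `find_leaks` is a hypothetical helper, and how you extract entity values from the preview results depends on the columns Anonymizer returns, which are not shown here.

```python
def find_leaks(transformed: str, entity_values: list[str]) -> list[str]:
    """Return the original entity values that still appear in the transformed text."""
    haystack = transformed.casefold()
    # Case-insensitive substring match; empty values are skipped.
    return [v for v in entity_values if v and v.casefold() in haystack]


original_values = ["Alice", "Bob", "A9349"]
transformed = (
    "Maya met with Daniel and his daughter to review "
    "kindergarten application #B5821."
)
print(find_leaks(transformed, original_values))  # []
```

Substring matching is deliberately strict: a short value such as "Bob" would also be flagged inside a longer word, and paraphrased or partially altered values are not caught, so treat the result as a prompt for manual review rather than a guarantee.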

Inspect

View an interactive visualization with entity highlights.

preview.display_record()

Access the main results -- original text, entities, and transformed text.

preview.dataframe

Access the full pipeline trace with all internal columns.

preview.trace_dataframe


Next up

  • Detect

    Refine how to search for entities.

  • Replace

    Customize how to replace entities -- substitute, redact, annotate, or hash.

  • Rewrite

    Generate a privacy-safe paraphrase of the entire text.

  • Tutorials

    End-to-end notebooks for replace and rewrite.