🕵️ NeMo Anonymizer

GitHub License Python 3.11+

NeMo Anonymizer detects and protects PII through context-aware replacement and rewriting. It offers high-quality user-guided entity detection, followed by modification options that maintain context while inducing privacy. You can review what sensitive information was found, adjust your masking strategy, and generate anonymized text.

Pick a strategy:

Original:
Alice met with Bob and his daughter to review kindergarten application #A9349.

Substitute:
Maya met with Daniel and his daughter to review kindergarten application #B5821.

Redact:
[REDACTED_FIRST_NAME] met with [REDACTED_FIRST_NAME] and his daughter to review kindergarten application #[REDACTED_ID].

Annotate:
<Alice, first_name> met with <Bob, first_name> and his daughter to review kindergarten application #<A9349, id>.

Hash:
<HASH_FIRST_NAME_3bc51062973c> met with <HASH_FIRST_NAME_cd9fb1e148cc> and his daughter to review kindergarten application #<HASH_ID_f2a5f83e2a4c>.

Rewrite:
The family met with the admissions counselor to review their school application.
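To make the difference between the span-level output formats concrete, here is a minimal sketch of how the `[REDACTED_...]`, `<value, label>`, and `<HASH_...>` forms could be produced once entity spans are known. This is not Anonymizer's implementation: the helper names (`find_spans`, `apply_strategy`, `redact`, `annotate`, `hash_value`), the hard-coded entities, and the choice of a truncated SHA-256 digest are all assumptions made for illustration, so the hash values will not match the ones shown above. Substitution and rewriting rely on an LLM and are not sketched here.

```python
import hashlib

TEXT = "Alice met with Bob to review application #A9349."

# Hypothetical detections as (value, label) pairs. In Anonymizer these come
# from the detection step; here they are hard-coded for illustration.
DETECTED = [("Alice", "first_name"), ("Bob", "first_name"), ("A9349", "id")]


def find_spans(text, detected):
    """Turn (value, label) pairs into (start, end, label) character spans."""
    spans = []
    for value, label in detected:
        start = text.index(value)
        spans.append((start, start + len(value), label))
    return sorted(spans)


def apply_strategy(text, spans, render):
    """Replace each span with render(value, label), working right to left
    so that the offsets of earlier spans stay valid."""
    out = text
    for start, end, label in sorted(spans, reverse=True):
        out = out[:start] + render(text[start:end], label) + out[end:]
    return out


def redact(value, label):
    return f"[REDACTED_{label.upper()}]"


def annotate(value, label):
    return f"<{value}, {label}>"


def hash_value(value, label):
    # Assumption: a truncated SHA-256 digest stands in for the real hash scheme.
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]
    return f"<HASH_{label.upper()}_{digest}>"


SPANS = find_spans(TEXT, DETECTED)
for render in (redact, annotate, hash_value):
    print(apply_strategy(TEXT, SPANS, render))
```

Redaction discards the value, annotation keeps it alongside its label, and hashing maps equal values to equal tokens, which is what lets repeated mentions of the same entity stay linkable after anonymization.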


Get Started

Install

pip install nemo-anonymizer

Setup

# Get an API key from build.nvidia.com
export NVIDIA_API_KEY="your-nvidia-api-key"

By default, Anonymizer uses NVIDIA-hosted models for detection and LLM-based anonymization. You can also bring your own models.

Default hosted models are best for experimentation

The default build.nvidia.com (NVIDIA Build) setup is a convenient way to try Anonymizer and iterate on previews. Use of NVIDIA Build is subject to NVIDIA Build's own terms of service and privacy practices, which are separate from and independent of the NeMo Framework library. NVIDIA Build is intended for evaluation and testing purposes only and may not be used in production environments. Do not upload any confidential information or personal data when using NVIDIA Build. Your use of NVIDIA Build is logged for security purposes and to improve NVIDIA products and services.

Request and token rate limits on build.nvidia.com vary by account and model access, and lower-volume development access can be slow for full-dataset runs. Start with preview() on a small sample, then move to your own endpoint for production data and usage.

Record length

Records up to 2,000 tokens each work with the default model configs. Longer text will require adjustment of model providers and model configs.
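Before a run, it can help to flag records that are likely to exceed the 2,000-token guideline. The sketch below is one rough way to do that. It assumes about four characters per token, a common rule of thumb for English text rather than the tokenizer Anonymizer uses, and the helper names `estimate_tokens` and `split_by_token_budget` are hypothetical. Treat the result as a screening step and use a real tokenizer when the count matters.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. The 4-characters-per-token ratio is a heuristic
    for English text, not a property of any particular model."""
    return math.ceil(len(text) / chars_per_token)


def split_by_token_budget(records, max_tokens: int = 2000):
    """Return (within, over): indices of records at or under the budget,
    and indices of records estimated to exceed it."""
    within, over = [], []
    for index, text in enumerate(records):
        target = within if estimate_tokens(text) <= max_tokens else over
        target.append(index)
    return within, over


records = ["A short biography.", "x" * 9000]
within, over = split_by_token_budget(records)
print("within budget:", within)
print("likely too long:", over)
```

Records that land in the second list are candidates for chunking or for a model configuration with a larger context window.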

Anonymize

from anonymizer import Anonymizer, AnonymizerConfig, AnonymizerInput, Substitute

input_data = AnonymizerInput(
    source="https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv",
    text_column="biography",
    data_summary="Biographical profiles of individuals",
)
anonymizer = Anonymizer()
config = AnonymizerConfig(replace=Substitute())

anonymizer.validate_config(config)

preview = anonymizer.preview(
    config=config,
    data=input_data,
    num_records=5,
)
preview.display_record()

# Inspect the preview, adjust parameters if needed, then run full data.
output = anonymizer.run(
    config=config,
    data=input_data,
)

The same workflow from the CLI:

# Preview a few rows first
anonymizer preview \
  --source "https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv" \
  --text-column biography \
  --data-summary "Biographical profiles of individuals" \
  --replace substitute \
  --num-records 5

# Then run the full dataset
anonymizer run \
  --source "https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv" \
  --text-column biography \
  --data-summary "Biographical profiles of individuals" \
  --replace substitute \
  --output biographies_anonymized.csv

data_summary improves detection

data_summary is optional but recommended for domain-specific data. It helps the LLM find more entities and reduce false drops.

Language And Regional Coverage

Anonymizer has been tested most extensively on English-language data. Multilingual quality has not yet been evaluated systematically across languages, domains, and models.

Although testing so far has been primarily in English, the supported entity set is not limited to U.S.-specific identifiers. Detection and anonymization can also apply to international formats such as non-U.S. phone numbers, addresses, legal references, and national or regional identification numbers, though coverage will vary by language, region, and model configuration.

If you are working with another language, we encourage you to experiment on a small sample first with preview(), validate detected entities and transformed output carefully, and adjust your model providers and model configs as needed.

Inspect

View an interactive visualization with entity highlights.

preview.display_record()

Access the main results -- original text, entities, and transformed text.

preview.dataframe

Access the full pipeline trace with all internal columns.

preview.trace_dataframe


Telemetry and Privacy

NeMo Anonymizer can optionally share anonymous run-level telemetry with NVIDIA for product improvement. One event is emitted per Anonymizer.run() or Anonymizer.preview() invocation, and it contains only technical metadata:

  • Run outcome — final task status (completed / error / canceled) and wall-clock duration
  • Pipeline configuration — transformation type (annotate, redact, hash, substitute, rewrite), whether data_summary / privacy_goal / Substitute(instructions=...) were customized, max_repair_iterations, strict_entity_protection
  • Models used per step — model aliases for the detector, validator, augmenter, rewriter, etc. (whichever steps ran in this mode)
  • Model hosts — coarse classification of the inference endpoints used (nvidia-build, nvidia-internal, openrouter, local, other)
  • Aggregate counts — number of input records, success and failure counts, average tokens per record (estimated with tiktoken cl100k_base), and failure attribution by pipeline workflow
  • Deployment type — sdk or cli

No user data, record contents, prompts, model outputs, or device information are collected. Aggregate usage data (such as which models are most popular) will be shared back with the community; it is not used to track any individual user behavior.

You may opt out of telemetry collection at any time. Opting out applies only to data collection by NeMo Anonymizer itself.

To disable telemetry in the SDK, set emit_telemetry=False on AnonymizerConfig:

config = AnonymizerConfig(replace=Redact(), emit_telemetry=False)

To disable telemetry for one CLI invocation, pass --no-emit-telemetry:

uv run anonymizer run --source data.csv --text-column text --replace redact --no-emit-telemetry

To disable telemetry for the current shell, set NEMO_TELEMETRY_ENABLED=false (other accepted disabling values: 0, no) in your environment before running:

export NEMO_TELEMETRY_ENABLED=false
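When the shell is not under your control (for example in a notebook), the same variable can be set from Python. The sketch below assumes nothing about when the library reads NEMO_TELEMETRY_ENABLED. Setting it before importing anonymizer covers both the import-time and the run-time case. The helper name `disable_nemo_telemetry` is hypothetical.

```python
import os


def disable_nemo_telemetry() -> None:
    """Set the documented opt-out variable for this process and its children."""
    os.environ["NEMO_TELEMETRY_ENABLED"] = "false"


disable_nemo_telemetry()

# Import and use the library only after the variable is set, for example:
# from anonymizer import Anonymizer
```

This only affects the current Python process and anything it launches. It does not change your shell or other sessions.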

Use of third-party endpoints, including NVIDIA Build: Anonymizer can be configured to use various inference endpoints, including build.nvidia.com, OpenRouter, or local model servers. If you choose to use a third-party endpoint, that endpoint's own terms of service and privacy practices apply independently of this library. Any opt-out you exercise within Anonymizer does not extend to data collection by your chosen endpoint.


Next up

  • Detect

    Refine how to search for entities.

  • Replace

    Customize how to replace entities -- substitute, redact, annotate, or hash.

  • Rewrite

    Generate a privacy-safe paraphrase of the entire text.

  • Tutorials

    End-to-end notebooks for replace and rewrite.