Your First Anonymization
Detect sensitive entities and replace them with LLM-generated substitutes -- the simplest end-to-end example of Anonymizer.
What you'll learn
- Load a CSV dataset and configure Anonymizer in a few lines
- Preview anonymized results on a small sample before committing to a full run
- Inspect entity detection and replacement with `display_record()`
- Process the full dataset with `run()`
Tip: First time running notebooks? Start with setup instructions.
Setup
- Check that your `NVIDIA_API_KEY` from build.nvidia.com is registered for model access.
  - The default `build.nvidia.com` (NVIDIA Build) setup is a convenient way to try Anonymizer and iterate on previews. Use of NVIDIA Build is subject to NVIDIA Build's own terms of service and privacy practices, which are separate from and independent of the NeMo Framework library. NVIDIA Build is intended for evaluation and testing purposes only and may not be used in production environments. Do not upload any confidential information or personal data when using NVIDIA Build. Your use of NVIDIA Build is logged for security purposes and to improve NVIDIA products and services.
  - Request and token rate limits on `build.nvidia.com` vary by account and model access, and lower-volume development access can be slow for full-dataset runs. Start with `preview()` on a small sample, then move to your own endpoint for production data and usage.
- Import the core Anonymizer classes: `Anonymizer`, `AnonymizerConfig`, `AnonymizerInput`, and `Substitute`.
- `Anonymizer()` initializes with the default model provider -- no extra config needed.
- `configure_logging(LoggingConfig.default())` keeps logs at INFO. Switch to `LoggingConfig.debug()` when troubleshooting.
In [2]:
import getpass
import os

if not os.getenv("NVIDIA_API_KEY"):
    key = getpass.getpass("Enter NVIDIA_API_KEY from build.nvidia.com: ").strip()
    if not key:
        raise RuntimeError("NVIDIA_API_KEY is required to run these notebooks.")
    os.environ["NVIDIA_API_KEY"] = key
In [ ]:
from anonymizer import Anonymizer, AnonymizerConfig, AnonymizerInput, LoggingConfig, Substitute, configure_logging
configure_logging(LoggingConfig.default())
In [4]:
anonymizer = Anonymizer()
[13:13:46] [INFO] Anonymizer initialized with 3 model configs
[13:13:46] [INFO] |-- detector: gliner-pii-detector
[13:13:46] [INFO] |-- validator: gpt-oss-120b
[13:13:46] [INFO] |-- augmenter: gpt-oss-120b
Load data and configure
- `AnonymizerInput` points to your CSV and names the text column.
- `data_summary` gives the LLM context about the kind of text it will process.
- Records up to 2,000 tokens each work with the default model configs.
- `AnonymizerConfig` with `Substitute()` tells Anonymizer to replace detected entities with LLM-generated synthetic values for names, cities, dates, etc.
In [5]:
input_data = AnonymizerInput(
    source="https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv",
    text_column="biography",
    data_summary="Biographical profiles of individuals",
)

config = AnonymizerConfig(replace=Substitute())
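Since the default model configs are described as handling records of up to 2,000 tokens, it can help to screen a dataset for unusually long records before previewing. The sketch below is a rough standard-library check, not part of Anonymizer: `estimate_tokens` and `find_long_records` are hypothetical helper names, and the 4-characters-per-token ratio is an assumed rule of thumb rather than the tokenizer the models actually use.

```python
import csv
import io

# Assumption: roughly 4 characters per token for English text.
# This is NOT the tokenizer used by Anonymizer's models; treat it as a coarse screen.
CHARS_PER_TOKEN = 4
TOKEN_LIMIT = 2_000  # per-record limit quoted for the default model configs


def estimate_tokens(text: str) -> int:
    """Return a rough token estimate based on character count (rounded up)."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def find_long_records(csv_text: str, text_column: str, limit: int = TOKEN_LIMIT) -> list[int]:
    """Return row indices whose text column likely exceeds `limit` tokens."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [i for i, row in enumerate(reader) if estimate_tokens(row[text_column]) > limit]


# Tiny in-memory CSV standing in for a real file (hypothetical data).
sample_csv = "biography\n" + "short bio\n" + ("word " * 3000).strip() + "\n"
long_rows = find_long_records(sample_csv, text_column="biography")
print(long_rows)
```

Records flagged this way are only candidates for a closer look; a record near the limit may still be fine, or may need splitting, depending on the real tokenizer.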
Preview
- `preview()` runs on a small sample so you can iterate quickly.
- Always preview before processing the full dataset -- it's the fastest way to catch prompt or config issues early.
In [6]:
preview = anonymizer.preview(config=config, data=input_data, num_records=3)
[13:13:46] [INFO] Preview mode: Loaded 3 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv (column: 'biography')
[13:13:46] [INFO] Running entity detection on 3 records
[13:13:46] [INFO] detection labels in scope: (default: 65 labels; see anonymizer.DEFAULT_ENTITY_LABELS for list)
[13:14:17] [INFO] |-- Detection complete -- 80 entities found across 3 records (0 failed) [30.6s]
[13:14:17] [INFO] |-- labels: first_name=23, state=6, organization_name=6, age=5, occupation=5, city=5, company_name=4, last_name=3, race_ethnicity=3, language=3, political_view=3, education_level=3, field_of_study=2, religious_belief=2, street_address=2, degree=1, university=1, place_name=1, date_of_birth=1, employment_status=1
[13:14:17] [INFO] Running Substitute replacement
[13:15:14] [INFO] |-- Replacement complete (0 failed) [57.4s]
[13:15:14] [INFO] Pipeline complete -- 3 records processed, 0 total failures
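The `labels:` line in the detection log above gives a per-label entity count. If you want to compare label distributions between runs, a small parser is enough. The sketch below assumes the comma-separated `name=count` format shown in that log line; `parse_label_counts` is a hypothetical helper, not an Anonymizer API, and the log format itself may change between versions.

```python
def parse_label_counts(log_line: str) -> dict[str, int]:
    """Parse the 'labels: a=1, b=2' portion of a detection log line."""
    _, _, tail = log_line.partition("labels:")
    counts: dict[str, int] = {}
    for item in tail.split(","):
        name, sep, value = item.strip().partition("=")
        if sep and value.isdigit():
            counts[name] = int(value)
    return counts


# The labels line copied from the preview log above.
line = (
    "[13:14:17] [INFO] |-- labels: first_name=23, state=6, organization_name=6, "
    "age=5, occupation=5, city=5, company_name=4, last_name=3, race_ethnicity=3, "
    "language=3, political_view=3, education_level=3, field_of_study=2, "
    "religious_belief=2, street_address=2, degree=1, university=1, place_name=1, "
    "date_of_birth=1, employment_status=1"
)

counts = parse_label_counts(line)
total = sum(counts.values())
print(total)  # 80, matching "80 entities found" in the log
print(max(counts, key=counts.get))  # first_name
```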
Inspect
- `display_record()` shows the original text with highlighted entities, the replacement map, and the anonymized output -- all in one view.
- The result dataframe has original and substituted text side-by-side.
In [7]:
preview.display_record(0)
In [8]:
preview.display_record(1)
In [9]:
preview.dataframe
Out[9]:
| | biography | biography_with_spans | final_entities | biography_replaced |
|---|---|---|---|---|
| 0 | Bobby Watford, a 40-year-old Mexican veterinar... | <first_name>Bobby</first_name> <last_name>Watf... | {'entities': [{'end_position': 5, 'id': 'first... | Ethan Henderson, a 45-year-old Vietnamese mari... |
| 1 | Idilio Bell is a 37-year-old astronomer living... | <first_name>Idilio</first_name> <last_name>Bel... | {'entities': [{'end_position': 6, 'id': 'first... | Santiago Kumar is a 36-year-old geophysicist l... |
| 2 | Jodi Allison, 36, lives at 204 Bluegrass in Cl... | <first_name>Jodi</first_name> <last_name>Allis... | {'entities': [{'end_position': 4, 'id': 'first... | Sofia Keller, 42, lives at 587 Maple in Macon,... |
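The `biography_with_spans` column wraps each detected entity in a tag named after its label, as in `<first_name>Bobby</first_name>`. A minimal sketch for pulling those spans out of a tagged string follows. It assumes tags are not nested and that labels are lowercase identifiers, which matches the truncated output above but is not confirmed for every case; `extract_spans` and `count_labels` are hypothetical helpers, and the example string is made up.

```python
import re
from collections import Counter

# Assumption: spans look like <label>value</label>, are not nested,
# and labels consist of lowercase letters and underscores.
SPAN_PATTERN = re.compile(r"<([a-z_]+)>(.*?)</\1>", re.DOTALL)


def extract_spans(tagged_text: str) -> list[tuple[str, str]]:
    """Return (label, value) pairs in order of appearance."""
    return SPAN_PATTERN.findall(tagged_text)


def count_labels(tagged_text: str) -> Counter:
    """Count how many spans carry each label."""
    return Counter(label for label, _ in extract_spans(tagged_text))


# Hypothetical tagged string in the same style as the column above.
example = (
    "<first_name>Ada</first_name> <last_name>Example</last_name> "
    "lives in <city>Springfield</city>."
)

for label, value in extract_spans(example):
    print(f"{label}: {value}")
```

For anything beyond a quick look, the structured `final_entities` column is likely the more reliable source, since it carries positions as well as labels.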
Full run
- `run()` processes the entire dataset with the same config you previewed.
- Access the output via `result.dataframe`.
In [10]:
result = anonymizer.run(config=config, data=input_data)
print(result)
[13:15:14] [INFO] Loaded 25 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv (column: 'biography')
[13:15:14] [INFO] Running entity detection on 25 records
[13:15:14] [INFO] detection labels in scope: (default: 65 labels; see anonymizer.DEFAULT_ENTITY_LABELS for list)
[13:16:05] [INFO] |-- Detection complete -- 648 entities found across 25 records (0 failed) [50.3s]
[13:16:05] [INFO] |-- labels: first_name=152, city=48, occupation=45, company_name=40, education_level=33, race_ethnicity=31, state=30, organization_name=30, last_name=27, age=26, political_view=26, religious_belief=25, street_address=23, university=21, language=21, field_of_study=13, place_name=12, county=11, employment_status=10, date_of_birth=9, date=5, degree=4, school_name=1, landmark=1, journal_name=1, country=1, gender=1, postcode=1
[13:16:05] [INFO] Running Substitute replacement
[13:16:35] [INFO] |-- Replacement complete (0 failed) [30.5s]
[13:16:35] [INFO] Pipeline complete -- 25 records processed, 0 total failures
AnonymizerResult(rows=25, columns=4, trace_columns=21, failed_records=0)
In [11]:
result.dataframe.head()
Out[11]:
| | biography | biography_with_spans | final_entities | biography_replaced |
|---|---|---|---|---|
| 0 | Bobby Watford, a 40-year-old Mexican veterinar... | <first_name>Bobby</first_name> <last_name>Watf... | {'entities': array([{'end_position': 5, 'id': ... | Ethan Hernandez, a 52-year-old Filipino zoolog... |
| 1 | Idilio Bell is a 37-year-old astronomer living... | <first_name>Idilio</first_name> <last_name>Bel... | {'entities': array([{'end_position': 6, 'id': ... | Rafael Khan is a 42-year-old planetary geologi... |
| 2 | Jodi Allison, 36, lives at 204 Bluegrass in Cl... | <first_name>Jodi</first_name> <last_name>Allis... | {'entities': array([{'end_position': 4, 'id': ... | Leah Harper, 42, lives at 204 Willow in Eugene... |
| 3 | James Mills is a 69-year-old paramedic who liv... | <first_name>James</first_name> <last_name>Mill... | {'entities': array([{'end_position': 5, 'id': ... | Ethan Harper is a 71-year-old firefighter who ... |
| 4 | Nancy Burton is a 21-year-old cashier who live... | <first_name>Nancy</first_name> <last_name>Burt... | {'entities': array([{'end_position': 5, 'id': ... | Leah Hawkins is a 27-year-old stock clerk who ... |
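After a full run, one quick sanity check is whether any originally detected value still appears verbatim in the substituted text. The sketch below is an assumption-laden spot check rather than an Anonymizer feature: `leaked_values` and its `min_length` parameter are hypothetical, it relies on the `<label>value</label>` tagging seen in `biography_with_spans`, and plain substring matching can produce false positives for short or common values.

```python
import re

# Assumption: spans look like <label>value</label> and are not nested.
SPAN_PATTERN = re.compile(r"<([a-z_]+)>(.*?)</\1>", re.DOTALL)


def leaked_values(
    tagged_original: str, replaced_text: str, min_length: int = 3
) -> list[tuple[str, str]]:
    """Return (label, value) pairs whose original value still occurs in the replaced text.

    Values shorter than `min_length` are skipped to reduce accidental matches.
    """
    haystack = replaced_text.lower()
    leaks = []
    for label, value in SPAN_PATTERN.findall(tagged_original):
        if len(value) >= min_length and value.lower() in haystack:
            leaks.append((label, value))
    return leaks


# Hypothetical pair of strings; not taken from the dataset.
tagged = (
    "<first_name>Ada</first_name> <last_name>Example</last_name> "
    "lives in <city>Springfield</city>."
)
replaced = "Nora Sample lives in Springfield."

print(leaked_values(tagged, replaced))  # [('city', 'Springfield')]
```

On a result dataframe, the same function could be applied row by row to the `biography_with_spans` and `biography_replaced` columns. A reported match is a prompt to inspect the record, since a substitute can coincide with an original value by chance.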
Next steps
- Inspecting Detected Entities -- dig into what the detection pipeline found and debug quality.
- Choosing a Replacement Strategy -- compare Redact, Annotate, Hash, and Substitute side-by-side.
- Rewriting Biographies -- generate privacy-safe paraphrases instead of token-level replacements.