In [1]:

# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

# ---
# jupyter:
#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: percent
#       format_version: '1.3'
#   kernelspec:
#     display_name: Python 3
#     language: python
#     name: python3
# ---

🕵️ Inspecting Detected Entities¶

Dig into the entity detection pipeline output -- what was detected, what the LLM validator kept or dropped, and where entities appear in the text.

This notebook is for users who need to debug detection quality, tune labels and/or thresholds, or investigate downstream replacement or rewriting results.

We use Annotate mode because it preserves the original text while tagging each entity with its label, making it ideal for reviewing detection quality.

📚 What you'll learn¶

- Run the detection pipeline and inspect its output using Annotate mode
- View tagged text with entities marked inline
- Break down detected entities by label, source, and unique value
- Identify and triage failed records

Tip: First time running notebooks? Start with setup instructions.

⚙️ Setup¶

- Check if your NVIDIA_API_KEY from build.nvidia.com is registered for model access.
  - The default build.nvidia.com (NVIDIA Build) setup is a convenient way to try Anonymizer and iterate on previews. Use of NVIDIA Build is subject to NVIDIA Build's own terms of service and privacy practices, which are separate from and independent of the NeMo Framework library. NVIDIA Build is intended for evaluation and testing purposes only and may not be used in production environments. Do not upload any confidential information or personal data when using NVIDIA Build. Your use of NVIDIA Build is logged for security purposes and to improve NVIDIA products and services.
  - Request and token rate limits on build.nvidia.com vary by account and model access, and lower-volume development access can be slow for full-dataset runs. Start with preview() on a small sample, then move to your own endpoint for production data and usage.
- Import the core classes -- this notebook uses Annotate to keep original values visible.
- Anonymizer() initializes with the default model provider -- no extra config needed.
- configure_logging(LoggingConfig.default()) keeps logs at INFO. Switch to LoggingConfig.debug() when troubleshooting.

In [2]:

import getpass
import os
from collections import Counter

import pandas as pd

if not os.getenv("NVIDIA_API_KEY"):
    key = getpass.getpass("Enter NVIDIA_API_KEY from build.nvidia.com: ").strip()
    if not key:
        raise RuntimeError("NVIDIA_API_KEY is required to run these notebooks.")
    os.environ["NVIDIA_API_KEY"] = key

In [ ]:

from anonymizer import Annotate, Anonymizer, AnonymizerConfig, AnonymizerInput, LoggingConfig, configure_logging

configure_logging(LoggingConfig.default())

In [4]:

anonymizer = Anonymizer()

[13:16:40] [INFO] 🔧 Anonymizer initialized with 3 model configs

[13:16:40] [INFO]   |-- 🔎 detector:  gliner-pii-detector

[13:16:40] [INFO]   |-- ✅ validator: gpt-oss-120b

[13:16:40] [INFO]   |-- 🧩 augmenter: gpt-oss-120b

👁️ Preview¶

Detection runs as part of any strategy. Annotate keeps original text visible alongside entity labels -- ideal for debugging.
trace_dataframe exposes every internal pipeline column; that's what we explore below.

In [5]:

config = AnonymizerConfig(replace=Annotate())

input_data = AnonymizerInput(
    source="https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv",
    text_column="biography",
    data_summary="Biographical profiles",
)

result = anonymizer.preview(
    config=config,
    data=input_data,
    num_records=3,
)

[13:16:40] [INFO] 👀 Preview mode: 📂 Loaded 3 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv (column: 'biography')

[13:16:40] [INFO] 🔍 Running entity detection on 3 records

[13:16:40] [INFO] detection labels in scope: (default: 65 labels; see anonymizer.DEFAULT_ENTITY_LABELS for list)

[13:17:12] [INFO]   |-- 📋 Detection complete — 77 entities found across 3 records (0 failed) [31.9s]

[13:17:12] [INFO]   |-- labels: first_name=23, organization_name=7, state=6, age=5, occupation=5, city=5, last_name=3, race_ethnicity=3, language=3, company_name=3, political_view=3, education_level=3, street_address=2, degree=1, university=1, date_of_birth=1, field_of_study=1, employment_status=1, religious_belief=1

[13:17:12] [INFO] 🔄 Running Annotate replacement

[13:17:12] [INFO]   |-- 📋 Replacement complete (0 failed) [0.0s]

[13:17:12] [INFO] 🎉 Pipeline complete — 3 records processed, 0 total failures

🔍 Inspect¶

display_record() renders an interactive view with entity highlights.

In [6]:

result.display_record(0)

Anonymizer Preview (record 0)

Original

Bobby| first_name Watford| last_name, a 40| age‑year‑old Mexican| race_ethnicity veterinarian| occupation living in Denver| city, Colorado| state, grew up on the outskirts of the city and developed a love for animals early on. After graduating from Jefferson High| organization_name, he earned his DVM| degree at the University of Colorado Boulder| university, where he also completed a research stint in wildlife health. Fluent in English| language, Bobby| first_name has always described his upbringing as a blend of small‑town curiosity and the vibrant culture of his community, values that continue to shape his compassionate approach to animal care.

Since finishing his training, Bobby| first_name has worked at VCA Animal Hospital| company_name and later at the Colorado Veterinary Clinic| organization_name, where he now leads a busy mixed‑practice team. He identifies as a Christian Democrat| political_view and often volunteers at local shelters, a habit encouraged by his wife, Maya| first_name, and their two teenage children, Aria| first_name and Leo| first_name. Outside the clinic, Bobby| first_name enjoys hiking the Rockies with his family and mentoring veterinary students from his alma mater.

Replaced

<Bobby, first_name>| first_name <Watford, last_name>| last_name, a <40, age>| age‑year‑old <Mexican, race_ethnicity>| race_ethnicity <veterinarian, occupation>| occupation living in <Denver, city>| city, <Colorado, state>| state, grew up on the outskirts of the city and developed a love for animals early on. After graduating from <Jefferson High, organization_name>| organization_name, he earned his <DVM, degree>| degree at the <University of Colorado Boulder, university>| university, where he also completed a research stint in wildlife health. Fluent in <English, language>| language, <Bobby, first_name>| first_name has always described his upbringing as a blend of small‑town curiosity and the vibrant culture of his community, values that continue to shape his compassionate approach to animal care.

Since finishing his training, <Bobby, first_name>| first_name has worked at <VCA Animal Hospital, company_name>| company_name and later at the <Colorado Veterinary Clinic, organization_name>| organization_name, where he now leads a busy mixed‑practice team. He identifies as a <Christian Democrat, political_view>| political_view and often volunteers at local shelters, a habit encouraged by his wife, <Maya, first_name>| first_name, and their two teenage children, <Aria, first_name>| first_name and <Leo, first_name>| first_name. Outside the clinic, <Bobby, first_name>| first_name enjoys hiking the Rockies with his family and mentoring veterinary students from his alma mater.

Replacement Map

Original	Label	Replacement
Bobby	first_name	<Bobby, first_name>
Watford	last_name	<Watford, last_name>
40	age	<40, age>
Mexican	race_ethnicity	<Mexican, race_ethnicity>
veterinarian	occupation	<veterinarian, occupation>
Denver	city	<Denver, city>
Colorado	state	<Colorado, state>
Jefferson High	organization_name	<Jefferson High, organization_name>
DVM	degree	<DVM, degree>
University of Colorado Boulder	university	<University of Colorado Boulder, university>
English	language	<English, language>
VCA Animal Hospital	company_name	<VCA Animal Hospital, company_name>
Colorado Veterinary Clinic	organization_name	<Colorado Veterinary Clinic, organization_name>
Christian Democrat	political_view	<Christian Democrat, political_view>
Maya	first_name	<Maya, first_name>
Aria	first_name	<Aria, first_name>
Leo	first_name	<Leo, first_name>
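
The Annotate tags can also be pulled back out of the replaced text programmatically. The sketch below assumes the `<value, label>` notation shown in the Replacement Map above and that labels are lowercase snake_case; `extract_tags` and `strip_tags` are hypothetical helpers, not part of the Anonymizer API.

```python
import re

# Assumed notation, inferred from the Replacement Map: <original value, label>.
# The label is taken to be lowercase snake_case; everything before the last
# ", " inside the brackets is treated as the value.
TAG_PATTERN = re.compile(r"<([^<>]+), ([a-z_]+)>")


def extract_tags(annotated_text: str) -> list[tuple[str, str]]:
    """Return (value, label) pairs in order of appearance."""
    return [(m.group(1), m.group(2)) for m in TAG_PATTERN.finditer(annotated_text)]


def strip_tags(annotated_text: str) -> str:
    """Replace each tag with its original value to recover the untagged text."""
    return TAG_PATTERN.sub(lambda m: m.group(1), annotated_text)


sample = "<Bobby, first_name> <Watford, last_name>, a <40, age>-year-old <veterinarian, occupation>"
pairs = extract_tags(sample)
print(pairs)
print(strip_tags(sample))
```

Applied to a value from the `biography_replaced` column, this gives a quick round-trip check: the stripped text should match the original `biography` value.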

📋 Columns¶

trace_dataframe contains all internal columns from the pipeline (detection, validation, replacement, etc.).

In [7]:

df = result.trace_dataframe
print(f"Records: {len(df)}")
print(f"Columns: {list(df.columns)}")

Records: 3
Columns: ['biography', '_anonymizer_record_id', '_raw_detected_entities', '_seed_entities', '_tag_notation', '_seed_validation_candidates', '_seed_tagged_text', '_validated_entities', '_seed_entities_json', '_initial_tagged_text', '_validated_seed_entities', '_augmented_entities', '_merged_entities', '_merged_tagged_text', '_validation_candidates', '_detected_entities', 'biography_with_spans', 'final_entities', '_entities_by_value', '_replacement_map', 'biography_replaced']
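
Comparing two stage columns is one way to see what changed between stages, for example `_raw_detected_entities` against `_detected_entities`. The sketch below is a hypothetical helper that diffs two entity lists by value, label, and start position. It assumes both columns hold entity dicts with the keys printed later in this notebook; that is unverified for the raw column, so inspect one row before relying on it. The sample entities are made up for illustration.

```python
def entity_key(entity: dict) -> tuple:
    """Identity of an entity for comparison purposes."""
    return (entity["value"], entity["label"], entity["start_position"])


def diff_entities(before: list[dict], after: list[dict]) -> dict:
    """Report entities present in only one of the two lists."""
    before_keys = {entity_key(e) for e in before}
    after_keys = {entity_key(e) for e in after}
    return {
        "dropped": sorted(before_keys - after_keys),
        "added": sorted(after_keys - before_keys),
    }


# Made-up sample: one false positive removed, one entity added later.
raw_sample = [
    {"value": "Bobby", "label": "first_name", "start_position": 0},
    {"value": "Denver", "label": "city", "start_position": 60},
    {"value": "the city", "label": "city", "start_position": 99},
]
final_sample = [
    {"value": "Bobby", "label": "first_name", "start_position": 0},
    {"value": "Denver", "label": "city", "start_position": 60},
    {"value": "Jefferson High", "label": "organization_name", "start_position": 180},
]

changes = diff_entities(raw_sample, final_sample)
print("dropped:", changes["dropped"])
print("added:  ", changes["added"])
```

A reclassified entity shows up as one dropped and one added entry with the same value and position but different labels.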

🎯 Detected entities¶

Final entity list after validation. Each entity has value, label, positions, score, and source (detector / augmenter / name_split / propagation).

In [8]:

row_idx = 0
raw = df.loc[row_idx, "_detected_entities"]
entities = raw["entities"] if isinstance(raw, dict) else raw
print(f"Record {row_idx}: {len(entities)} entities detected\n")

entity_df = pd.DataFrame(entities)
if not entity_df.empty:
    cols = [c for c in ["value", "label", "start_position", "end_position", "source"] if c in entity_df.columns]
    print(entity_df[cols].to_string())

Record 0: 20 entities detected

                             value              label  start_position  end_position     source
0                            Bobby         first_name               0             5   detector
1                          Watford          last_name               6            13   detector
2                               40                age              17            19   detector
3                          Mexican     race_ethnicity              29            36   detector
4                     veterinarian         occupation              37            49   detector
5                           Denver               city              60            66   detector
6                         Colorado              state              68            76   detector
7                   Jefferson High  organization_name             180           194  augmenter
8                              DVM             degree             210           213   detector
9   University of Colorado Boulder         university             221           251  augmenter
10                         English           language             324           331   detector
11                           Bobby         first_name             333           338   detector
12                           Bobby         first_name             556           561   detector
13             VCA Animal Hospital       company_name             576           595   detector
14      Colorado Veterinary Clinic  organization_name             613           639  augmenter
15              Christian Democrat     political_view             707           725   detector
16                            Maya         first_name             798           802   detector
17                            Aria         first_name             836           840  augmenter
18                             Leo         first_name             845           848  augmenter
19                           Bobby         first_name             870           875   detector
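
The positions in this table read as zero-based character offsets with an exclusive end (Bobby spans 0 to 5). A quick sanity check when debugging boundary errors is to slice the source text with each span and compare it to the reported value. The sketch below uses a hypothetical `check_offsets` helper and the opening of record 0, with ASCII hyphens substituted for the original non-breaking ones; the last sample entity is deliberately shifted by one character to show a mismatch.

```python
def check_offsets(text: str, entities: list[dict]) -> list[dict]:
    """Return entities whose [start, end) slice does not equal their value."""
    mismatches = []
    for entity in entities:
        found = text[entity["start_position"]:entity["end_position"]]
        if found != entity["value"]:
            mismatches.append({**entity, "found": found})
    return mismatches


text = "Bobby Watford, a 40-year-old Mexican veterinarian living in Denver, Colorado"
sample_entities = [
    {"value": "Bobby", "start_position": 0, "end_position": 5},
    {"value": "Watford", "start_position": 6, "end_position": 13},
    {"value": "40", "start_position": 17, "end_position": 19},
    {"value": "Colorado", "start_position": 68, "end_position": 76},
    # Deliberately off by one, to illustrate what a boundary error looks like.
    {"value": "Denver", "start_position": 61, "end_position": 67},
]

mismatches = check_offsets(text, sample_entities)
for m in mismatches:
    print(f"expected {m['value']!r}, text has {m['found']!r} at {m['start_position']}-{m['end_position']}")
```

On real data, pass `df.loc[row_idx, "biography"]` and the `entities` list from the cell above.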

🏷️ Labels¶

Entity label distribution across all records -- which types are most common.

In [9]:

label_counts = Counter()
for raw in df["_detected_entities"]:
    entity_list = raw["entities"] if isinstance(raw, dict) else raw
    for entity in entity_list:
        label_counts[entity["label"]] += 1

for label, count in label_counts.most_common():
    print(f"  {label}: {count}")

  first_name: 23
  organization_name: 7
  state: 6
  age: 5
  occupation: 5
  city: 5
  last_name: 3
  race_ethnicity: 3
  language: 3
  company_name: 3
  political_view: 3
  education_level: 3
  street_address: 2
  degree: 1
  university: 1
  date_of_birth: 1
  field_of_study: 1
  employment_status: 1
  religious_belief: 1
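
Totals across all records can hide a single record that dominates a label. A per-record breakdown, sketched below with a hypothetical `label_counts_per_record` helper, makes that visible. It accepts any iterable shaped like the `_detected_entities` column (a dict with an `entities` key, or a plain list); the sample rows are abbreviated for illustration.

```python
from collections import Counter


def label_counts_per_record(column) -> list[Counter]:
    """Return one Counter of labels per record, in record order."""
    per_record = []
    for raw in column:
        entity_list = raw["entities"] if isinstance(raw, dict) else raw
        per_record.append(Counter(entity["label"] for entity in entity_list))
    return per_record


# Abbreviated sample rows; both accepted shapes are shown.
sample_column = [
    {"entities": [
        {"value": "Bobby", "label": "first_name"},
        {"value": "Bobby", "label": "first_name"},
        {"value": "Denver", "label": "city"},
    ]},
    [{"value": "Maya", "label": "first_name"}],
]

counts = label_counts_per_record(sample_column)
for record_idx, record_counts in enumerate(counts):
    print(f"record {record_idx}: {dict(record_counts)}")
```

With the real data, call `label_counts_per_record(df["_detected_entities"])`.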

📡 Sources¶

Where each entity came from in the pipeline:
- detector -- GLiNER NER
- augmenter -- LLM-added (missed by GLiNER)
- validator -- LLM decision step over detector-seed entities (keep/reclass/drop); does not emit a separate source value
- name_split -- derived from splitting full names
- propagation -- expanded from validated entities to all text occurrences

In [10]:

source_counts = Counter()
for raw in df["_detected_entities"]:
    entity_list = raw["entities"] if isinstance(raw, dict) else raw
    for entity in entity_list:
        source_counts[entity.get("source", "unknown")] += 1

for source, count in source_counts.most_common():
    print(f"  {source}: {count}")

  detector: 68
  augmenter: 9
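
Breaking the source counts down by label shows which entity types the detector tends to miss. The sketch below computes, per label, the share of entities that came from a given source. `source_share_by_label` is a hypothetical helper, and the sample is a subset of record 0 from the table above.

```python
from collections import Counter


def source_share_by_label(entities: list[dict], source: str = "augmenter") -> dict:
    """Return {label: fraction of that label's entities with the given source}."""
    totals = Counter()
    hits = Counter()
    for entity in entities:
        totals[entity["label"]] += 1
        if entity.get("source", "unknown") == source:
            hits[entity["label"]] += 1
    return {label: hits[label] / totals[label] for label in totals}


# Subset of record 0: (label, source) pairs taken from the entity table.
sample_pairs = (
    [("first_name", "detector")] * 5
    + [("first_name", "augmenter")] * 2
    + [("organization_name", "augmenter")] * 2
    + [("city", "detector")]
)
sample_entities = [{"label": label, "source": source} for label, source in sample_pairs]

share = source_share_by_label(sample_entities)
for label, fraction in sorted(share.items(), key=lambda item: -item[1]):
    print(f"  {label}: {fraction:.0%} from augmenter")
```

A label with a high augmenter share is a candidate for label or threshold tuning on the detector side.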

📊 By value¶

Entities grouped by unique value -- this is what drives consistent replacement downstream (same name always maps to the same substitute).

In [11]:

row_idx = 0
raw_bv = df.loc[row_idx, "_entities_by_value"]
by_value = raw_bv["entities_by_value"] if isinstance(raw_bv, dict) else raw_bv
print(f"Record {row_idx}: {len(by_value)} unique entity values\n")

for entry in by_value:
    print(f"  {entry['value']!r} -> labels: {entry['labels']}")

Record 0: 17 unique entity values

  '40' -> labels: ['age']
  'Aria' -> labels: ['first_name']
  'Bobby' -> labels: ['first_name']
  'Christian Democrat' -> labels: ['political_view']
  'Colorado' -> labels: ['state']
  'Colorado Veterinary Clinic' -> labels: ['organization_name']
  'DVM' -> labels: ['degree']
  'Denver' -> labels: ['city']
  'English' -> labels: ['language']
  'Jefferson High' -> labels: ['organization_name']
  'Leo' -> labels: ['first_name']
  'Maya' -> labels: ['first_name']
  'Mexican' -> labels: ['race_ethnicity']
  'University of Colorado Boulder' -> labels: ['university']
  'VCA Animal Hospital' -> labels: ['company_name']
  'Watford' -> labels: ['last_name']
  'veterinarian' -> labels: ['occupation']
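
Since `labels` is a list, a value can presumably carry more than one label. Such values are worth reviewing because they may indicate conflicting detections. The sketch below filters for them; `ambiguous_values` is a hypothetical helper, and the `Jordan` entry is made up for illustration (no value in record 0 has multiple labels).

```python
def ambiguous_values(by_value: list[dict]) -> dict:
    """Return {value: labels} for values tagged with more than one distinct label."""
    return {
        entry["value"]: entry["labels"]
        for entry in by_value
        if len(set(entry["labels"])) > 1
    }


sample_by_value = [
    {"value": "Bobby", "labels": ["first_name"]},
    {"value": "Colorado", "labels": ["state"]},
    # Made-up entry, not from the dataset: the same string tagged two ways.
    {"value": "Jordan", "labels": ["first_name", "country"]},
]

ambiguous = ambiguous_values(sample_by_value)
print(ambiguous)
```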

❌ Failures¶

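Records dropped from the pipeline, typically during detection (LLM timeout, parse error, etc.). Each entry reports the record ID, the step that failed, and the reason.
Check this to understand data loss in your pipeline.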

In [12]:

if result.failed_records:
    for fr in result.failed_records:
        print(f"  record_id={fr.record_id}, step={fr.step}, reason={fr.reason}")
else:
    print("No failed records.")

No failed records.
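
On larger runs, grouping failures by step and reason is quicker to triage than a flat list. The sketch below relies only on the `step` and `reason` attributes used in the cell above. `FailedRecordStub` is a hypothetical stand-in for the library's failed-record objects, and the step and reason strings are invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FailedRecordStub:
    """Hypothetical stand-in exposing the attributes printed above."""
    record_id: str
    step: str
    reason: str


def summarize_failures(failed_records) -> Counter:
    """Count failures per (step, reason) pair."""
    return Counter((fr.step, fr.reason) for fr in failed_records)


# Invented sample values; real step names and reasons may differ.
sample_failures = [
    FailedRecordStub("r1", "detection", "timeout"),
    FailedRecordStub("r2", "detection", "timeout"),
    FailedRecordStub("r3", "detection", "parse error"),
]

summary = summarize_failures(sample_failures)
for (step, reason), count in summary.most_common():
    print(f"  {step} / {reason}: {count}")
```

With a real result, pass `result.failed_records` directly.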

📊 (Optional) Score the detections with an LLM judge¶

evaluate() is a separate, opt-in step that runs LLM-as-judge metrics on the output.
This notebook uses Annotate, so only Detection Validity runs -- it flags entities the detector got wrong (false positives, mislabels, boundary errors). Switching to Substitute would also enable Type Fidelity, Relational Consistency, and Attribute Fidelity.

In [ ]:

evaluated = anonymizer.evaluate(result)
evaluated.display_record(0)

⏭️ Next steps¶

- 🕵️ Your First Anonymization -- the simplest end-to-end replace workflow if you haven't run it yet.
- 🎯 Choosing a Replacement Strategy -- compare Redact, Annotate, Hash, and Substitute side-by-side.
- ✏️ Rewriting Biographies -- generate privacy-safe paraphrases instead of token-level replacements.