🕵️ Inspecting Detected Entities¶
Dig into the entity detection pipeline output -- what was detected, what the LLM validator kept or dropped, and where entities appear in the text.
This notebook is for users who need to debug detection quality, tune labels and/or thresholds, or investigate downstream replacement or rewriting results.
We use Annotate mode because it preserves the original text while tagging each entity with its label, making it ideal for reviewing detection quality.
📚 What you'll learn¶
- Run the detection pipeline and inspect its output using Annotate mode
- View tagged text with entities marked inline
- Break down detected entities by label, source, and unique value
- Identify and triage failed records
Tip: First time running notebooks? Start with setup instructions.
⚙️ Setup¶
- Check that your `NVIDIA_API_KEY` from build.nvidia.com is registered for model access.
  - Treat the default build.nvidia.com setup as a convenient experimentation path. For privacy-sensitive or production data, switch to a secure endpoint you trust and to which you are comfortable sending data.
  - Request/token rate limits on build.nvidia.com vary by account and model access, and lower-volume development access can be slow for full runs. Start with `preview()` on a small sample.
- Import the core classes -- this notebook uses `Annotate` to keep original values visible. `Anonymizer()` initializes with the default model provider -- no extra config needed. `Anonymizer.configure_logging()` controls verbosity -- switch to `Anonymizer.configure_logging(LoggingConfig.debug())` when troubleshooting.
In [1]:
import getpass
import os
from collections import Counter
import pandas as pd
if not os.getenv("NVIDIA_API_KEY"):
key = getpass.getpass("Enter NVIDIA_API_KEY from build.nvidia.com: ").strip()
if not key:
raise RuntimeError("NVIDIA_API_KEY is required to run these notebooks.")
os.environ["NVIDIA_API_KEY"] = key
In [2]:
from anonymizer import Annotate, Anonymizer, AnonymizerConfig, AnonymizerInput
In [3]:
anonymizer = Anonymizer()
[13:39:37] [INFO] 🔧 Anonymizer initialized with 3 model configs
[13:39:37] [INFO] |-- 🔎 detector: gliner-pii-detector
[13:39:37] [INFO] |-- ✅ validator: gpt-oss-120b
[13:39:37] [INFO] |-- 🧩 augmenter: gpt-oss-120b
👁️ Preview¶
- Detection runs as part of any strategy. `Annotate` keeps original text visible alongside entity labels -- ideal for debugging.
- `trace_dataframe` exposes every internal pipeline column; that's what we explore below.
In [4]:
config = AnonymizerConfig(replace=Annotate())
input_data = AnonymizerInput(
source="https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv",
text_column="biography",
data_summary="Biographical profiles",
)
result = anonymizer.preview(
config=config,
data=input_data,
num_records=3,
)
[13:39:37] [INFO] 📂 Loaded 25 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv (column: 'biography')
[13:39:37] [INFO] detection labels in scope: (default: 65 labels; see anonymizer.DEFAULT_ENTITY_LABELS for list)
[13:39:37] [INFO] |-- 👀 Preview mode: processing 3 of 25 records
[13:39:37] [INFO] 🔍 Running entity detection on 3 records
[13:40:17] [INFO] |-- 📋 Detection complete — 79 entities found across 3 records (0 failed) [40.1s]
[13:40:17] [INFO] |-- labels: first_name=22, organization_name=6, age=5, occupation=5, city=4, state=4, degree=4, university=4, field_of_study=4, last_name=3, race_ethnicity=3, political_view=3, language=2, religious_belief=2, street_address=2, education_level=1, age_group=1, place_name=1, date_of_birth=1, employment_status=1, company_name=1
[13:40:17] [INFO] 🔄 Running Annotate replacement
[13:40:17] [INFO] |-- 📋 Replacement complete (0 failed) [0.0s]
[13:40:17] [INFO] 🎉 Pipeline complete — 3 records processed, 0 total failures
🔍 Inspect¶
- `display_record()` renders an interactive view with entity highlights.
In [5]:
result.display_record(0)
📋 Columns¶
- `trace_dataframe` contains all internal columns from the pipeline (detection, validation, replacement, etc.).
In [6]:
df = result.trace_dataframe
print(f"Records: {len(df)}")
print(f"Columns: {list(df.columns)}")
Records: 3
Columns: ['biography', '_anonymizer_record_id', '_raw_detected_entities', '_seed_entities', '_initial_tagged_text', '_seed_entities_json', '_tag_notation', '_merged_tagged_text', '_validation_candidates', '_validated_entities', '_augmented_entities', '_merged_entities', '_detected_entities', 'biography_with_spans', 'final_entities', '_entities_by_value', '_replacement_map', 'biography_replaced']
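The stage columns above (`_raw_detected_entities`, `_validated_entities`, `_detected_entities`, ...) make it possible to see where entities are gained or lost along the pipeline. A minimal sketch, assuming each trace cell holds either a list of entity dicts or a dict with an `entities` key (the same access pattern used throughout this notebook); the `demo` frame is synthetic stand-in data, not real pipeline output:

```python
import pandas as pd

def entity_count(raw):
    """Number of entities in a trace cell.

    Cells hold either a list of entity dicts or a dict with an
    'entities' key, as seen in the outputs in this notebook.
    """
    if raw is None:
        return 0
    entities = raw["entities"] if isinstance(raw, dict) else raw
    return len(entities)

def stage_counts(df, columns):
    """Per-record entity counts for each pipeline stage column present in df."""
    present = [c for c in columns if c in df.columns]
    return pd.DataFrame({c: df[c].map(entity_count) for c in present})

# Synthetic stand-in -- with real data, pass result.trace_dataframe instead.
demo = pd.DataFrame({
    "_raw_detected_entities": [[{"value": "Bobby", "label": "first_name"},
                                {"value": "Denver", "label": "city"}]],
    "_detected_entities": [{"entities": [{"value": "Bobby", "label": "first_name"}]}],
})
print(stage_counts(demo, ["_raw_detected_entities", "_validated_entities", "_detected_entities"]))
```

A record whose count drops sharply between the raw and final columns is a good candidate for a closer look at the validator's decisions.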
🎯 Detected entities¶
- Final entity list after validation. Each entity has `value`, `label`, positions, `score`, and `source` (detector / augmenter / name_split / propagation).
In [7]:
row_idx = 0
raw = df.loc[row_idx, "_detected_entities"]
entities = raw["entities"] if isinstance(raw, dict) else raw
print(f"Record {row_idx}: {len(entities)} entities detected\n")
entity_df = pd.DataFrame(entities)
if not entity_df.empty:
cols = [c for c in ["value", "label", "start_position", "end_position", "source"] if c in entity_df.columns]
print(entity_df[cols].to_string())
Record 0: 22 entities detected
value label start_position end_position source
0 Bobby first_name 0 5 detector
1 Watford last_name 6 13 detector
2 40 age 17 19 detector
3 Mexican race_ethnicity 29 36 detector
4 veterinarian occupation 37 49 detector
5 Denver city 60 66 detector
6 Colorado state 68 76 detector
7 Jefferson High education_level 180 194 detector
8 DVM degree 210 213 detector
9 University of Colorado Boulder university 221 251 detector
10 wildlife health field_of_study 297 312 detector
11 English language 324 331 detector
12 Bobby first_name 333 338 detector
13 Bobby first_name 556 561 detector
14 VCA Animal Hospital organization_name 576 595 detector
15 Colorado Veterinary Clinic organization_name 613 639 detector
16 Christian Democrat political_view 707 725 detector
17 Maya first_name 798 802 detector
18 teenage age_group 818 825 augmenter
19 Aria and Leo first_name 836 848 detector
20 Bobby first_name 870 875 detector
21 Rockies place_name 894 901 detector
🏷️ Labels¶
- Entity label distribution across all records -- which types are most common.
In [8]:
label_counts = Counter()
for raw in df["_detected_entities"]:
entity_list = raw["entities"] if isinstance(raw, dict) else raw
for entity in entity_list:
label_counts[entity["label"]] += 1
for label, count in label_counts.most_common():
print(f" {label}: {count}")
 first_name: 22
 organization_name: 6
 age: 5
 occupation: 5
 city: 4
 state: 4
 degree: 4
 university: 4
 field_of_study: 4
 last_name: 3
 race_ethnicity: 3
 political_view: 3
 language: 2
 religious_belief: 2
 street_address: 2
 education_level: 1
 age_group: 1
 place_name: 1
 date_of_birth: 1
 employment_status: 1
 company_name: 1
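Singleton labels are often worth manual review -- in the run above, `education_level` appears exactly once, on "Jefferson High", which looks more like a school name than an education level. A minimal sketch for surfacing such labels, assuming the same cell shape as the loop above (`cells` here is synthetic stand-in data):

```python
from collections import Counter

def rare_labels(cells, max_count=1):
    """Labels whose corpus-wide count is at or below max_count."""
    counts = Counter()
    for raw in cells:
        # Cells hold a list of entity dicts or a dict with an 'entities' key.
        entities = raw["entities"] if isinstance(raw, dict) else raw
        counts.update(entity["label"] for entity in entities)
    return sorted(label for label, count in counts.items() if count <= max_count)

# Synthetic stand-in for df["_detected_entities"].
cells = [
    [{"value": "Bobby", "label": "first_name"},
     {"value": "Denver", "label": "city"}],
    [{"value": "Maya", "label": "first_name"}],
]
print(rare_labels(cells))  # ['city']
```

Against the real `df["_detected_entities"]` column, this would flag the long tail of one-off labels at the bottom of the distribution above.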
📡 Sources¶
- Where each entity came from in the pipeline:
  - `detector` -- GLiNER NER
  - `augmenter` -- LLM-added (missed by GLiNER)
  - `validator` -- LLM decision step over detector-seed entities (keep/reclass/drop); does not emit a separate source value
  - `name_split` -- derived from splitting full names
  - `propagation` -- expanded from validated entities to all text occurrences
In [9]:
source_counts = Counter()
for raw in df["_detected_entities"]:
entity_list = raw["entities"] if isinstance(raw, dict) else raw
for entity in entity_list:
source_counts[entity.get("source", "unknown")] += 1
for source, count in source_counts.most_common():
print(f" {source}: {count}")
 detector: 78
 augmenter: 1
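Beyond overall totals, crossing label against source shows which entity types GLiNER tends to miss and the augmenter has to fill in. A minimal sketch using `pandas.crosstab`, assuming the same cell shape as above (`cells` is synthetic stand-in data):

```python
import pandas as pd

def flatten_entities(cells):
    """One row per entity across all records.

    Cells hold a list of entity dicts or a dict with an 'entities' key,
    as in the trace cells used in this notebook.
    """
    rows = []
    for raw in cells:
        entities = raw["entities"] if isinstance(raw, dict) else raw
        rows.extend(entities)
    return pd.DataFrame(rows)

# Synthetic stand-in for df["_detected_entities"].
cells = [
    [{"value": "Bobby", "label": "first_name", "source": "detector"},
     {"value": "teenage", "label": "age_group", "source": "augmenter"},
     {"value": "Maya", "label": "first_name", "source": "detector"}],
]
flat = flatten_entities(cells)
print(pd.crosstab(flat["label"], flat["source"]))
```

In the run above, such a table would show that only `age_group` ("teenage") came from the augmenter.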
📊 By value¶
- Entities grouped by unique value -- this is what drives consistent replacement downstream (same name always maps to the same substitute).
In [10]:
row_idx = 0
raw_bv = df.loc[row_idx, "_entities_by_value"]
by_value = raw_bv["entities_by_value"] if isinstance(raw_bv, dict) else raw_bv
print(f"Record {row_idx}: {len(by_value)} unique entity values\n")
for entry in by_value:
print(f" {entry['value']!r} -> labels: {entry['labels']}")
Record 0: 19 unique entity values

 '40' -> labels: ['age']
 'Aria and Leo' -> labels: ['first_name']
 'Bobby' -> labels: ['first_name']
 'Christian Democrat' -> labels: ['political_view']
 'Colorado' -> labels: ['state']
 'Colorado Veterinary Clinic' -> labels: ['organization_name']
 'DVM' -> labels: ['degree']
 'Denver' -> labels: ['city']
 'English' -> labels: ['language']
 'Jefferson High' -> labels: ['education_level']
 'Maya' -> labels: ['first_name']
 'Mexican' -> labels: ['race_ethnicity']
 'Rockies' -> labels: ['place_name']
 'University of Colorado Boulder' -> labels: ['university']
 'VCA Animal Hospital' -> labels: ['organization_name']
 'Watford' -> labels: ['last_name']
 'teenage' -> labels: ['age_group']
 'veterinarian' -> labels: ['occupation']
 'wildlife health' -> labels: ['field_of_study']
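The grouping above is per record. To check label consistency across the whole corpus -- a value that picks up different labels in different records signals instability that would affect consistent replacement -- the per-record cells can be merged. A minimal sketch, assuming the entry shape shown above (`value` plus a `labels` list, optionally wrapped in a dict with an `entities_by_value` key as handled in the cell above); `cells` is synthetic stand-in data:

```python
from collections import defaultdict

def global_value_labels(cells):
    """Merge per-record _entities_by_value cells into a corpus-wide
    value -> sorted label list map."""
    merged = defaultdict(set)
    for raw in cells:
        entries = raw["entities_by_value"] if isinstance(raw, dict) else raw
        for entry in entries:
            merged[entry["value"]].update(entry["labels"])
    return {value: sorted(labels) for value, labels in merged.items()}

# Synthetic stand-in for df["_entities_by_value"].
cells = [
    [{"value": "Bobby", "labels": ["first_name"]}],
    [{"value": "Bobby", "labels": ["first_name"]},
     {"value": "Denver", "labels": ["city"]}],
]
print(global_value_labels(cells))  # {'Bobby': ['first_name'], 'Denver': ['city']}
```

Any value whose merged label list has more than one entry is worth inspecting before relying on downstream replacement.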
❌ Failures¶
- Records dropped during detection (LLM timeout, parse error, etc.).
- Check this to understand data loss in your pipeline.
In [11]:
if result.failed_records:
for fr in result.failed_records:
print(f" record_id={fr.record_id}, step={fr.step}, reason={fr.reason}")
else:
print("No failed records.")
No failed records.
⏭️ Next steps¶
- 🕵️ Your First Anonymization -- the simplest end-to-end replace workflow if you haven't run it yet.
- 🎯 Choosing a Replacement Strategy -- compare Redact, Annotate, Hash, and Substitute side-by-side.
- ✏️ Rewriting Biographies -- generate privacy-safe paraphrases instead of token-level replacements.