🕵️ Inspecting Detected Entities¶
Dig into the entity detection pipeline output -- what was detected, what the LLM validator kept or dropped, and where entities appear in the text.
This notebook is for users who need to debug detection quality, tune labels and/or thresholds, or investigate downstream replacement or rewriting results.
We use Annotate mode because it preserves the original text while tagging each entity with its label, making it ideal for reviewing detection quality.
📚 What you'll learn¶
- Run the detection pipeline and inspect its output using Annotate mode
- View tagged text with entities marked inline
- Break down detected entities by label, source, and unique value
- Identify and triage failed records
Tip: First time running notebooks? Start with setup instructions.
⚙️ Setup¶
- Check that your `NVIDIA_API_KEY` from build.nvidia.com is registered for model access.
  - Treat the default build.nvidia.com setup as a convenient experimentation path. For privacy-sensitive or production data, switch to a secure endpoint you trust and to which you are comfortable sending data.
  - Request/token rate limits on build.nvidia.com vary by account and model access, and lower-volume development access can be slow for full runs. Start with `preview()` on a small sample.
- Import the core classes -- this notebook uses `Annotate` to keep original values visible. `Anonymizer()` initializes with the default model provider -- no extra config needed. `Anonymizer.configure_logging()` controls verbosity -- switch to `Anonymizer.configure_logging(LoggingConfig.debug())` when troubleshooting.
In [1]:
import getpass
import os
from collections import Counter
import pandas as pd
if not os.getenv("NVIDIA_API_KEY"):
key = getpass.getpass("Enter NVIDIA_API_KEY from build.nvidia.com: ").strip()
if not key:
raise RuntimeError("NVIDIA_API_KEY is required to run these notebooks.")
os.environ["NVIDIA_API_KEY"] = key
In [2]:
from anonymizer import Annotate, Anonymizer, AnonymizerConfig, AnonymizerInput
In [3]:
anonymizer = Anonymizer()
[13:39:37] [INFO] 🔧 Anonymizer initialized with 3 model configs
[13:39:37] [INFO] |-- 🔎 detector: gliner-pii-detector
[13:39:37] [INFO] |-- ✅ validator: gpt-oss-120b
[13:39:37] [INFO] |-- 🧩 augmenter: gpt-oss-120b
👁️ Preview¶
- Detection runs as part of any strategy. `Annotate` keeps original text visible alongside entity labels -- ideal for debugging.
- `trace_dataframe` exposes every internal pipeline column; that's what we explore below.
In [4]:
config = AnonymizerConfig(replace=Annotate())
input_data = AnonymizerInput(
source="https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv",
text_column="biography",
data_summary="Biographical profiles",
)
result = anonymizer.preview(
config=config,
data=input_data,
num_records=3,
)
[13:39:37] [INFO] 📂 Loaded 25 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv (column: 'biography')
[13:39:37] [INFO] detection labels in scope: (default: 65 labels; see anonymizer.DEFAULT_ENTITY_LABELS for list)
[13:39:37] [INFO] |-- 👀 Preview mode: processing 3 of 25 records
[13:39:37] [INFO] 🔍 Running entity detection on 3 records
[13:40:17] [INFO] |-- 📋 Detection complete — 79 entities found across 3 records (0 failed) [40.1s]
[13:40:17] [INFO] |-- labels: first_name=22, organization_name=6, age=5, occupation=5, city=4, state=4, degree=4, university=4, field_of_study=4, last_name=3, race_ethnicity=3, political_view=3, language=2, religious_belief=2, street_address=2, education_level=1, age_group=1, place_name=1, date_of_birth=1, employment_status=1, company_name=1
[13:40:17] [INFO] 🔄 Running Annotate replacement
[13:40:17] [INFO] |-- 📋 Replacement complete (0 failed) [0.0s]
[13:40:17] [INFO] 🎉 Pipeline complete — 3 records processed, 0 total failures
🔍 Inspect¶
- `display_record()` renders an interactive view with entity highlights.
In [5]:
result.display_record(0)
📋 Columns¶
- `trace_dataframe` contains all internal columns from the pipeline (detection, validation, replacement, etc.).
In [6]:
df = result.trace_dataframe
print(f"Records: {len(df)}")
print(f"Columns: {list(df.columns)}")
Records: 3
Columns: ['biography', '_anonymizer_record_id', '_raw_detected_entities', '_seed_entities', '_initial_tagged_text', '_seed_entities_json', '_tag_notation', '_merged_tagged_text', '_validation_candidates', '_validated_entities', '_augmented_entities', '_merged_entities', '_detected_entities', 'biography_with_spans', 'final_entities', '_entities_by_value', '_replacement_map', 'biography_replaced']
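The stage columns above (`_raw_detected_entities`, `_validated_entities`, `_detected_entities`, ...) make it possible to see where entities are gained or lost along the pipeline. A minimal sketch, assuming each trace cell holds either a list of entity dicts or a dict with an `entities` key (the same access pattern used throughout this notebook); the `demo` frame is synthetic stand-in data, not real pipeline output:

```python
import pandas as pd

def entity_count(raw):
    """Number of entities in a trace cell.

    Cells hold either a list of entity dicts or a dict with an
    'entities' key, as seen in the outputs in this notebook.
    """
    if raw is None:
        return 0
    entities = raw["entities"] if isinstance(raw, dict) else raw
    return len(entities)

def stage_counts(df, columns):
    """Per-record entity counts for each pipeline stage column present in df."""
    present = [c for c in columns if c in df.columns]
    return pd.DataFrame({c: df[c].map(entity_count) for c in present})

# Synthetic stand-in -- with real data, pass result.trace_dataframe instead.
demo = pd.DataFrame({
    "_raw_detected_entities": [[{"value": "Bobby", "label": "first_name"},
                                {"value": "Denver", "label": "city"}]],
    "_detected_entities": [{"entities": [{"value": "Bobby", "label": "first_name"}]}],
})
print(stage_counts(demo, ["_raw_detected_entities", "_validated_entities", "_detected_entities"]))
```

A record whose count drops sharply between the raw and final columns is a good candidate for a closer look at the validator's decisions.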
🎯 Detected entities¶
- Final entity list after validation. Each entity has `value`, `label`, positions, `score`, and `source` (detector / augmenter / name_split / propagation).
In [7]:
row_idx = 0
raw = df.loc[row_idx, "_detected_entities"]
entities = raw["entities"] if isinstance(raw, dict) else raw
print(f"Record {row_idx}: {len(entities)} entities detected\n")
entity_df = pd.DataFrame(entities)
if not entity_df.empty:
cols = [c for c in ["value", "label", "start_position", "end_position", "source"] if c in entity_df.columns]
print(entity_df[cols].to_string())
Record 0: 22 entities detected
value label start_position end_position source
0 Bobby first_name 0 5 detector
1 Watford last_name 6 13 detector
2 40 age 17 19 detector
3 Mexican race_ethnicity 29 36 detector
4 veterinarian occupation 37 49 detector
5 Denver city 60 66 detector
6 Colorado state 68 76 detector
7 Jefferson High education_level 180 194 detector
8 DVM degree 210 213 detector
9 University of Colorado Boulder university 221 251 detector
10 wildlife health field_of_study 297 312 detector
11 English language 324 331 detector
12 Bobby first_name 333 338 detector
13 Bobby first_name 556 561 detector
14 VCA Animal Hospital organization_name 576 595 detector
15 Colorado Veterinary Clinic organization_name 613 639 detector
16 Christian Democrat political_view 707 725 detector
17 Maya first_name 798 802 detector
18 teenage age_group 818 825 augmenter
19 Aria and Leo first_name 836 848 detector
20 Bobby first_name 870 875 detector
21 Rockies place_name 894 901 detector
🏷️ Labels¶
- Entity label distribution across all records -- which types are most common.
In [8]:
label_counts = Counter()
for raw in df["_detected_entities"]:
entity_list = raw["entities"] if isinstance(raw, dict) else raw
for entity in entity_list:
label_counts[entity["label"]] += 1
for label, count in label_counts.most_common():
print(f" {label}: {count}")
 first_name: 22
 organization_name: 6
 age: 5
 occupation: 5
 city: 4
 state: 4
 degree: 4
 university: 4
 field_of_study: 4
 last_name: 3
 race_ethnicity: 3
 political_view: 3
 language: 2
 religious_belief: 2
 street_address: 2
 education_level: 1
 age_group: 1
 place_name: 1
 date_of_birth: 1
 employment_status: 1
 company_name: 1
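Singleton labels are often worth manual review -- in the run above, `education_level` appears exactly once, on "Jefferson High", which looks more like a school name than an education level. A minimal sketch for surfacing such labels, assuming the same cell shape as the loop above (`cells` here is synthetic stand-in data):

```python
from collections import Counter

def rare_labels(cells, max_count=1):
    """Labels whose corpus-wide count is at or below max_count."""
    counts = Counter()
    for raw in cells:
        # Cells hold a list of entity dicts or a dict with an 'entities' key.
        entities = raw["entities"] if isinstance(raw, dict) else raw
        counts.update(entity["label"] for entity in entities)
    return sorted(label for label, count in counts.items() if count <= max_count)

# Synthetic stand-in for df["_detected_entities"].
cells = [
    [{"value": "Bobby", "label": "first_name"},
     {"value": "Denver", "label": "city"}],
    [{"value": "Maya", "label": "first_name"}],
]
print(rare_labels(cells))  # ['city']
```

Against the real `df["_detected_entities"]` column, this would flag the long tail of one-off labels at the bottom of the distribution above.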
📡 Sources¶
- Where each entity came from in the pipeline:
  - `detector` -- GLiNER NER
  - `augmenter` -- LLM-added (missed by GLiNER)
  - `validator` -- LLM decision step over detector-seed entities (keep/reclass/drop); does not emit a separate source value
  - `name_split` -- derived from splitting full names
  - `propagation` -- expanded from validated entities to all text occurrences
In [9]:
source_counts = Counter()
for raw in df["_detected_entities"]:
entity_list = raw["entities"] if isinstance(raw, dict) else raw
for entity in entity_list:
source_counts[entity.get("source", "unknown")] += 1
for source, count in source_counts.most_common():
print(f" {source}: {count}")
 detector: 78
 augmenter: 1
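Beyond overall totals, crossing label against source shows which entity types GLiNER tends to miss and the augmenter has to fill in. A minimal sketch using `pandas.crosstab`, assuming the same cell shape as above (`cells` is synthetic stand-in data):

```python
import pandas as pd

def flatten_entities(cells):
    """One row per entity across all records.

    Cells hold a list of entity dicts or a dict with an 'entities' key,
    as in the trace cells used in this notebook.
    """
    rows = []
    for raw in cells:
        entities = raw["entities"] if isinstance(raw, dict) else raw
        rows.extend(entities)
    return pd.DataFrame(rows)

# Synthetic stand-in for df["_detected_entities"].
cells = [
    [{"value": "Bobby", "label": "first_name", "source": "detector"},
     {"value": "teenage", "label": "age_group", "source": "augmenter"},
     {"value": "Maya", "label": "first_name", "source": "detector"}],
]
flat = flatten_entities(cells)
print(pd.crosstab(flat["label"], flat["source"]))
```

In the run above, such a table would show that only `age_group` ("teenage") came from the augmenter.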
📊 By value¶
- Entities grouped by unique value -- this is what drives consistent replacement downstream (same name always maps to the same substitute).
In [10]:
row_idx = 0
raw_bv = df.loc[row_idx, "_entities_by_value"]
by_value = raw_bv["entities_by_value"] if isinstance(raw_bv, dict) else raw_bv
print(f"Record {row_idx}: {len(by_value)} unique entity values\n")
for entry in by_value:
print(f" {entry['value']!r} -> labels: {entry['labels']}")
Record 0: 19 unique entity values

 '40' -> labels: ['age']
 'Aria and Leo' -> labels: ['first_name']
 'Bobby' -> labels: ['first_name']
 'Christian Democrat' -> labels: ['political_view']
 'Colorado' -> labels: ['state']
 'Colorado Veterinary Clinic' -> labels: ['organization_name']
 'DVM' -> labels: ['degree']
 'Denver' -> labels: ['city']
 'English' -> labels: ['language']
 'Jefferson High' -> labels: ['education_level']
 'Maya' -> labels: ['first_name']
 'Mexican' -> labels: ['race_ethnicity']
 'Rockies' -> labels: ['place_name']
 'University of Colorado Boulder' -> labels: ['university']
 'VCA Animal Hospital' -> labels: ['organization_name']
 'Watford' -> labels: ['last_name']
 'teenage' -> labels: ['age_group']
 'veterinarian' -> labels: ['occupation']
 'wildlife health' -> labels: ['field_of_study']
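The grouping above is per record. To check label consistency across the whole corpus -- a value that picks up different labels in different records signals instability that would affect consistent replacement -- the per-record cells can be merged. A minimal sketch, assuming the entry shape shown above (`value` plus a `labels` list, optionally wrapped in a dict with an `entities_by_value` key as handled in the cell above); `cells` is synthetic stand-in data:

```python
from collections import defaultdict

def global_value_labels(cells):
    """Merge per-record _entities_by_value cells into a corpus-wide
    value -> sorted label list map."""
    merged = defaultdict(set)
    for raw in cells:
        entries = raw["entities_by_value"] if isinstance(raw, dict) else raw
        for entry in entries:
            merged[entry["value"]].update(entry["labels"])
    return {value: sorted(labels) for value, labels in merged.items()}

# Synthetic stand-in for df["_entities_by_value"].
cells = [
    [{"value": "Bobby", "labels": ["first_name"]}],
    [{"value": "Bobby", "labels": ["first_name"]},
     {"value": "Denver", "labels": ["city"]}],
]
print(global_value_labels(cells))  # {'Bobby': ['first_name'], 'Denver': ['city']}
```

Any value whose merged label list has more than one entry is worth inspecting before relying on downstream replacement.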
❌ Failures¶
- Records dropped during detection (LLM timeout, parse error, etc.).
- Check this to understand data loss in your pipeline.
In [11]:
if result.failed_records:
for fr in result.failed_records:
print(f" record_id={fr.record_id}, step={fr.step}, reason={fr.reason}")
else:
print("No failed records.")
No failed records.
⏭️ Next steps¶
- 🕵️ Your First Anonymization -- the simplest end-to-end replace workflow if you haven't run it yet.
- 🎯 Choosing a Replacement Strategy -- compare Redact, Annotate, Hash, and Substitute side-by-side.
- ✏️ Rewriting Biographies -- generate privacy-safe paraphrases instead of token-level replacements.