🕵️ Inspecting Detected Entities
Dig into the entity detection pipeline output -- what was detected, what the LLM validator kept or dropped, and where entities appear in the text.
This notebook is for users who need to debug detection quality, tune labels and/or thresholds, or investigate downstream replacement or rewriting results.
We use Annotate mode because it preserves the original text while tagging each entity with its label, making it ideal for reviewing detection quality.
📚 What you'll learn
- Run the detection pipeline and inspect its output using Annotate mode
- View tagged text with entities marked inline
- Break down detected entities by label, source, and unique value
- Identify and triage failed records
Tip: First time running notebooks? Start with setup instructions.
⚙️ Setup
- Check if your NVIDIA_API_KEY from build.nvidia.com is registered for model access.
  - The default build.nvidia.com (NVIDIA Build) setup is a convenient way to try Anonymizer and iterate on previews. Use of NVIDIA Build is subject to NVIDIA Build's own terms of service and privacy practices, which are separate from and independent of the NeMo Framework library. NVIDIA Build is intended for evaluation and testing purposes only and may not be used in production environments. Do not upload any confidential information or personal data when using NVIDIA Build. Your use of NVIDIA Build is logged for security purposes and to improve NVIDIA products and services.
  - Request and token rate limits on build.nvidia.com vary by account and model access, and lower-volume development access can be slow for full-dataset runs. Start with preview() on a small sample, then move to your own endpoint for production data and usage.
- Import the core classes -- this notebook uses Annotate to keep original values visible.
- Anonymizer() initializes with the default model provider -- no extra config needed.
- configure_logging(LoggingConfig.default()) keeps logs at INFO. Switch to LoggingConfig.debug() when troubleshooting.
In [2]:
import getpass
import os
from collections import Counter

import pandas as pd

if not os.getenv("NVIDIA_API_KEY"):
    key = getpass.getpass("Enter NVIDIA_API_KEY from build.nvidia.com: ").strip()
    if not key:
        raise RuntimeError("NVIDIA_API_KEY is required to run these notebooks.")
    os.environ["NVIDIA_API_KEY"] = key
In [ ]:
from anonymizer import Annotate, Anonymizer, AnonymizerConfig, AnonymizerInput, LoggingConfig, configure_logging
configure_logging(LoggingConfig.default())
In [4]:
anonymizer = Anonymizer()
[13:16:40] [INFO] 🔧 Anonymizer initialized with 3 model configs
[13:16:40] [INFO] |-- 🔎 detector: gliner-pii-detector
[13:16:40] [INFO] |-- ✅ validator: gpt-oss-120b
[13:16:40] [INFO] |-- 🧩 augmenter: gpt-oss-120b
👁️ Preview
- Detection runs as part of any strategy. Annotate keeps original text visible alongside entity labels -- ideal for debugging.
- trace_dataframe exposes every internal pipeline column; that's what we explore below.
In [5]:
config = AnonymizerConfig(replace=Annotate())
input_data = AnonymizerInput(
    source="https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv",
    text_column="biography",
    data_summary="Biographical profiles",
)
result = anonymizer.preview(
    config=config,
    data=input_data,
    num_records=3,
)
[13:16:40] [INFO] 👀 Preview mode: 📂 Loaded 3 records from https://raw.githubusercontent.com/NVIDIA-NeMo/Anonymizer/refs/heads/main/docs/data/NVIDIA_synthetic_biographies.csv (column: 'biography')
[13:16:40] [INFO] 🔍 Running entity detection on 3 records
[13:16:40] [INFO] detection labels in scope: (default: 65 labels; see anonymizer.DEFAULT_ENTITY_LABELS for list)
[13:17:12] [INFO] |-- 📋 Detection complete — 77 entities found across 3 records (0 failed) [31.9s]
[13:17:12] [INFO] |-- labels: first_name=23, organization_name=7, state=6, age=5, occupation=5, city=5, last_name=3, race_ethnicity=3, language=3, company_name=3, political_view=3, education_level=3, street_address=2, degree=1, university=1, date_of_birth=1, field_of_study=1, employment_status=1, religious_belief=1
[13:17:12] [INFO] 🔄 Running Annotate replacement
[13:17:12] [INFO] |-- 📋 Replacement complete (0 failed) [0.0s]
[13:17:12] [INFO] 🎉 Pipeline complete — 3 records processed, 0 total failures
🔍 Inspect
- display_record() renders an interactive view with entity highlights.
In [6]:
result.display_record(0)
📋 Columns
- trace_dataframe contains all internal columns from the pipeline (detection, validation, replacement, etc.).
In [7]:
df = result.trace_dataframe
print(f"Records: {len(df)}")
print(f"Columns: {list(df.columns)}")
Records: 3
Columns: ['biography', '_anonymizer_record_id', '_raw_detected_entities', '_seed_entities', '_tag_notation', '_seed_validation_candidates', '_seed_tagged_text', '_validated_entities', '_seed_entities_json', '_initial_tagged_text', '_validated_seed_entities', '_augmented_entities', '_merged_entities', '_merged_tagged_text', '_validation_candidates', '_detected_entities', 'biography_with_spans', 'final_entities', '_entities_by_value', '_replacement_map', 'biography_replaced']
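The column list includes several per-stage entity columns, which suggests a way to see what a later stage kept or dropped relative to an earlier one. The sketch below is only an illustration on made-up data: it assumes two stage columns hold entity dicts with label, start_position, and end_position keys (the shape shown for _detected_entities in this notebook), which is not confirmed for the other columns. The helper names as_entity_list, span_key, and diff_stages are hypothetical and not part of the anonymizer package.

```python
def as_entity_list(cell):
    """Return a plain list of entity dicts from a dict wrapper or a bare list."""
    if isinstance(cell, dict):
        return cell.get("entities", [])
    return list(cell) if cell is not None else []


def span_key(entity):
    # Identify an entity by its position in the text, not by its label.
    return (entity["start_position"], entity["end_position"])


def diff_stages(before, after):
    """Bucket spans as kept, relabeled, dropped, or added between two stages."""
    before_by_span = {span_key(e): e for e in as_entity_list(before)}
    after_by_span = {span_key(e): e for e in as_entity_list(after)}
    report = {"kept": [], "relabeled": [], "dropped": [], "added": []}
    for span, ent in before_by_span.items():
        if span not in after_by_span:
            report["dropped"].append(ent["value"])
        elif after_by_span[span]["label"] != ent["label"]:
            report["relabeled"].append(ent["value"])
        else:
            report["kept"].append(ent["value"])
    for span, ent in after_by_span.items():
        if span not in before_by_span:
            report["added"].append(ent["value"])
    return report


# Made-up stand-ins for an early-stage and a final-stage cell of one record.
raw_stage = [
    {"value": "Bobby", "label": "first_name", "start_position": 0, "end_position": 5},
    {"value": "Denver", "label": "state", "start_position": 60, "end_position": 66},
    {"value": "home", "label": "street_address", "start_position": 90, "end_position": 94},
]
final_stage = {"entities": [
    {"value": "Bobby", "label": "first_name", "start_position": 0, "end_position": 5},
    {"value": "Denver", "label": "city", "start_position": 60, "end_position": 66},
    {"value": "Jefferson High", "label": "organization_name", "start_position": 180, "end_position": 194},
]}

report = diff_stages(raw_stage, final_stage)
print(report)
```

Before applying this to real trace columns, print one cell from each column to confirm its shape; the keys assumed here may differ between stages.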
🎯 Detected entities
- Final entity list after validation. Each entity has value, label, positions, score, and source (detector / augmenter / name_split / propagation).
In [8]:
row_idx = 0
raw = df.loc[row_idx, "_detected_entities"]
# The cell may be a dict wrapping the list under "entities" or the bare list itself.
entities = raw["entities"] if isinstance(raw, dict) else raw
print(f"Record {row_idx}: {len(entities)} entities detected\n")
entity_df = pd.DataFrame(entities)
if not entity_df.empty:
    # Show only the columns that are actually present.
    cols = [c for c in ["value", "label", "start_position", "end_position", "source"] if c in entity_df.columns]
    print(entity_df[cols].to_string())
Record 0: 20 entities detected
value label start_position end_position source
0 Bobby first_name 0 5 detector
1 Watford last_name 6 13 detector
2 40 age 17 19 detector
3 Mexican race_ethnicity 29 36 detector
4 veterinarian occupation 37 49 detector
5 Denver city 60 66 detector
6 Colorado state 68 76 detector
7 Jefferson High organization_name 180 194 augmenter
8 DVM degree 210 213 detector
9 University of Colorado Boulder university 221 251 augmenter
10 English language 324 331 detector
11 Bobby first_name 333 338 detector
12 Bobby first_name 556 561 detector
13 VCA Animal Hospital company_name 576 595 detector
14 Colorado Veterinary Clinic organization_name 613 639 augmenter
15 Christian Democrat political_view 707 725 detector
16 Maya first_name 798 802 detector
17 Aria first_name 836 840 augmenter
18 Leo first_name 845 848 augmenter
19 Bobby first_name 870 875 detector
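The positions in the table above appear to be end-exclusive character offsets (Bobby spans 0 to 5). Under that assumption, a quick sanity check is to slice the text with each span and compare against the reported value. The sketch below does that on a made-up sentence and also renders a simple inline view. The [value|label] notation is hypothetical and is not the pipeline's own tag format, and the helpers assume spans do not overlap.

```python
def check_spans(text, entities):
    """Return entities whose [start, end) slice of text differs from their value."""
    return [
        e for e in entities
        if text[e["start_position"]:e["end_position"]] != e["value"]
    ]


def tag_inline(text, entities):
    """Wrap each span as [value|label], right to left so earlier offsets stay valid."""
    tagged = text
    for e in sorted(entities, key=lambda e: e["start_position"], reverse=True):
        start, end = e["start_position"], e["end_position"]
        tagged = f"{tagged[:start]}[{tagged[start:end]}|{e['label']}]{tagged[end:]}"
    return tagged


# Made-up sample data, not taken from the dataset.
sample_text = "Bobby Watford is 40 and lives in Denver."
sample_entities = [
    {"value": "Bobby", "label": "first_name", "start_position": 0, "end_position": 5},
    {"value": "Watford", "label": "last_name", "start_position": 6, "end_position": 13},
    {"value": "40", "label": "age", "start_position": 17, "end_position": 19},
    {"value": "Denver", "label": "city", "start_position": 33, "end_position": 39},
]

mismatches = check_spans(sample_text, sample_entities)
tagged = tag_inline(sample_text, sample_entities)
print(mismatches)
print(tagged)
```

A non-empty mismatch list on real data would point at an offset problem (for example, positions computed against a differently normalized text) rather than a labeling problem.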
🏷️ Labels
- Entity label distribution across all records -- which types are most common.
In [9]:
label_counts = Counter()
for raw in df["_detected_entities"]:
    entity_list = raw["entities"] if isinstance(raw, dict) else raw
    for entity in entity_list:
        label_counts[entity["label"]] += 1
for label, count in label_counts.most_common():
    print(f" {label}: {count}")
first_name: 23
organization_name: 7
state: 6
age: 5
occupation: 5
city: 5
last_name: 3
race_ethnicity: 3
language: 3
company_name: 3
political_view: 3
education_level: 3
street_address: 2
degree: 1
university: 1
date_of_birth: 1
field_of_study: 1
employment_status: 1
religious_belief: 1
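When tuning thresholds, counts per label only tell part of the story; it can also help to see which labels carry low-scoring entities. The sketch below groups low-score entities by label on made-up data. It assumes score is a number where higher means more confident and that some entities may have no score at all; neither is confirmed here. The 0.5 cutoff is arbitrary and low_score_by_label is a hypothetical helper, not a library setting.

```python
from collections import defaultdict


def low_score_by_label(entities, threshold=0.5):
    """Group (value, score) pairs below threshold by label; skip missing scores."""
    flagged = defaultdict(list)
    for e in entities:
        score = e.get("score")
        if isinstance(score, (int, float)) and score < threshold:
            flagged[e["label"]].append((e["value"], score))
    return dict(flagged)


# Made-up entities; real scores and their range may look different.
sample = [
    {"value": "Bobby", "label": "first_name", "score": 0.98},
    {"value": "Leo", "label": "first_name", "score": 0.41},
    {"value": "Jefferson High", "label": "organization_name", "score": None},
    {"value": "Christian Democrat", "label": "political_view", "score": 0.47},
]

flagged = low_score_by_label(sample, threshold=0.5)
for label, items in flagged.items():
    print(label, items)
```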
📡 Sources
- Where each entity came from in the pipeline:
  - detector -- GLiNER NER
  - augmenter -- LLM-added (missed by GLiNER)
  - validator -- LLM decision step over detector-seed entities (keep/reclass/drop); does not emit a separate source value
  - name_split -- derived from splitting full names
  - propagation -- expanded from validated entities to all text occurrences
In [10]:
source_counts = Counter()
for raw in df["_detected_entities"]:
    entity_list = raw["entities"] if isinstance(raw, dict) else raw
    for entity in entity_list:
        source_counts[entity.get("source", "unknown")] += 1
for source, count in source_counts.most_common():
    print(f" {source}: {count}")
detector: 68
augmenter: 9
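Overall source counts hide which labels depend on the augmenter. As a rough sketch on made-up data, the label and source counters can be combined to estimate, per label, the share of entities that GLiNER missed and the LLM added. The helpers label_source_counts and augmenter_share are hypothetical names, and the input shape mirrors the cells above (a dict with an "entities" key or a bare list).

```python
from collections import Counter


def label_source_counts(records):
    """Count entities per (label, source) pair across all records."""
    counts = Counter()
    for raw in records:
        entity_list = raw["entities"] if isinstance(raw, dict) else raw
        for entity in entity_list:
            counts[(entity["label"], entity.get("source", "unknown"))] += 1
    return counts


def augmenter_share(counts):
    """Fraction of each label's entities whose source is 'augmenter'."""
    totals, augmented = Counter(), Counter()
    for (label, source), n in counts.items():
        totals[label] += n
        if source == "augmenter":
            augmented[label] += n
    return {label: augmented[label] / totals[label] for label in totals}


# Made-up records covering both cell shapes.
records = [
    {"entities": [
        {"label": "first_name", "source": "detector"},
        {"label": "first_name", "source": "augmenter"},
        {"label": "organization_name", "source": "augmenter"},
    ]},
    [
        {"label": "first_name", "source": "detector"},
        {"label": "city", "source": "detector"},
    ],
]

counts = label_source_counts(records)
share = augmenter_share(counts)
for label, fraction in sorted(share.items(), key=lambda kv: -kv[1]):
    print(f"{label}: {fraction:.0%} from augmenter")
```

A label with a high augmenter share is one the detector rarely catches on its own, which may be worth keeping in mind when adjusting labels.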
📊 By value
- Entities grouped by unique value -- this is what drives consistent replacement downstream (same name always maps to the same substitute).
In [11]:
row_idx = 0
raw_bv = df.loc[row_idx, "_entities_by_value"]
by_value = raw_bv["entities_by_value"] if isinstance(raw_bv, dict) else raw_bv
print(f"Record {row_idx}: {len(by_value)} unique entity values\n")
for entry in by_value:
    print(f" {entry['value']!r} -> labels: {entry['labels']}")
Record 0: 17 unique entity values

'40' -> labels: ['age']
'Aria' -> labels: ['first_name']
'Bobby' -> labels: ['first_name']
'Christian Democrat' -> labels: ['political_view']
'Colorado' -> labels: ['state']
'Colorado Veterinary Clinic' -> labels: ['organization_name']
'DVM' -> labels: ['degree']
'Denver' -> labels: ['city']
'English' -> labels: ['language']
'Jefferson High' -> labels: ['organization_name']
'Leo' -> labels: ['first_name']
'Maya' -> labels: ['first_name']
'Mexican' -> labels: ['race_ethnicity']
'University of Colorado Boulder' -> labels: ['university']
'VCA Animal Hospital' -> labels: ['company_name']
'Watford' -> labels: ['last_name']
'veterinarian' -> labels: ['occupation']
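Because each entry carries a list of labels, a value could in principle end up with more than one. Every value in this record has a single label, but on other data a quick filter might surface ambiguous values worth reviewing. A minimal sketch on made-up entries, using the value and labels keys printed above (multi_label_values is a hypothetical helper):

```python
def multi_label_values(by_value):
    """Return {value: sorted labels} for values tagged with more than one distinct label."""
    return {
        entry["value"]: sorted(set(entry["labels"]))
        for entry in by_value
        if len(set(entry["labels"])) > 1
    }


# Made-up entries; the second one is deliberately ambiguous.
sample_by_value = [
    {"value": "Colorado", "labels": ["state"]},
    {"value": "Jordan", "labels": ["first_name", "country"]},
    {"value": "Bobby", "labels": ["first_name", "first_name"]},
]

ambiguous = multi_label_values(sample_by_value)
print(ambiguous)
```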
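❌ Failures
- Records that failed at a pipeline step (LLM timeout, parse error, etc.) and were dropped from the output.
- Check this to understand data loss in your pipeline; each entry reports the record ID, the step that failed, and the reason.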
In [12]:
if result.failed_records:
    for fr in result.failed_records:
        print(f" record_id={fr.record_id}, step={fr.step}, reason={fr.reason}")
else:
    print("No failed records.")
No failed records.
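On larger runs, a per-record listing gets long, so a summary by step can make triage quicker. The sketch below uses a stand-in dataclass with the three attributes printed above (record_id, step, reason). The class FailedRecordStub and the helper summarize_failures are hypothetical, and the sample failures are made up.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FailedRecordStub:
    """Stand-in with the attributes read in the cell above; not the library's class."""
    record_id: str
    step: str
    reason: str


def summarize_failures(failed_records, total_records):
    """Count failures per step and compute the share of distinct records that failed."""
    by_step = Counter(fr.step for fr in failed_records)
    failed_ids = sorted({fr.record_id for fr in failed_records})
    rate = len(failed_ids) / total_records if total_records else 0.0
    return {"by_step": dict(by_step), "failed_ids": failed_ids, "failure_rate": rate}


# Made-up failures for illustration only.
failed = [
    FailedRecordStub("r2", "detection", "LLM timeout"),
    FailedRecordStub("r7", "detection", "parse error"),
    FailedRecordStub("r7", "replacement", "parse error"),
]

summary = summarize_failures(failed, total_records=10)
print(summary)
```

The same function should work on result.failed_records as long as its items expose those three attributes, with len(df) plus the number of failed records as a plausible denominator if failed rows are absent from the trace dataframe; that last point is an assumption to verify.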
⏭️ Next steps
- 🕵️ Your First Anonymization -- the simplest end-to-end replace workflow if you haven't run it yet.
- 🎯 Choosing a Replacement Strategy -- compare Redact, Annotate, Hash, and Substitute side-by-side.
- ✏️ Rewriting Biographies -- generate privacy-safe paraphrases instead of token-level replacements.