Detect¶
Entity detection is the first stage of every Anonymizer pipeline. Both replace and rewrite modes depend on it.
How it works¶
Detection combines a lightweight NER model (GLiNER-PII) with LLM-based refinement. GLiNER PII produces an initial set of entity spans, then an LLM augments it with entities the NER missed and validates each detection -- keeping, reclassifying, or dropping entities based on context.
When rewrite is configured, an additional step identifies latent entities -- sensitive information inferable from context but not explicitly stated in the text.
Example: standard vs. latent entities¶
Consider this short passage:
Sarah described her appointment. She's looking forward to ringing the bell soon and said the care team has been wonderful.
| Type | Value | Description |
|---|---|---|
| Standard entity | Sarah | first_name |
| Latent entity | ringing the bell | Reveals something sensitive - nearing the end of cancer treatment. It is only inferable from surrounding context (e.g. the passage never says "cancer"). |
Configuration¶
Detection is configured via the Detect object on AnonymizerConfig:
from anonymizer import AnonymizerConfig, Detect, Redact
config = AnonymizerConfig(
detect=Detect(),
replace=Redact(),
)
Detect fields¶
| Field | Default | Description |
|---|---|---|
entity_labels |
None (all defaults) |
List of labels to detect. Leave unset (or pass None) to use the full default set. |
gliner_threshold |
0.3 |
GLiNER confidence threshold (0.0--1.0). Lower values detect more entities but may increase false positives. |
Entity labels¶
Anonymizer ships with a comprehensive default label set covering:
- Direct identifiers (e.g.
first_name,last_name,email,ssn,date_of_birth,street_address) - Quasi-identifiers (e.g.
age,city,state,country,occupation,company_name,date) - Technical data (e.g.
api_key,password,url,ipv4,ipv6,device_identifier) - Demographics (e.g.
gender,race_ethnicity,religious_belief,political_view,language) - Financial (e.g.
credit_debit_card,account_number,bank_routing_number,tax_id)
To inspect the full list:
from anonymizer import DEFAULT_ENTITY_LABELS
print(DEFAULT_ENTITY_LABELS)
Custom labels¶
When you pass entity_labels explicitly, the augmenter operates in strict mode -- it only outputs entities matching your list. When entity_labels=None, the augmenter can create additional labels beyond the defaults (e.g., clinic_name, server_name).
# Strict: only detect these 3 labels
Detect(entity_labels=["first_name", "last_name", "email"])
# Permissive: detect all defaults + LLM can infer new label types
Detect() # entity_labels=None
Tuning the threshold¶
For gliner_threshold, start with the default 0.3. If you're seeing too many false positives, raise it to 0.5. If entities are being missed, try lowering to 0.2. The LLM validation step catches many false positives, so erring on the side of lower thresholds is usually safe.
Model roles¶
The detection pipeline uses three model roles, each mapped to a model alias in the default config:
| Role | Default alias | Purpose |
|---|---|---|
entity_detector |
gliner-pii-detector |
GLiNER-PII NER model. |
entity_validator |
gpt-oss-120b |
Validates and reclassifies detected entities. |
entity_augmenter |
gpt-oss-120b |
Finds entities the NER model missed. |
latent_detector |
nemotron-30b-thinking |
Identifies inferable entities (rewrite only). |
See Models for how to override these.