Detect¶
Entity detection is the first stage of every Anonymizer pipeline. Both replace and rewrite modes depend on it.
How it works¶
Detection combines a lightweight NER model (GLiNER-PII) with LLM-based refinement. GLiNER-PII produces an initial set of entity spans. An LLM then augments that set with entities the NER model missed and validates each detection, keeping, reclassifying, or dropping it based on context.
When rewrite is configured, an additional step identifies latent entities -- sensitive information inferable from context but not explicitly stated in the text.
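To make the validation step concrete, here is a minimal sketch of how keep, reclassify, and drop decisions could be applied to a set of candidate spans. The `Entity` and `Decision` types, the action strings, and `apply_decisions` are hypothetical names chosen for illustration; they are not Anonymizer's internal API, and the real validator's output format may differ.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Entity:
    """A detected span (hypothetical shape, for illustration only)."""
    text: str
    label: str


@dataclass(frozen=True)
class Decision:
    """A validator verdict for one candidate (hypothetical shape)."""
    action: str  # "keep", "reclassify", or "drop"
    new_label: Optional[str] = None


def apply_decisions(candidates, decisions):
    """Return the entities that survive validation, in their original order."""
    validated = []
    for entity, decision in zip(candidates, decisions):
        if decision.action == "keep":
            validated.append(entity)
        elif decision.action == "reclassify":
            # Same span, corrected label.
            validated.append(replace(entity, label=decision.new_label))
        elif decision.action != "drop":
            raise ValueError(f"unknown action: {decision.action!r}")
    return validated


candidates = [
    Entity("Sarah", "first_name"),
    Entity("bell", "company_name"),  # a likely false positive
    Entity("Jordan", "first_name"),  # context might show this is a country
]
decisions = [
    Decision("keep"),
    Decision("drop"),
    Decision("reclassify", new_label="country"),
]
validated = apply_decisions(candidates, decisions)
```

Here `validated` holds two entities: "Sarah" unchanged, and "Jordan" relabeled as `country`.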
Example: standard vs. latent entities¶
Consider this short passage:
Sarah described her appointment. She's looking forward to ringing the bell soon and said the care team has been wonderful.
| Type | Value | Description |
|---|---|---|
| Standard entity | Sarah | A directly stated first name. |
| Latent entity | cancer treatment | Inferred from context. The passage never explicitly says "cancer," but "ringing the bell" can imply nearing the end of cancer treatment. |
Configuration¶
Detection is configured via the Detect object on AnonymizerConfig:
from anonymizer import AnonymizerConfig, Detect, Redact

config = AnonymizerConfig(
    detect=Detect(),
    replace=Redact(),
)
Detect fields¶
| Field | Default | Description |
|---|---|---|
| entity_labels | None (all defaults) | List of labels to detect. Leave unset (or pass None) to use the full default set. |
| gliner_threshold | 0.3 | GLiNER confidence threshold (0.0--1.0). Lower values detect more entities but may increase false positives. |
| validation_max_entities_per_call | 100 | Maximum candidate entities per validator LLM call. Rows with more candidates are split into chunks. See Chunked validation. |
| validation_excerpt_window_chars | 500 | Characters of context included before and after a chunk's entity spans in the validator prompt. Bounds per-chunk prompt size; not the model's context-window limit. |
Chunked validation¶
When a row yields many entity candidates, validating them in a single LLM call can exceed the model's context window or the provider's rate limits (tokens-per-minute or requests-per-minute quotas that many hosted models enforce). Anonymizer automatically splits validation for such rows. Candidates are grouped in position order into chunks of at most validation_max_entities_per_call, and each chunk is validated independently with its own bounded text excerpt (validation_excerpt_window_chars characters before and after the chunk's entity spans). Decisions are then merged back into a single per-row set.
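The grouping and excerpt logic described above can be sketched in plain Python. This is an illustration under stated assumptions, not Anonymizer's implementation: `Candidate`, `chunk_candidates`, and `excerpt_bounds` are hypothetical names, and the sketch assumes the excerpt runs from the chunk's first span to its last, widened by the window on each side and clamped to the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A candidate entity span, as character offsets into the row text."""
    start: int
    end: int
    label: str


def chunk_candidates(candidates, max_per_call):
    """Group candidates in position order into chunks of at most max_per_call."""
    if max_per_call < 1:
        raise ValueError("max_per_call must be at least 1")
    ordered = sorted(candidates, key=lambda c: (c.start, c.end))
    return [ordered[i:i + max_per_call] for i in range(0, len(ordered), max_per_call)]


def excerpt_bounds(chunk, text_length, window_chars):
    """Return (start, end) offsets of the excerpt around a chunk, clamped to the text."""
    first = min(c.start for c in chunk)
    last = max(c.end for c in chunk)
    return max(0, first - window_chars), min(text_length, last + window_chars)


# Five candidates in a 2,000-character row, at most two per validator call.
candidates = [
    Candidate(1200, 1210, "city"),
    Candidate(10, 15, "first_name"),
    Candidate(40, 60, "email"),
    Candidate(900, 911, "ssn"),
    Candidate(1900, 1905, "age"),
]
chunks = chunk_candidates(candidates, max_per_call=2)
bounds = [excerpt_bounds(chunk, text_length=2000, window_chars=500) for chunk in chunks]
```

With these inputs the row splits into three chunks of sizes 2, 2, and 1, and each chunk's excerpt covers only the text near its own spans, which is what keeps per-call prompt size bounded regardless of row length.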
The chunked path is always on. If a row has no more candidates than the limit, it runs as a single call and is equivalent to the unchunked behavior. Tuning guidance:
- Raise validation_max_entities_per_call if your validator has a large context window and you want fewer, larger calls.
- Lower it if you hit provider rate limits or want more uniform per-call latency.
- Raise validation_excerpt_window_chars when short windows hide the context needed to disambiguate entities (e.g., "John" as first name vs. last name depends on surrounding text).
- Lower it to reduce per-chunk prompt tokens, at the risk of lower validation quality on context-sensitive labels.
Validator pools¶
entity_validator can be a single alias (the default) or a list of aliases — a pool. When multiple aliases are configured, each chunk in a row is dispatched to the next alias in round-robin order, which lets you work around per-alias rate limits by spreading requests across equivalent endpoints.
Pools also act as failover. If a chunk's assigned alias can't complete the call (an unrecoverable rate limit, a 5xx that didn't clear on retry, a malformed response), the same chunk is automatically retried against the other aliases in your pool before the row is given up on. A chunk only fails once every alias in the pool has failed for it. This is a cheap way to harden validation against any one endpoint having a bad day, on top of the load-spreading role.
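A minimal sketch of round-robin dispatch with failover, assuming a chunk's fallback aliases are tried in rotation order after its assigned one (the actual retry order is not specified here). The function names, the alias strings, and the simulated validator are all hypothetical.

```python
def alias_order(chunk_index, aliases):
    """Aliases to try for a chunk: its round-robin pick first, then the rest."""
    start = chunk_index % len(aliases)
    return aliases[start:] + aliases[:start]


def validate_chunk(chunk_index, chunk, aliases, call_validator):
    """Try each alias in turn; fail only once every alias has failed."""
    errors = {}
    for alias in alias_order(chunk_index, aliases):
        try:
            return alias, call_validator(alias, chunk)
        except Exception as exc:  # e.g. rate limit, persistent 5xx, malformed response
            errors[alias] = exc
    raise RuntimeError(f"all aliases failed for chunk {chunk_index}: {errors}")


# Simulated pool in which one endpoint is down.
def fake_validator(alias, chunk):
    if alias == "validator-b":
        raise TimeoutError("simulated outage")
    return [f"keep:{item}" for item in chunk]


pool = ["validator-a", "validator-b", "validator-c"]
chunks = [["Sarah"], ["sarah@example.com"], ["Austin"]]
served_by = [
    validate_chunk(i, chunk, pool, fake_validator)[0]
    for i, chunk in enumerate(chunks)
]
```

In this run the second chunk is assigned to the failing alias and falls through to the next one in the pool, so all three chunks still complete.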
What happens when a row can't be validated¶
If validation can't get a complete answer for a row (that is, every alias in the pool has failed on at least one of that row's chunks), the row is dropped from the output rather than passed through with some entities unvalidated. This is deliberate: the alternative would be writing the original text back out with those entities still unscrubbed, which risks leaking the sensitive data the pipeline is meant to remove.
Dropped rows show up on result.failed_records with step="detection", so you can tell which inputs didn't make it through by comparing input IDs against output IDs and reprocess those on a follow-up pass.
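The ID comparison can be sketched as follows, assuming each input record carries an ID that you control and that survives into the output. `ids_to_reprocess` is a hypothetical helper, not part of Anonymizer, and the sketch deliberately avoids assuming anything about the shape of result.failed_records.

```python
def ids_to_reprocess(input_ids, output_ids):
    """Return input IDs missing from the output, preserving input order."""
    seen = set(output_ids)
    return [record_id for record_id in input_ids if record_id not in seen]


input_ids = ["row-1", "row-2", "row-3", "row-4"]
output_ids = ["row-1", "row-3"]  # rows 2 and 4 were dropped during detection
retry_ids = ids_to_reprocess(input_ids, output_ids)
```

The resulting list can drive a follow-up pass over only the rows that were dropped.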
See Validator pools for the YAML syntax and caveats.
Entity labels¶
Anonymizer ships with a comprehensive default label set covering:
- Direct identifiers (e.g. first_name, last_name, email, ssn, date_of_birth, street_address)
- Quasi-identifiers (e.g. age, city, state, country, occupation, company_name, date)
- Technical data (e.g. api_key, password, url, ipv4, ipv6, device_identifier)
- Demographics (e.g. gender, race_ethnicity, religious_belief, political_view, language)
- Financial (e.g. credit_debit_card, account_number, bank_routing_number, tax_id)
To inspect the full list:
from anonymizer import DEFAULT_ENTITY_LABELS
print(DEFAULT_ENTITY_LABELS)
Custom labels¶
When you pass entity_labels explicitly, the augmenter operates in strict mode -- it only outputs entities matching your list. When entity_labels=None, the augmenter can create additional labels beyond the defaults (e.g., clinic_name, server_name).
# Strict: only detect these 3 labels
Detect(entity_labels=["first_name", "last_name", "email"])
# Permissive: detect all defaults + LLM can infer new label types
Detect() # entity_labels=None
Tuning the threshold¶
For gliner_threshold, start with the default 0.3. If you're seeing too many false positives, raise it to 0.5. If entities are being missed, try lowering to 0.2. The LLM validation step catches many false positives, so erring on the side of lower thresholds is usually safe.
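To see why the threshold trades recall against false positives, consider a toy filter over scored spans. The scores below are made up for illustration and are not real GLiNER output; the sketch also assumes a span is kept when its score is at least the threshold, which may differ from the model's exact comparison.

```python
def filter_by_threshold(scored_spans, threshold):
    """Keep spans whose confidence is at least the threshold."""
    return [text for text, score in scored_spans if score >= threshold]


# Made-up scores for illustration; not real GLiNER output.
scored_spans = [("Sarah", 0.92), ("care team", 0.34), ("bell", 0.22), ("soon", 0.08)]

kept_low = filter_by_threshold(scored_spans, 0.2)
kept_default = filter_by_threshold(scored_spans, 0.3)
kept_high = filter_by_threshold(scored_spans, 0.5)
```

Lowering the threshold from 0.5 to 0.2 admits two more candidates in this toy input. In the real pipeline, the extra candidates still pass through LLM validation, which is why a lower threshold tends to cost validator calls rather than output quality.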
Model roles¶
The detection pipeline uses four model roles, each mapped to a model alias in the default config:
| Role | Default alias | Purpose |
|---|---|---|
| entity_detector | gliner-pii-detector | GLiNER-PII NER model. |
| entity_validator | gpt-oss-120b | Validates and reclassifies detected entities. |
| entity_augmenter | gpt-oss-120b | Finds entities the NER model missed. |
| latent_detector | nemotron-30b-thinking | Identifies inferable entities (rewrite only). |
See Models for how to override these.