Detect

Entity detection is the first stage of every Anonymizer pipeline. Both replace and rewrite modes depend on it.


How it works

Detection combines a lightweight NER model (GLiNER-PII) with LLM-based refinement. GLiNER-PII produces an initial set of entity spans. An LLM then augments that set with entities the NER model missed and validates each detection -- keeping, reclassifying, or dropping entities based on context.

When rewrite is configured, an additional step identifies latent entities -- sensitive information inferable from context but not explicitly stated in the text.
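The flow above can be sketched with stand-in functions. This is only an illustration of the detect, augment, validate ordering: `Entity`, `span`, and the three `*_stub` functions are hypothetical names, and the hard-coded results stand in for what the NER model and the LLMs would return. It is not Anonymizer's internal code.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Entity:
    start: int
    end: int
    label: str


def span(text, value, label):
    """Locate the first occurrence of `value` and wrap it as an Entity."""
    start = text.index(value)
    return Entity(start, start + len(value), label)


def ner_stub(text):
    # Stand-in for the NER pass: one correct hit, one wrong label, one false positive.
    return [span(text, "Sarah", "first_name"),
            span(text, "May", "first_name"),
            span(text, "Email", "email")]


def augment_stub(text, found):
    # Stand-in for the augmenter LLM: adds spans the NER pass missed.
    return [span(text, "Boston", "city"),
            span(text, "sarah@example.com", "email")]


def validate_stub(text, candidates):
    # Stand-in for the validator LLM: keep, reclassify, or drop each candidate.
    decisions = {"May": ("reclassify", "date"), "Email": ("drop", None)}
    kept = []
    for ent in candidates:
        action, new_label = decisions.get(text[ent.start:ent.end], ("keep", None))
        if action == "drop":
            continue
        if action == "reclassify":
            ent = replace(ent, label=new_label)
        kept.append(ent)
    return kept


def detect(text):
    candidates = ner_stub(text)
    candidates = candidates + augment_stub(text, candidates)
    return sorted(validate_stub(text, candidates), key=lambda e: e.start)


text = "Sarah will fly to Boston in May. Email: sarah@example.com"
entities = detect(text)
for ent in entities:
    print(ent.label, repr(text[ent.start:ent.end]))
```

In this toy run the validator keeps "Sarah", reclassifies "May" from first_name to date, and drops the literal word "Email", while the augmenter contributes the city and the email address.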

Example: standard vs. latent entities

Consider this short passage:

Sarah described her appointment. She's looking forward to ringing the bell soon and said the care team has been wonderful.

| Type | Value | Description |
| --- | --- | --- |
| Standard entity | Sarah | A directly stated first name. |
| Latent entity | cancer treatment | Inferred from context. The passage never explicitly says "cancer," but "ringing the bell" can imply nearing the end of cancer treatment. |

Configuration

Detection is configured via the Detect object on AnonymizerConfig:

from anonymizer import AnonymizerConfig, Detect, Redact

config = AnonymizerConfig(
    detect=Detect(),
    replace=Redact(),
)

Detect fields

| Field | Default | Description |
| --- | --- | --- |
| entity_labels | None (all defaults) | List of labels to detect. Leave unset (or pass None) to use the full default set. |
| gliner_threshold | 0.3 | GLiNER confidence threshold (0.0--1.0). Lower values detect more entities but may increase false positives. |
| validation_max_entities_per_call | 100 | Maximum candidate entities per validator LLM call. Rows with more candidates are split into chunks. See Chunked validation. |
| validation_excerpt_window_chars | 500 | Characters of context included before and after a chunk's entity spans in the validator prompt. Bounds per-chunk prompt size; not the model's context-window limit. |

Chunked validation

When a row yields many entity candidates, validating them in a single LLM call can exceed the model's context window or the provider's rate limits (the tokens-per-minute or requests-per-minute quotas that many hosted models enforce). Anonymizer automatically splits validation for such rows: candidates are grouped in position order into chunks of at most validation_max_entities_per_call, and each chunk is validated independently with its own bounded text excerpt (validation_excerpt_window_chars before and after the chunk's span). The per-chunk decisions are then merged back into a single per-row set.

The chunked path is always on; if a row has no more candidates than the limit, it runs as a single call and is exactly equivalent to the unchunked behavior. Tuning guidance:

  • Raise validation_max_entities_per_call if your validator has a large context window and you want fewer, larger calls.
  • Lower it if you hit provider rate limits or want more uniform per-call latency.
  • Raise validation_excerpt_window_chars when short windows hide the context needed to disambiguate entities (e.g., "John" as first name vs. last name depends on surrounding text).
  • Lower it to reduce per-chunk prompt tokens, at the risk of lower validation quality on context-sensitive labels.
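To make the two settings concrete, here is a minimal sketch of position-ordered chunking with a bounded excerpt. The function names and the `(start, end, label)` tuple layout are assumptions made for illustration; Anonymizer's actual implementation may differ in its details.

```python
def chunk_candidates(candidates, max_per_call):
    """Group (start, end, label) candidates in position order into chunks."""
    ordered = sorted(candidates, key=lambda c: c[0])
    return [ordered[i:i + max_per_call]
            for i in range(0, len(ordered), max_per_call)]


def excerpt_for_chunk(text, chunk, window_chars):
    """Slice the text covering the chunk's spans plus a window on each side."""
    start = max(0, min(c[0] for c in chunk) - window_chars)
    end = min(len(text), max(c[1] for c in chunk) + window_chars)
    return text[start:end]


# Synthetic row: 250 two-character candidates spread across 1000 characters.
text = "x" * 1000
candidates = [(i * 4, i * 4 + 2, "first_name") for i in range(250)]

chunks = chunk_candidates(candidates, max_per_call=100)
chunk_sizes = [len(chunk) for chunk in chunks]
middle_excerpt = excerpt_for_chunk(text, chunks[1], window_chars=50)

# A row under the limit produces exactly one chunk, i.e. a single call.
small_row = chunk_candidates(candidates[:3], max_per_call=100)
```

With a limit of 100, the 250 candidates split into three chunks, and each excerpt grows or shrinks only with the window size and the extent of that chunk's spans, not with the length of the whole row.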

Validator pools

entity_validator can be a single alias (the default) or a list of aliases — a pool. When multiple aliases are configured, each chunk in a row is dispatched to the next alias in round-robin order, which lets you work around per-alias rate limits by spreading requests across equivalent endpoints.

Pools also act as failover. If a chunk's assigned alias can't complete the call (an unrecoverable rate limit, a 5xx that didn't clear on retry, a malformed response), the same chunk is automatically retried against the other aliases in your pool before the row is given up on. A chunk only fails once every alias in the pool has failed for it. This is a cheap way to harden validation against any one endpoint having a bad day, on top of the load-spreading role.
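The dispatch and failover behavior can be sketched as follows. This assumes a hypothetical `call(alias, chunk)` function that raises `ValidatorError` when an alias cannot complete a chunk. The order in which the remaining aliases are tried is also an assumption here, since the text above only says the other aliases are tried.

```python
class ValidatorError(Exception):
    """Raised by the hypothetical `call` when an alias cannot complete a chunk."""


def validate_row(chunks, aliases, call):
    """Return merged decisions for a row, or None if any chunk fails on every alias."""
    decisions = []
    for index, chunk in enumerate(chunks):
        # Round-robin: chunk i starts at alias i mod pool size.
        first = index % len(aliases)
        order = aliases[first:] + aliases[:first]
        for alias in order:
            try:
                decisions.extend(call(alias, chunk))
                break
            except ValidatorError:
                continue  # fail over to the next alias in the pool
        else:
            return None  # every alias failed for this chunk
    return decisions


attempts = []


def flaky_call(alias, chunk):
    attempts.append((alias, chunk[0]))
    if alias == "primary" and "e3" in chunk:
        raise ValidatorError("simulated 5xx")
    return [(entity, "keep") for entity in chunk]


def always_failing_call(alias, chunk):
    raise ValidatorError("simulated outage")


chunks = [["e1"], ["e2"], ["e3"]]
pool = ["primary", "backup"]
row_decisions = validate_row(chunks, pool, flaky_call)
dropped = validate_row(chunks, pool, always_failing_call)
```

In the first run the third chunk fails on "primary" and succeeds on "backup", so the row still gets a full set of decisions. In the second run no alias can serve the chunk, and the sketch returns None to signal that the row cannot be validated.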

What happens when a row can't be validated

If validation can't get a complete answer for a row (that is, every alias in the pool has failed on at least one of that row's chunks), the row is dropped from the output rather than passed through with some entities unvalidated. This is deliberate: the alternative would be writing the original text back out with those entities still unscrubbed, which would leak the very data the pipeline is meant to remove.

Dropped rows show up on result.failed_records with step="detection", so you can tell which inputs didn't make it through by comparing input IDs against output IDs and reprocess those on a follow-up pass.
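The ID comparison can be as simple as a set difference. The sketch below uses plain dictionaries with a hypothetical "id" key, because the exact shape of your input rows and of Anonymizer's output is not specified here; adapt the key and row access to your data.

```python
def missing_ids(input_rows, output_rows, id_key="id"):
    """Return IDs present in the input but absent from the output, in input order."""
    processed = {row[id_key] for row in output_rows}
    return [row[id_key] for row in input_rows if row[id_key] not in processed]


# Hypothetical rows; only the ID column matters for the comparison.
input_rows = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}, {"id": 3, "text": "c"}]
output_rows = [{"id": 1, "text": "[a]"}, {"id": 3, "text": "[c]"}]

to_retry = missing_ids(input_rows, output_rows)
retry_batch = [row for row in input_rows if row["id"] in set(to_retry)]
```

The resulting `retry_batch` holds the original rows to feed into a follow-up pass.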

See Validator pools for the YAML syntax and caveats.

Entity labels

Anonymizer ships with a comprehensive default label set covering:

  • Direct identifiers (e.g. first_name, last_name, email, ssn, date_of_birth, street_address)
  • Quasi-identifiers (e.g. age, city, state, country, occupation, company_name, date)
  • Technical data (e.g. api_key, password, url, ipv4, ipv6, device_identifier)
  • Demographics (e.g. gender, race_ethnicity, religious_belief, political_view, language)
  • Financial (e.g. credit_debit_card, account_number, bank_routing_number, tax_id)

To inspect the full list:

from anonymizer import DEFAULT_ENTITY_LABELS
print(DEFAULT_ENTITY_LABELS)

Custom labels

When you pass entity_labels explicitly, the augmenter operates in strict mode -- it only outputs entities matching your list. When entity_labels=None, the augmenter can create additional labels beyond the defaults (e.g., clinic_name, server_name).

# Strict: only detect these 3 labels
Detect(entity_labels=["first_name", "last_name", "email"])

# Permissive: detect all defaults + LLM can infer new label types
Detect()  # entity_labels=None
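The difference between the two modes amounts to a filter on the augmenter's output. The following sketch is a simplified model of that behavior, not the library's code; `filter_augmented` and the `(text, label)` tuples are hypothetical.

```python
def filter_augmented(entities, entity_labels=None):
    """Keep every (text, label) pair in permissive mode, only listed labels in strict mode."""
    if entity_labels is None:
        return list(entities)  # permissive: new label types pass through
    allowed = set(entity_labels)
    return [ent for ent in entities if ent[1] in allowed]


proposed = [("Sarah", "first_name"),
            ("sarah@example.com", "email"),
            ("Lakeside Clinic", "clinic_name")]

strict = filter_augmented(proposed, ["first_name", "last_name", "email"])
permissive = filter_augmented(proposed)
```

In strict mode the inferred clinic_name entity is discarded because it is not in the list; in permissive mode it is kept.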

Tuning the threshold

For gliner_threshold, start with the default 0.3. If you're seeing too many false positives, raise it to 0.5. If entities are being missed, try lowering to 0.2. The LLM validation step catches many false positives, so erring on the side of lower thresholds is usually safe.
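To see the trade-off, the sketch below applies three thresholds to a handful of made-up confidence scores. The scores and the `filter_by_threshold` helper are illustrative assumptions, as is the choice of treating a score equal to the threshold as a match.

```python
def filter_by_threshold(scored_spans, threshold=0.3):
    """Keep (text, score) spans whose score meets the threshold."""
    return [span for span in scored_spans if span[1] >= threshold]


# Made-up scores: two likely entities, one borderline, one likely noise.
scored_spans = [("Sarah", 0.92), ("Boston", 0.45), ("May", 0.25), ("bell", 0.12)]

kept_by_threshold = {
    threshold: [text for text, _ in filter_by_threshold(scored_spans, threshold)]
    for threshold in (0.2, 0.3, 0.5)
}
for threshold, kept in kept_by_threshold.items():
    print(threshold, kept)
```

Lowering the threshold from 0.5 to 0.2 takes the candidate set from one span to three. The extra candidates then pass through LLM validation, which is why a lower threshold mostly costs validator calls rather than output quality.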


Model roles

The detection pipeline uses four model roles (one of them only when rewrite is configured), each mapped to a model alias in the default config:

| Role | Default alias | Purpose |
| --- | --- | --- |
| entity_detector | gliner-pii-detector | GLiNER-PII NER model. |
| entity_validator | gpt-oss-120b | Validates and reclassifies detected entities. |
| entity_augmenter | gpt-oss-120b | Finds entities the NER model missed. |
| latent_detector | nemotron-30b-thinking | Identifies inferable entities (rewrite only). |

See Models for how to override these.