Skip to content

Detect

Entity detection is the first stage of every Anonymizer pipeline. Both replace and rewrite modes depend on it.


How it works

Detection combines a lightweight NER model (GLiNER-PII) with LLM-based refinement. GLiNER PII produces an initial set of entity spans, then an LLM augments it with entities the NER missed and validates each detection -- keeping, reclassifying, or dropping entities based on context.

When rewrite is configured, an additional step identifies latent entities -- sensitive information inferable from context but not explicitly stated in the text.

Example: standard vs. latent entities

Consider this short passage:

Sarah described her appointment. She's looking forward to ringing the bell soon and said the care team has been wonderful.

Type Value Description
Standard entity Sarah A directly stated first name.
Latent entity cancer treatment Inferred from context. The passage never explicitly says "cancer," but "ringing the bell" can imply nearing the end of cancer treatment.

Configuration

Detection is configured via the Detect object on AnonymizerConfig:

from anonymizer import AnonymizerConfig, Detect, Redact

config = AnonymizerConfig(
    detect=Detect(),
    replace=Redact(),
)

Detect fields

Field Default Description
entity_labels None (all defaults) List of labels to detect. Leave unset (or pass None) to use the full default set.
gliner_threshold 0.3 GLiNER confidence threshold (0.0--1.0). Lower values detect more entities but may increase false positives.

Entity labels

Anonymizer ships with a comprehensive default label set covering:

  • Direct identifiers (e.g. first_name, last_name, email, ssn, date_of_birth, street_address)
  • Quasi-identifiers (e.g. age, city, state, country, occupation, company_name, date)
  • Technical data (e.g. api_key, password, url, ipv4, ipv6, device_identifier)
  • Demographics (e.g. gender, race_ethnicity, religious_belief, political_view, language)
  • Financial (e.g. credit_debit_card, account_number, bank_routing_number, tax_id)

To inspect the full list:

from anonymizer import DEFAULT_ENTITY_LABELS
print(DEFAULT_ENTITY_LABELS)

Custom labels

When you pass entity_labels explicitly, the augmenter operates in strict mode -- it only outputs entities matching your list. When entity_labels=None, the augmenter can create additional labels beyond the defaults (e.g., clinic_name, server_name).

# Strict: only detect these 3 labels
Detect(entity_labels=["first_name", "last_name", "email"])

# Permissive: detect all defaults + LLM can infer new label types
Detect()  # entity_labels=None

Tuning the threshold

For gliner_threshold, start with the default 0.3. If you're seeing too many false positives, raise it to 0.5. If entities are being missed, try lowering to 0.2. The LLM validation step catches many false positives, so erring on the side of lower thresholds is usually safe.


Model roles

The detection pipeline uses three model roles, each mapped to a model alias in the default config:

Role Default alias Purpose
entity_detector gliner-pii-detector GLiNER-PII NER model.
entity_validator gpt-oss-120b Validates and reclassifies detected entities.
entity_augmenter gpt-oss-120b Finds entities the NER model missed.
latent_detector nemotron-30b-thinking Identifies inferable entities (rewrite only).

See Models for how to override these.