anonymizer_config
Classes:
| Name | Description |
|---|---|
| AnonymizerInput | Input source definition for the anonymizer pipeline. |
| Detect | Configuration for the entity detection stage. |
| Rewrite | Configuration for rewrite-mode execution. |
| AnonymizerConfig | Primary user-facing config for anonymization behavior. |
Functions:
| Name | Description |
|---|---|
| is_remote_input_source | Return True when the input source is an HTTP(S) URL. |
| has_unsupported_url_scheme | Return True when the input looks like a URL but uses an unsupported scheme. |
| infer_input_source_suffix | Infer the lowercase file suffix from a local path or remote URL path. |
AnonymizerInput
pydantic-model
Bases: BaseModel
Input source definition for the anonymizer pipeline.
Format is inferred from the file extension of a local path or HTTP(S) URL.
Fields:
- source (str)
- text_column (str)
- id_column (str | None)
- data_summary (str | None)
Validators:
- validate_source_path → source
source
pydantic-field
Local path or HTTP(S) URL for a .csv or .parquet input file.
text_column = 'text'
pydantic-field
Column containing the text to anonymize.
id_column = None
pydantic-field
Optional column to use as record identifier.
data_summary = None
pydantic-field
Short description of the data. Improves LLM detection accuracy.
Detect
pydantic-model
Bases: BaseModel
Configuration for the entity detection stage.
Fields:
- entity_labels (list[str] | None)
- gliner_threshold (float)
- validation_max_entities_per_call (int)
- validation_excerpt_window_chars (int)
Validators:
- validate_entity_labels → entity_labels
entity_labels = None
pydantic-field
Labels to detect. None uses the built-in default detection label set. To inspect the default set, use from anonymizer import DEFAULT_ENTITY_LABELS.
gliner_threshold = 0.3
pydantic-field
GLiNER detection confidence threshold (0.0-1.0).
validation_max_entities_per_call = 100
pydantic-field
Maximum number of candidate entities included in a single validator LLM call. When a row has more candidates than this, validation is split into chunks that are dispatched (round-robin) across the validator pool.
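The chunking and round-robin dispatch described above can be sketched as follows. This is a minimal illustration using only the standard library, not the library's implementation: `split_into_chunks`, `assign_round_robin`, and the validator names are hypothetical, and the real validator pool may order or balance work differently.

```python
from itertools import cycle
from typing import Sequence, TypeVar

T = TypeVar("T")


def split_into_chunks(candidates: Sequence[T], max_per_call: int) -> list[list[T]]:
    """Split candidates into consecutive chunks of at most max_per_call items."""
    if max_per_call < 1:
        raise ValueError("max_per_call must be at least 1")
    return [
        list(candidates[i : i + max_per_call])
        for i in range(0, len(candidates), max_per_call)
    ]


def assign_round_robin(chunks: Sequence[list[T]], validators: Sequence[str]) -> list[tuple[str, list[T]]]:
    """Pair each chunk with a validator, cycling through the pool in order."""
    if not validators:
        raise ValueError("validator pool must not be empty")
    # zip stops when the chunks run out, so cycle() never loops forever here.
    return list(zip(cycle(validators), chunks))


# Hypothetical row with 250 candidate entities and a pool of two validators.
candidates = [f"entity_{i}" for i in range(250)]
chunks = split_into_chunks(candidates, max_per_call=100)
assignments = assign_round_robin(chunks, ["validator_a", "validator_b"])
```

With the default of 100, the 250 candidates become three chunks (100, 100, and 50 entities), and the third chunk wraps back to the first validator.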
validation_excerpt_window_chars = 500
pydantic-field
Number of characters to include before and after a chunk's entity span when building the text excerpt sent to the validator. Bounds the prompt context the validator sees per chunk; it is NOT the LLM's context window limit.
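One way to picture the excerpt window is as a slice clamped to the bounds of the text. The sketch below rests on assumptions: `build_excerpt` is a hypothetical helper, and it treats a chunk's span as running from the earliest entity start to the latest entity end, which the source does not spell out.

```python
def build_excerpt(
    text: str,
    entity_spans: list[tuple[int, int]],
    window_chars: int = 500,
) -> str:
    """Return the text around a chunk's entities, padded by window_chars on each side."""
    if not entity_spans:
        raise ValueError("entity_spans must not be empty")
    # Assumption: the chunk's span covers all of its entities.
    span_start = min(start for start, _ in entity_spans)
    span_end = max(end for _, end in entity_spans)
    # Clamp so the window never runs past either end of the text.
    excerpt_start = max(0, span_start - window_chars)
    excerpt_end = min(len(text), span_end + window_chars)
    return text[excerpt_start:excerpt_end]


long_text = "x" * 1000 + "Alice" + "y" * 1000
excerpt = build_excerpt(long_text, [(1000, 1005)], window_chars=500)
```

Here the excerpt holds 500 characters of leading context, the five-character entity, and 500 characters of trailing context, regardless of how long the full text is.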
Rewrite
pydantic-model
Bases: BaseModel
Configuration for rewrite-mode execution.
Fields:
- privacy_goal (PrivacyGoal | None)
- instructions (str | None)
- risk_tolerance (RiskTolerance)
- max_repair_iterations (int)
- strict_entity_protection (bool)
Validators:
- populate_default_privacy_goal
privacy_goal = None
pydantic-field
Structured privacy goal. Auto-populated with defaults if not provided.
instructions = None
pydantic-field
Additional instructions for the rewrite LLM.
risk_tolerance = RiskTolerance.low
pydantic-field
Preset controlling repair thresholds and review flagging.
max_repair_iterations = 3
pydantic-field
Maximum repair rounds. Set to 0 to disable repair.
strict_entity_protection = False
pydantic-field
If True, requires every entity to receive a protective disposition during sensitivity analysis.
evaluation
property
Construct EvaluationCriteria from this Rewrite config for the engine.
Rewrite and EvaluationCriteria both carry max_repair_iterations. This property keeps them in sync by passing through self.risk_tolerance and self.max_repair_iterations. Leakage thresholds and repair parameters are derived from risk_tolerance via _RiskToleranceBundle (see rewrite.py).
Production code that starts from a user-facing Rewrite should pass rewrite.evaluation into the engine and never duplicate the mapping manually. Tests and engine-internal callers may construct EvaluationCriteria directly when they aren't routing through a user-facing Rewrite.
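The pass-through pattern can be illustrated with simplified stand-ins. The classes below are hypothetical dataclasses that only borrow the names from this page. They are not the library's pydantic models, and they leave out the threshold derivation through _RiskToleranceBundle entirely.

```python
from dataclasses import dataclass
from enum import Enum


class RiskTolerance(str, Enum):
    # Only the member shown on this page; the real enum likely has more.
    low = "low"


@dataclass(frozen=True)
class EvaluationCriteria:
    """Stand-in for the engine-facing criteria (fields limited to the shared ones)."""
    risk_tolerance: RiskTolerance
    max_repair_iterations: int


@dataclass
class Rewrite:
    """Stand-in for the user-facing config."""
    risk_tolerance: RiskTolerance = RiskTolerance.low
    max_repair_iterations: int = 3

    @property
    def evaluation(self) -> EvaluationCriteria:
        # Single place where user-facing values are mapped to engine values,
        # so callers never copy the fields by hand.
        return EvaluationCriteria(
            risk_tolerance=self.risk_tolerance,
            max_repair_iterations=self.max_repair_iterations,
        )


rewrite = Rewrite(max_repair_iterations=0)  # 0 disables repair
criteria = rewrite.evaluation
```

Because the property builds the criteria on each access, a change to the Rewrite config shows up in the next `rewrite.evaluation` call without any manual copying.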
AnonymizerConfig
pydantic-model
Bases: BaseModel
Primary user-facing config for anonymization behavior.
Fields:
- detect
- replace
- rewrite
- emit_telemetry
Validators:
- validate_exactly_one_mode
detect
pydantic-field
Entity detection configuration.
replace = None
pydantic-field
Replacement method (Substitute(), Redact(), Annotate(), or Hash()).
rewrite = None
pydantic-field
Optional rewrite-mode parameters.
emit_telemetry = True
pydantic-field
Whether to emit anonymous Anonymizer telemetry events. See the Telemetry section in the README for what is collected and how to opt out at the environment or CLI level.
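The validator name validate_exactly_one_mode suggests that a config must select exactly one of replace and rewrite. That reading is an assumption, since this page does not show the validator's body. The sketch below uses a hypothetical stand-in class, `ConfigSketch`, rather than the real pydantic model, to show what such a check could look like.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ConfigSketch:
    """Hypothetical stand-in for AnonymizerConfig; field types are simplified."""
    replace: Optional[Any] = None
    rewrite: Optional[Any] = None

    def __post_init__(self) -> None:
        # Assumed rule: exactly one mode must be configured.
        modes_set = sum(value is not None for value in (self.replace, self.rewrite))
        if modes_set != 1:
            raise ValueError("exactly one of 'replace' or 'rewrite' must be set")


def is_valid(**kwargs: Any) -> bool:
    """Return True when the sketch accepts the given mode combination."""
    try:
        ConfigSketch(**kwargs)
    except ValueError:
        return False
    return True


replace_only = is_valid(replace="redact")
both_modes = is_valid(replace="redact", rewrite={"instructions": "keep tone"})
no_mode = is_valid()
```

Under this assumed rule, only the first combination is accepted. Setting both modes or neither raises a ValueError.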
is_remote_input_source(value)
Return True when the input source is an HTTP(S) URL.
Source code in src/anonymizer/config/anonymizer_config.py
def is_remote_input_source(value: str) -> bool:
    """Return True when the input source is an HTTP(S) URL."""
    parsed = urlparse(value)
    return parsed.scheme in {"http", "https"}
has_unsupported_url_scheme(value)
Return True when the input looks like a URL but uses an unsupported scheme.
Source code in src/anonymizer/config/anonymizer_config.py
def has_unsupported_url_scheme(value: str) -> bool:
    """Return True when the input looks like a URL but uses an unsupported scheme."""
    parsed = urlparse(value)
    return "://" in value and bool(parsed.scheme) and parsed.scheme not in {"http", "https"}
infer_input_source_suffix(value)
Infer the lowercase file suffix from a local path or remote URL path.
Source code in src/anonymizer/config/anonymizer_config.py
def infer_input_source_suffix(value: str) -> str:
    """Infer the lowercase file suffix from a local path or remote URL path."""
    if is_remote_input_source(value):
        return Path(urlparse(value).path).suffix.lower()
    return Path(value).suffix.lower()
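To see how the three helpers behave together, the snippet below repeats their bodies from the listings above and adds the standard-library imports that the listings leave out. The sample inputs are made up for illustration.

```python
from pathlib import Path
from urllib.parse import urlparse


def is_remote_input_source(value: str) -> bool:
    """Return True when the input source is an HTTP(S) URL."""
    return urlparse(value).scheme in {"http", "https"}


def has_unsupported_url_scheme(value: str) -> bool:
    """Return True when the input looks like a URL but uses an unsupported scheme."""
    parsed = urlparse(value)
    return "://" in value and bool(parsed.scheme) and parsed.scheme not in {"http", "https"}


def infer_input_source_suffix(value: str) -> str:
    """Infer the lowercase file suffix from a local path or remote URL path."""
    if is_remote_input_source(value):
        # Use only the URL path so query strings do not leak into the suffix.
        return Path(urlparse(value).path).suffix.lower()
    return Path(value).suffix.lower()


# Illustrative inputs (not taken from the library's tests).
samples = [
    "data/input.CSV",
    "https://example.com/files/data.PARQUET?token=abc",
    "s3://bucket/data.parquet",
]
results = {
    value: (
        is_remote_input_source(value),
        has_unsupported_url_scheme(value),
        infer_input_source_suffix(value),
    )
    for value in samples
}
```

The local path is neither remote nor flagged, and its suffix is lowercased to `.csv`. The HTTPS URL is remote and yields `.parquet` despite the query string. The `s3://` URL is flagged as an unsupported scheme.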