# detect
Classes:

| Name | Description |
|---|---|
| `DefaultLLMConfig` | Default settings for the LLM used in column classification. |
| `ColumnClassifier` | Abstract column-type classifier; implementations may use an LLM, VertexAI, or other backends. |
| `ColumnClassifierNoop` | No-op classifier that assigns `UNKNOWN_ENTITY` to every column. |
| `IAPIClassifierConfig` | Configuration for an inference-API-based column classifier. |
| `ColumnClassifierLLM` | Classify column types using an LLM (OpenAI-compatible inference API). |
| `ClassifyConfig` | Configuration for column classification and NER (entities, thresholds, GLiNER, regex). |
| `EntityExtractor` | Abstract extractor of entity/value pairs from free text. |
| `EntityExtractorNoop` | No-op extractor that returns no entities. |
| `EntityReport` | Per-entity stats for one column: count of detections and set of unique values. |
| `EntityExtractorRegexp` | Extract entities using a regex-based NER pipeline. |
| `EntityExtractorGliner` | Extract entities from text using a GLiNER model with chunking and optional batch caching. |
| `EntityExtractorMulti` | Composite extractor that runs multiple extractors and concatenates their results. |
Functions:

| Name | Description |
|---|---|
| `classify_columns` | Classify DataFrame columns to entity types via LLM and return a column-to-entity map. |
| `sample_columns` | Sample up to `num_samples` unique values per non-empty column for classification prompts. |
| `redact_from_entities` | Replace each detected span in `text` with the result of `redact_fn(prediction)`. |
| `traverse_redact` | Yield iterables of text segments and redacted spans for assembly via `chain()`. |
| `find_best` | Return the prediction with the largest span (used when merging overlapping spans). |
| `merge_subsume` | Merge overlapping NER spans into a single prediction per span using `find_best`. |
Attributes:

| Name | Type | Description |
|---|---|---|
| `NerReport` | | Per-column NER report: column name → entity name → `EntityReport`. |

## NerReport

`NerReport = dict[str, dict[str, EntityReport]]`

*module-attribute*

Per-column NER report: column name → entity name → `EntityReport` (counts and values).
## DefaultLLMConfig

Default settings for the LLM used in column classification.

All attributes are class-level. Used by `classify_columns` when calling the
inference API for column-type classification.

Attributes:

| Name | Type | Description |
|---|---|---|
| `CONFIG_ID` | | Model identifier for the LLM, read from an environment variable. |
| `SYSTEM_PROMPT` | | System message describing the column-type annotation task sent to the LLM. |
| `MAX_OUTPUT_TOKENS` | | Maximum number of tokens allowed in the LLM response (default 2048). |
| `TEMPERATURE` | | Sampling temperature for LLM generation (default 0.2). Lower values give more deterministic output. |
## ColumnClassifier

Bases: `ABC`

Abstract column-type classifier; implementations may use an LLM, VertexAI, or other backends.

Methods:

| Name | Description |
|---|---|
| `detect_types` | Classify each column into one of the given entity types. |

### detect_types(df, entities)

*abstractmethod*

Classify each column into one of the given entity types.

Implementations may sample column values and use an LLM, lookup table, or
other backend to assign exactly one entity type per column. Columns that
cannot be classified or are not in `entities` should be mapped to
`UNKNOWN_ENTITY`.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | DataFrame whose columns are to be classified. | *required* |
| `entities` | `Optional[set[str]]` | Set of valid entity type names to assign; may be `None`. | *required* |

Source code in `src/nemo_safe_synthesizer/pii_replacer/data_editor/detect.py`
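To make the contract concrete, here is a minimal sketch of a non-LLM backend. The `ColumnClassifier` base and `UNKNOWN_ENTITY` constant are re-stated locally (hypothetical stand-ins, not imports from the real module), and a plain dict of column names to values stands in for the pandas DataFrame the real method receives:

```python
from abc import ABC, abstractmethod

# Hypothetical stand-ins for names defined in the real module.
UNKNOWN_ENTITY = "unknown"

class ColumnClassifier(ABC):
    @abstractmethod
    def detect_types(self, df, entities):
        """Map every column to exactly one entity type."""

class LookupColumnClassifier(ColumnClassifier):
    """Toy backend: classify columns by exact name lookup."""

    def __init__(self, lookup: dict):
        self._lookup = lookup

    def detect_types(self, df, entities):
        # `df` is treated here as a mapping of column name -> values; a real
        # implementation would sample values from a pandas DataFrame.
        result = {}
        for column in df:
            entity = self._lookup.get(column)
            if entity is None or (entities is not None and entity not in entities):
                entity = UNKNOWN_ENTITY  # unclassifiable columns fall back
            result[column] = entity
        return result

clf = LookupColumnClassifier({"email": "email_address", "ssn": "us_ssn"})
types = clf.detect_types({"email": [], "notes": []}, {"email_address"})
# "email" maps to "email_address"; "notes" falls back to UNKNOWN_ENTITY
```

Note how the fallback applies both to unknown columns and to entity types outside the allowed set, matching the contract above.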
## ColumnClassifierNoop

No-op classifier that assigns `UNKNOWN_ENTITY` to every column.
## IAPIClassifierConfig(endpoint, model_key, job_id, num_samples)

*dataclass*

Configuration for an inference-API-based column classifier.

Attributes:

| Name | Type | Description |
|---|---|---|
| `endpoint` | `str` | Inference endpoint URL. |
| `model_key` | `str` | Model identifier. |
| `job_id` | `str` | Job identifier. |
| `num_samples` | `int` | Number of value samples per column for classification. |
## ColumnClassifierLLM()

Bases: `ColumnClassifier`

Classify column types using an LLM (OpenAI-compatible inference API).

Construct via factory; set `_llm` and `_num_samples` before calling
`detect_types`. `__init__` takes no configuration.

Methods:

| Name | Description |
|---|---|
| `detect_types` | Sample column data and call the inference API to classify columns into entity types. |

Source code in `src/nemo_safe_synthesizer/pii_replacer/data_editor/detect.py`

### detect_types(df, entities)

Sample column data and call the inference API to classify columns into entity types.

Source code in `src/nemo_safe_synthesizer/pii_replacer/data_editor/detect.py`
## ClassifyConfig(valid_entities, ner_threshold, ner_regexps_enabled, ner_entities, gliner_enabled, gliner_batch_mode_enabled, gliner_batch_mode_chunk_length, gliner_batch_mode_batch_size, gliner_model)

*dataclass*

Configuration for column classification and NER (entities, thresholds, GLiNER, regex).

Attributes:

| Name | Type | Description |
|---|---|---|
| `valid_entities` | `set[str]` | Set of valid entity type names for classification. |
| `ner_threshold` | `float` | Score threshold for NER predictions. |
| `ner_regexps_enabled` | `bool` | Whether regex-based NER is enabled. |
| `ner_entities` | `set[str] \| None` | Entity types for NER (or `None` to use the default). |
| `gliner_enabled` | `bool` | Whether the GLiNER model is used. |
| `gliner_batch_mode_enabled` | `bool` | Whether GLiNER batch mode is enabled. |
| `gliner_batch_mode_chunk_length` | `int` | Chunk length for GLiNER batch mode. |
| `gliner_batch_mode_batch_size` | `int` | Batch size for GLiNER batch mode. |
| `gliner_model` | `str` | GLiNER model name or path. |
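The sketch below constructs a hypothetical mirror of `ClassifyConfig` with the documented fields (the real class lives in this module; the field values, including the model id, are illustrative assumptions, not defaults):

```python
from dataclasses import dataclass
from typing import Optional, Set

# Hypothetical mirror of ClassifyConfig's documented fields, for illustration only.
@dataclass
class ClassifyConfig:
    valid_entities: Set[str]
    ner_threshold: float
    ner_regexps_enabled: bool
    ner_entities: Optional[Set[str]]
    gliner_enabled: bool
    gliner_batch_mode_enabled: bool
    gliner_batch_mode_chunk_length: int
    gliner_batch_mode_batch_size: int
    gliner_model: str

cfg = ClassifyConfig(
    valid_entities={"email_address", "phone_number"},
    ner_threshold=0.5,               # drop predictions scoring below this
    ner_regexps_enabled=True,        # also run the regex pipeline
    ner_entities=None,               # None -> use the default entity set
    gliner_enabled=True,
    gliner_batch_mode_enabled=False,
    gliner_batch_mode_chunk_length=512,
    gliner_batch_mode_batch_size=8,
    gliner_model="example/gliner-model",  # placeholder model id
)
```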
## EntityExtractor()

Bases: `ABC`

Abstract extractor of entity/value pairs from free text.

Attributes:

| Name | Type | Description |
|---|---|---|
| `column_report` | `NerReport` | Per-column NER report (entity counts and values). |
| `current_column` | `str` | Name of the column currently being processed. |

Methods:

| Name | Description |
|---|---|
| `extract_entity_values` | Return a list of dicts with `entity` and `value` keys for each detection. |
| `extract_ner_predictions` | Return NER predictions with spans and labels for dedup/merge across extractors. |
| `extract_and_replace_entities` | Run NER, merge/dedupe predictions, update `column_report`, and replace spans with `redact_fn`. |

Source code in `src/nemo_safe_synthesizer/pii_replacer/data_editor/detect.py`

### extract_entity_values(text, entities)

*abstractmethod*

Return a list of dicts with `entity` and `value` keys for each detection.

### extract_ner_predictions(text, entities)

*abstractmethod*

Return NER predictions with spans and labels for dedup/merge across extractors.

### extract_and_replace_entities(redact_fn, text, entities=None)

Run NER, merge/dedupe predictions, update `column_report`, and replace spans with `redact_fn`.

Source code in `src/nemo_safe_synthesizer/pii_replacer/data_editor/detect.py`
## EntityExtractorNoop()

Bases: `EntityExtractor`

No-op extractor that returns no entities.

Source code in `src/nemo_safe_synthesizer/pii_replacer/data_editor/detect.py`
## EntityReport(count, values)

*dataclass*

Per-entity stats for one column: count of detections and set of unique values.
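A minimal sketch of how detections might accumulate into the `NerReport` structure (the `EntityReport` mirror and the `record` helper are hypothetical illustrations, not part of the module's API):

```python
from dataclasses import dataclass, field

# Hypothetical mirror of EntityReport and the NerReport alias.
@dataclass
class EntityReport:
    count: int = 0
    values: set = field(default_factory=set)

NerReport = dict  # column name -> entity name -> EntityReport

def record(report, column: str, entity: str, value: str) -> None:
    """Accumulate one detection into the per-column report."""
    entry = report.setdefault(column, {}).setdefault(entity, EntityReport())
    entry.count += 1          # every detection increments the count
    entry.values.add(value)   # but values are kept unique

report = {}
record(report, "notes", "email_address", "a@example.com")
record(report, "notes", "email_address", "a@example.com")  # duplicate value
# count is 2, while values holds a single unique string
```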
## EntityExtractorRegexp()

Bases: `EntityExtractor`

Extract entities using a regex-based NER pipeline.

Methods:

| Name | Description |
|---|---|
| `pipeline_from_entities` | Build a pipeline factory for the given entity set (or `_entity_types` if empty). |
| `get_entity_extractor` | Return a regex extractor with entity types from `clsfy_cfg` (or `DEFAULT_ENTITIES`). |

Source code in `src/nemo_safe_synthesizer/pii_replacer/data_editor/detect.py`

### pipeline_from_entities(entities)

Build a pipeline factory for the given entity set (or `_entity_types` if empty).

Source code in `src/nemo_safe_synthesizer/pii_replacer/data_editor/detect.py`

### get_entity_extractor(clsfy_cfg)

*classmethod*

Return a regex extractor with entity types from `clsfy_cfg` (or `DEFAULT_ENTITIES`).

Source code in `src/nemo_safe_synthesizer/pii_replacer/data_editor/detect.py`
## EntityExtractorGliner()

Bases: `EntityExtractor`

Extract entities from text using a GLiNER model with chunking and optional batch caching.

Use `get_entity_extractor` to construct; config comes from `ClassifyConfig`.

Methods:

| Name | Description |
|---|---|
| `get_entity_extractor` | Load the GLiNER model and return an extractor configured from `clsfy_cfg`. |

Source code in `src/nemo_safe_synthesizer/pii_replacer/data_editor/detect.py`

### get_entity_extractor(clsfy_cfg)

*classmethod*

Load the GLiNER model and return an extractor configured from `clsfy_cfg`.

Source code in `src/nemo_safe_synthesizer/pii_replacer/data_editor/detect.py`
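The batch-mode options (`gliner_batch_mode_chunk_length`, `gliner_batch_mode_batch_size`) suggest text is split into fixed-size chunks that are predicted in batches. A hypothetical analogue of that preprocessing (the real implementation may overlap chunks or split on token boundaries to avoid cutting entities):

```python
def chunk_text(text: str, chunk_length: int) -> list:
    """Split text into fixed-size character chunks."""
    return [text[i : i + chunk_length] for i in range(0, len(text), chunk_length)]

def batched(chunks: list, batch_size: int) -> list:
    """Group chunks into batches, one model prediction call per batch."""
    return [chunks[i : i + batch_size] for i in range(0, len(chunks), batch_size)]

chunks = chunk_text("x" * 1100, 512)   # three chunks: 512, 512, and 76 chars
batches = batched(chunks, 2)           # two batches: [c0, c1] and [c2]
```

Naive fixed-size chunking can split an entity across a chunk boundary, which is one reason a real pipeline may cache batch results and merge predictions afterwards.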
## EntityExtractorMulti()

Bases: `EntityExtractor`

Composite extractor that runs multiple extractors and concatenates their results.

Methods:

| Name | Description |
|---|---|
| `extract_entity_values` | Return combined entity/value dicts from all sub-extractors. |
| `extract_ner_predictions` | Return merged NER predictions from all sub-extractors. |
| `get_entity_extractor` | Return an empty composite; add extractors with `add_entity_extractor`. |
| `add_entity_extractor` | Append an extractor to the composite. |

Source code in `src/nemo_safe_synthesizer/pii_replacer/data_editor/detect.py`

### extract_entity_values(text, entities=None)

Return combined entity/value dicts from all sub-extractors.

Source code in `src/nemo_safe_synthesizer/pii_replacer/data_editor/detect.py`

### extract_ner_predictions(text, entities=None)

Return merged NER predictions from all sub-extractors.

Source code in `src/nemo_safe_synthesizer/pii_replacer/data_editor/detect.py`

### get_entity_extractor(clsfy_cfg)

*classmethod*

Return an empty composite; add extractors with `add_entity_extractor`.

Source code in `src/nemo_safe_synthesizer/pii_replacer/data_editor/detect.py`

### add_entity_extractor(extractor)

Append an extractor to the composite.
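The composite pattern described above can be sketched as follows; `MultiExtractor` and `StubExtractor` are hypothetical stand-ins for `EntityExtractorMulti` and its sub-extractors:

```python
class MultiExtractor:
    """Hypothetical composite: concatenates predictions from sub-extractors."""

    def __init__(self):
        self._extractors = []

    def add_entity_extractor(self, extractor):
        self._extractors.append(extractor)

    def extract_ner_predictions(self, text, entities=None):
        predictions = []
        for extractor in self._extractors:
            # each sub-extractor contributes its own predictions
            predictions.extend(extractor.extract_ner_predictions(text, entities))
        return predictions

class StubExtractor:
    """Returns canned predictions, standing in for regex/GLiNER extractors."""

    def __init__(self, preds):
        self._preds = preds

    def extract_ner_predictions(self, text, entities=None):
        return list(self._preds)

multi = MultiExtractor()
multi.add_entity_extractor(StubExtractor([{"label": "email", "start": 0, "end": 5}]))
multi.add_entity_extractor(StubExtractor([{"label": "phone", "start": 10, "end": 14}]))
merged = multi.extract_ner_predictions("some text")
# merged holds both stub predictions, in insertion order
```

Concatenation alone can yield overlapping spans from different backends, which is why the merged output is then deduplicated (see `merge_subsume` below).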
## classify_columns(df, entities, num_samples, client, on_validation_error, logger)

Classify DataFrame columns to entity types via LLM and return a column-to-entity map.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | DataFrame to classify. | *required* |
| `entities` | `set[str]` | Set of valid entity type names. | *required* |
| `num_samples` | `Optional[int]` | Number of value samples per column for the prompt. | *required* |
| `client` | `Optional[OpenAI]` | OpenAI client for chat completions. | *required* |
| `on_validation_error` | `Callable[[], None]` | Callback invoked when LLM output is invalid JSON. | *required* |
| `logger` | `Logger` | Logger for timing and context. | *required* |

Returns:

| Type | Description |
|---|---|
| `dict[str, Optional[str]]` | Map of column name to entity type (or `None`). |

Source code in `src/nemo_safe_synthesizer/pii_replacer/data_editor/detect.py`
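A minimal sketch of the flow, under stated assumptions: the real function prompts an OpenAI-compatible client with sampled values and parses JSON back; here a stub `complete` callable stands in for the client, and a plain list/dict stands in for the DataFrame:

```python
import json

def classify_columns_sketch(columns, samples, entities, complete, on_validation_error):
    """Hypothetical analogue of classify_columns with a pluggable completion fn."""
    prompt = json.dumps({col: samples[col] for col in columns})
    raw = complete(prompt)  # stands in for client.chat.completions.create(...)
    try:
        mapping = json.loads(raw)
    except json.JSONDecodeError:
        on_validation_error()           # invalid JSON triggers the callback
        return {col: None for col in columns}
    # keep only labels from the valid entity set; everything else maps to None
    return {
        col: (mapping.get(col) if mapping.get(col) in entities else None)
        for col in columns
    }

def fake_complete(prompt: str) -> str:
    # canned LLM response: one valid label, one label outside the entity set
    return json.dumps({"email": "email_address", "notes": "novel_entity"})

result = classify_columns_sketch(
    ["email", "notes"],
    {"email": ["a@b.com"], "notes": ["hello"]},
    {"email_address"},
    fake_complete,
    on_validation_error=lambda: None,
)
# result: {"email": "email_address", "notes": None}
```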
## sample_columns(df, num_samples, random_state=None)

Sample up to `num_samples` unique values per non-empty column for classification prompts.

Source code in `src/nemo_safe_synthesizer/pii_replacer/data_editor/detect.py`
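A hypothetical analogue of the sampling step, using a plain dict of lists in place of the pandas DataFrame the real function takes (and ignoring `random_state`, which the real function uses for reproducible sampling):

```python
def sample_columns_sketch(columns: dict, num_samples: int) -> dict:
    """Up to `num_samples` unique values per column; empty columns are skipped."""
    sampled = {}
    for name, values in columns.items():
        # dict.fromkeys deduplicates while preserving first-seen order
        unique = list(dict.fromkeys(v for v in values if v is not None))
        if unique:  # non-empty columns only
            sampled[name] = unique[:num_samples]
    return sampled

out = sample_columns_sketch(
    {"email": ["a@b.com", "a@b.com", "c@d.com", "e@f.com"], "empty": []},
    num_samples=2,
)
# out == {"email": ["a@b.com", "c@d.com"]}
```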
## redact_from_entities(text, detected, redact_fn)

Replace each detected span in `text` with the result of `redact_fn(prediction)`.

Source code in `src/nemo_safe_synthesizer/pii_replacer/data_editor/detect.py`
## traverse_redact(text, entities, redact_fn)

Yield iterables of text segments and redacted spans for assembly via `chain()`.

Entities must be sorted by span; yields alternating slices of `text` and
`redact_fn(entity)` so that `chain(*traverse_redact(...))` gives the full string.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Source text. | *required* |
| `entities` | `list[NERPrediction]` | NER predictions with span information, sorted by span. | *required* |
| `redact_fn` | `RedactFn` | Function mapping each prediction to its replacement string. | *required* |

Yields:

| Type | Description |
|---|---|
| `Iterable[str]` | Iterables of strings (text slices and redaction results). |

Source code in `src/nemo_safe_synthesizer/pii_replacer/data_editor/detect.py`
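A minimal sketch of the documented behavior. It assumes predictions carry `start`/`end` keys and are sorted and non-overlapping; unlike the real generator, which yields iterables of segments for `chain(*...)`, this sketch yields strings directly:

```python
from itertools import chain

def traverse_redact_sketch(text, entities, redact_fn):
    """Yield alternating untouched text slices and redaction results."""
    cursor = 0
    for entity in entities:
        yield text[cursor:entity["start"]]   # text before the detected span
        yield redact_fn(entity)              # replacement for the span
        cursor = entity["end"]
    yield text[cursor:]                      # trailing text after the last span

text = "mail a@b.com or call 555-0100"
entities = [
    {"start": 5, "end": 12, "label": "email"},
    {"start": 21, "end": 29, "label": "phone"},
]
redacted = "".join(
    chain(traverse_redact_sketch(text, entities, lambda e: f"<{e['label']}>"))
)
# redacted == "mail <email> or call <phone>"
```

Yielding segments instead of building one string lets the caller assemble the result lazily with a single `join`, avoiding repeated string concatenation.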
## find_best(entities)

Return the prediction with the largest span (used when merging overlapping spans).

Source code in `src/nemo_safe_synthesizer/pii_replacer/data_editor/detect.py`

## merge_subsume(entities)

Merge overlapping NER spans into a single prediction per span using `find_best`.
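The pair of helpers can be sketched as follows; this is a hypothetical analogue under the assumption that predictions carry `start`/`end` keys, and the real grouping logic may differ in detail:

```python
def find_best_sketch(entities):
    """Prediction with the largest span (end - start)."""
    return max(entities, key=lambda e: e["end"] - e["start"])

def merge_subsume_sketch(entities):
    """Group overlapping spans and keep one prediction per group."""
    merged, group, group_end = [], [], None
    for ent in sorted(entities, key=lambda e: e["start"]):
        if group and ent["start"] >= group_end:
            # no overlap with the current group: close it out
            merged.append(find_best_sketch(group))
            group, group_end = [], None
        group.append(ent)
        group_end = ent["end"] if group_end is None else max(group_end, ent["end"])
    if group:
        merged.append(find_best_sketch(group))
    return merged

preds = [
    {"start": 0, "end": 5, "label": "name"},
    {"start": 2, "end": 12, "label": "full_name"},  # overlaps and subsumes the first
    {"start": 20, "end": 24, "label": "date"},
]
kept = merge_subsume_sketch(preds)
# kept retains "full_name" (largest overlapping span) and "date"
```

Keeping the largest span means a broader detection like `full_name` wins over a narrower overlapping one like `name`, so each region of text is redacted exactly once.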