
nemo_pii

Classes:

Name Description
ColumnClassification

Classification and detected-entity info for a column prior to transform.

NemoPII

PII replacement over DataFrames via classification, NER, and configurable transforms.

Functions:

Name Description
classify_config_from_params

Build classification and NER config from PII replacer config.

build_entity_extractor

Build a composite entity extractor from classification config.

get_column_classifier

Return a column classifier backed by the NSS inference endpoint (NSS_INFERENCE_ENDPOINT, NSS_INFERENCE_KEY).

Attributes:

Name Type Description
ACCOUNTING_FUNCTIONS

Transform function names tracked for report accounting (which functions were used per column).

ACCOUNTING_FUNCTIONS = ['re', 'fake', 'random', 'hash', 'normalize', 'partial_mask', 'tld', 'date_shift', 'date_time_shift', 'date_format', 'date_time_format', 'detect_entities', 'redact_entities', 'label_entities', 'hash_entities', 'fake_entities', 'drop'] module-attribute

Transform function names tracked for report accounting (which functions were used per column).

ColumnClassification pydantic-model

Bases: BaseModel

Classification and detected-entity info for a column prior to transform.

When entity is None (e.g. unclassified), entity_count is None and entity_values is an empty list.

Fields:

field_name pydantic-field

Name of the field/column.

column_type pydantic-field

Detected column type (e.g. text, numeric).

entity pydantic-field

Detected entity type (e.g. email, phone), or None if none.

entity_count = None pydantic-field

Number of non-empty values in this field. None if no entity detected.

entity_values pydantic-field

Unique values for this field. Empty if no entity detected.

NemoPII(config=None)

Bases: object

PII replacement over DataFrames via classification, NER, and configurable transforms.

Call classify_df to get column classifications, then transform_df to replace PII. After transform_df, the transformed DataFrame and per-column statistics are available on result.

Parameters:

Name Type Description Default
config PiiReplacerConfig | None

PII replacer config. If None, default config is used.

None

Attributes:

Name Type Description
result TransformResult

Result of the last transform_df (TransformResult with transformed_df and column_statistics).

Example

nemo_pii = NemoPII()
nemo_pii.transform_df(df)
result = nemo_pii.result
print(result.transformed_df)
print(result.column_statistics)

Methods:

Name Description
classify_df

Classify each column (type and entity) using config and optional LLM classifier.

transform_df

Replace PII in the DataFrame and set self.result.

Source code in src/nemo_safe_synthesizer/pii_replacer/nemo_pii.py
def __init__(self, config: PiiReplacerConfig | None = None):
    if config:
        self.pii_replacer_config = config
    else:
        self.pii_replacer_config = PiiReplacerConfig.get_default_config()

    self.classify_config = classify_config_from_params(self.pii_replacer_config)

    # TODO: clean up to use pydantic model directly or something typed
    # internally, for now just convert to dict to match existing code.
    self.data_editor_config = self.pii_replacer_config.model_dump()

    self.entity_extractor = build_entity_extractor(self.classify_config)
    self.editor = Editor(self.data_editor_config, self.entity_extractor)
    self.elapsed_time = 0.0

classify_df(df)

Classify each column (type and entity) using config and optional LLM classifier.

Parameters:

Name Type Description Default
df DataFrame

DataFrame to classify.

required

Returns:

Type Description
list[ColumnClassification]

List of ColumnClassification, one per column, with field name, column type, entity, entity count, and unique entity values.

Source code in src/nemo_safe_synthesizer/pii_replacer/nemo_pii.py
def classify_df(self, df: pd.DataFrame) -> list[ColumnClassification]:
    """Classify each column (type and entity) using config and optional LLM classifier.

    Args:
        df: DataFrame to classify.

    Returns:
        List of ``ColumnClassification``, one per column, with field name, column type,
        entity, entity count, and unique entity values.
    """
    # Pre-initialize with defaults
    entities = {}
    columns = {item: None for item in df.columns}

    try:
        # Only attempt classification if enabled
        if self.pii_replacer_config.globals.classify.enable_classify is not False:
            column_classifier = None

            # Try to initialize the column classifier
            try:
                column_classifier = get_column_classifier()
            except Exception as exc:
                logging.error(
                    "Could not initialize column classifier, PII replacement will run in degraded mode. NER falling back to default entities. No replacement is done except for text columns. %s",
                    _column_classify_failure_remediation(exc),
                    exc_info=_inference_key_configured(),
                )

            # Try to perform classification if we successfully got a classifier
            if column_classifier is not None:
                try:
                    columns = column_classifier.detect_types(df, self.classify_config.valid_entities)

                    entities = {
                        name: (
                            entity
                            if entity != UNKNOWN_ENTITY and entity in self.classify_config.valid_entities
                            else None
                        )
                        for (name, entity) in columns.items()
                    }
                except Exception as exc:
                    logging.error(
                        "Column classification failed, PII replacement will run in degraded mode. NER falling back to default entities. No replacement is done except for text columns. %s",
                        _column_classify_failure_remediation(exc),
                        exc_info=_inference_key_configured(),
                    )
        else:
            logging.info("Column classification is disabled (enable_classify=False), skipping classify call.")
    finally:
        # Use field type detection to identify text columns if not already
        # assigned an entity. These text columns are where NER is used if
        # enabled during transform_df.
        field_results = []
        fields = [describe_field(field_name, df[field_name]) for field_name in df.columns]
        for field in fields:
            entity_count = field.count
            entity_values = field.unique_values_list
            entity = entities.get(field.name, None)
            existing_type = columns.get(field.name, None)
            # Determine column type
            is_text_without_type = (
                existing_type is None or existing_type.lower() == "none"
            ) and field.type == FieldType.TEXT
            column_type = "text" if is_text_without_type else existing_type

            # Columns with no detected entity are not expected to have
            # entity_count or entity_values.
            if entity is None:
                entity_count = None
                entity_values = []
            field_results.append(
                ColumnClassification(
                    field_name=field.name,
                    column_type=column_type,
                    entity=entity,  # currently this is None for text fields.
                    entity_count=entity_count,
                    entity_values=entity_values,
                )
            )
    return field_results
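The filtering step inside classify_df keeps a detected entity only when it is both known and allowed by the config; everything else becomes None. A self-contained sketch of that comprehension, with a placeholder sentinel (the real UNKNOWN_ENTITY constant lives in the library):

```python
UNKNOWN_ENTITY = "unknown"  # assumed sentinel value; the real constant comes from the library

valid_entities = {"email", "phone"}
# Raw classifier output: column name -> detected entity label
columns = {"user_email": "email", "zip": "unknown", "nickname": "alias"}

# Keep a detected entity only if it is known and allowed by the config;
# everything else is treated as unclassified (None).
entities = {
    name: (entity if entity != UNKNOWN_ENTITY and entity in valid_entities else None)
    for name, entity in columns.items()
}
# entities == {"user_email": "email", "zip": None, "nickname": None}
```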

transform_df(df, classifications=None)

Replace PII in the DataFrame and set self.result.

Parameters:

Name Type Description Default
df DataFrame

DataFrame to transform.

required
classifications list[ColumnClassification] | None

Optional precomputed classifications. If None, classify_df is run first.

None
Source code in src/nemo_safe_synthesizer/pii_replacer/nemo_pii.py
def transform_df(self, df: pd.DataFrame, classifications: list[ColumnClassification] | None = None) -> None:
    """Replace PII in the DataFrame and set ``self.result``.

    Args:
        df: DataFrame to transform.
        classifications: Optional precomputed classifications. If ``None``,
            ``classify_df`` is run first.
    """
    pii_replacer_start = time.monotonic()
    try:
        if not classifications:
            classifications = self.classify_df(df)

        # Convert classification result to entities and column types dicts for editor
        column_types_dict = {field.field_name: field.column_type for field in classifications}
        entities_dict = {field.field_name: field.entity for field in classifications}
        transform_fn_accounting = TransformFnAccounting(ACCOUNTING_FUNCTIONS)

        transformed_df = self.editor.process_df(
            df, entities_dict, column_types_dict, fnreport=transform_fn_accounting
        )
        self.result = TransformResult(
            transformed_df=transformed_df,
            column_statistics=_build_column_statistics(
                classifications, transform_fn_accounting, self.entity_extractor.column_report
            ),
        )
    except Exception as e:
        logging.exception("Error transforming dataframe")
        raise

    finally:
        self.elapsed_time = time.monotonic() - pii_replacer_start

classify_config_from_params(config)

Build classification and NER config from PII replacer config.

Parameters:

Name Type Description Default
config PiiReplacerConfig

PII replacer config containing globals for classify and NER.

required

Returns:

Type Description
ClassifyConfig

ClassifyConfig with valid entities, NER settings, and GLiNER options.

Source code in src/nemo_safe_synthesizer/pii_replacer/nemo_pii.py
def classify_config_from_params(
    config: PiiReplacerConfig,
) -> ClassifyConfig:
    """Build classification and NER config from PII replacer config.

    Args:
        config: PII replacer config containing globals for classify and NER.

    Returns:
        ``ClassifyConfig`` with valid entities, NER settings, and GLiNER options.
    """
    valid_entities = DEFAULT_ENTITIES

    if config.globals.classify.entities is not None:
        valid_entities = set(config.globals.classify.entities)

    ner_entities = valid_entities
    if config.globals.ner.ner_entities is not None:
        ner_entities = set(config.globals.ner.ner_entities)

    cc = ClassifyConfig(
        valid_entities=valid_entities,
        ner_threshold=config.globals.ner.ner_threshold,
        ner_regexps_enabled=config.globals.ner.enable_regexps,
        ner_entities=ner_entities,
        gliner_enabled=config.globals.ner.gliner.enable_gliner,
        gliner_batch_mode_enabled=config.globals.ner.gliner.enable_batch_mode,
        gliner_batch_mode_chunk_length=config.globals.ner.gliner.chunk_length,
        gliner_batch_mode_batch_size=config.globals.ner.gliner.batch_size,
        gliner_model=config.globals.ner.gliner.gliner_model,
    )

    return cc

build_entity_extractor(clsfy_cfg)

Build a composite entity extractor from classification config.

Source code in src/nemo_safe_synthesizer/pii_replacer/nemo_pii.py
def build_entity_extractor(clsfy_cfg: ClassifyConfig) -> EntityExtractor:
    """Build a composite entity extractor from classification config."""
    entity_extractor = EntityExtractorMulti.get_entity_extractor(clsfy_cfg)
    if clsfy_cfg.gliner_enabled:
        entity_extractor.add_entity_extractor(EntityExtractorGliner.get_entity_extractor(clsfy_cfg))
    if clsfy_cfg.ner_regexps_enabled:
        entity_extractor.add_entity_extractor(EntityExtractorRegexp.get_entity_extractor(clsfy_cfg))
    return entity_extractor
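build_entity_extractor follows a composite pattern: a multi-extractor fans each input out to its registered children (GLiNER, regexps) and merges the results. A toy, self-contained sketch of that shape (the class and method names here are illustrative, except add_entity_extractor, which mirrors the documented API):

```python
# Toy composite: fans extraction out to registered child extractors.
class MultiExtractor:
    def __init__(self):
        self.children = []

    def add_entity_extractor(self, extractor):
        self.children.append(extractor)

    def extract(self, text):
        found = []
        for child in self.children:
            found.extend(child.extract(text))
        return found

# Trivial child extractor: reports a label when a keyword is present.
class KeywordExtractor:
    def __init__(self, label, keyword):
        self.label, self.keyword = label, keyword

    def extract(self, text):
        return [self.label] if self.keyword in text else []

multi = MultiExtractor()
multi.add_entity_extractor(KeywordExtractor("email", "@"))
multi.add_entity_extractor(KeywordExtractor("phone", "555"))
# multi.extract("call 555-0100") -> ["phone"]
```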

get_column_classifier()

Return a column classifier backed by the NSS inference endpoint (NSS_INFERENCE_ENDPOINT, NSS_INFERENCE_KEY).

Source code in src/nemo_safe_synthesizer/pii_replacer/nemo_pii.py
def get_column_classifier() -> ColumnClassifierLLM:
    """Return a column classifier backed by the NSS inference endpoint (``NSS_INFERENCE_ENDPOINT``, ``NSS_INFERENCE_KEY``)."""
    classifier = ColumnClassifierLLM()
    classifier._num_samples = 5

    endpoint = _get_classify_endpoint_url()

    # When using Inference Gateway, no API key is needed (gateway handles auth).
    # For legacy direct endpoint, NSS_INFERENCE_KEY can be provided.
    api_key = os.environ.get("NSS_INFERENCE_KEY", "not-needed")

    classifier._llm = OpenAI(api_key=api_key, base_url=endpoint)
    return classifier
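The API-key handling above can be isolated into a small sketch: when NSS_INFERENCE_KEY is unset (e.g. behind the Inference Gateway, which handles auth), a placeholder key is used. resolve_api_key is a hypothetical helper name for illustration:

```python
import os

def resolve_api_key() -> str:
    # Mirrors get_column_classifier: fall back to a placeholder key
    # when NSS_INFERENCE_KEY is not set (the gateway handles auth).
    return os.environ.get("NSS_INFERENCE_KEY", "not-needed")

os.environ.pop("NSS_INFERENCE_KEY", None)
assert resolve_api_key() == "not-needed"
os.environ["NSS_INFERENCE_KEY"] = "secret"
assert resolve_api_key() == "secret"
```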