pii_replay

Classes:

Name: Description

PIIReplayData: Per-column PII data listed in the PII Replay section of the SQS report.

PIIReplay: PII Replay metric -- counts PII values from the reference data appearing in the output.

PIIReplayData pydantic-model

Bases: BaseModel

Per-column PII data listed in the PII Replay section of the SQS report.

Fields:

column_name pydantic-field

The name of the column with PII data.

column_assigned_type pydantic-field

The assigned type for the column (text, unique identifier, date, email, etc.).

pii_type = UNKNOWN_ENTITY pydantic-field

Type of the PII data in the column. For non-text fields, this is the same as column_assigned_type. For text fields, it is the PII entity detected within the text (race, SSN, address, etc.).

total_ref_data = 0 pydantic-field

Total rows in the reference data that contain PII values.

unique_ref_data = 0 pydantic-field

Count of distinct PII values for this entity in the reference column.

total_synth_data = 0 pydantic-field

Number of output rows whose column value matches a reference PII value.

unique_synth_data = 0 pydantic-field

Count of distinct reference PII values that appear in the output column.

unique_synth_data_percentage = 0 pydantic-field

Percentage of distinct reference PII values replayed in the output (unique_synth_data / unique_ref_data * 100, rounded up to the nearest integer).
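As a rough illustration of how the percentage relates to the two unique counts, the sketch below recomputes it with the standard library only. The helper name `replay_percentage` is hypothetical and not part of the library. Rounding up with `math.ceil` follows the source listing on this page, while the zero guard is an assumption added for this sketch.

```python
import math


def replay_percentage(unique_synth_data: int, unique_ref_data: int) -> int:
    """Hypothetical helper: share of distinct reference PII values replayed, in percent."""
    if unique_ref_data == 0:
        # Assumption for this sketch: report 0 when the reference has no PII values.
        return 0
    # Round up so that even a single replayed value never shows as 0%.
    return math.ceil(unique_synth_data / unique_ref_data * 100)


# 3 of 40 distinct reference values replayed: 7.5%, which rounds up to 8.
print(replay_percentage(3, 40))
```

Rounding up rather than to the nearest integer means a small but nonzero amount of replay stays visible in the report.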

PIIReplay pydantic-model

Bases: Component

PII Replay metric -- counts PII values from the reference data appearing in the output.

For each classified PII entity, reports total and unique replay counts. This component does not produce a numeric score; it surfaces PII leakage details for the HTML report.

Fields:

reference_total_records = 0 pydantic-field

Total rows in the reference data.

output_total_records = 0 pydantic-field

Total rows in the output data.

pii_replay_data = list() pydantic-field

Per-column / per-entity replay statistics.
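To make the difference between the total and unique replay counts concrete, here is a minimal sketch in plain Python. It is not the library's implementation, which operates on pandas columns. The function name `count_replays` and the sample values are hypothetical.

```python
def count_replays(reference_values, output_values):
    """Return (total, unique) counts of output values that also occur in the reference."""
    reference_set = set(reference_values)
    # Keep every output value that matches some reference value, duplicates included.
    replayed = [value for value in output_values if value in reference_set]
    return len(replayed), len(set(replayed))


reference = ["alice@example.com", "bob@example.com", "carol@example.com"]
output = ["alice@example.com", "dave@example.com", "alice@example.com", "bob@example.com"]

total, unique = count_replays(reference, output)
print(total, unique)
```

In this example three output rows match a reference value, but only two distinct reference values are replayed, which corresponds to total_synth_data and unique_synth_data respectively.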

jinja_context cached property

Template context with PII replay statistics and entity type list.

from_evaluation_dataset(evaluation_dataset, config=None) staticmethod

Compute PII replay counts from classified entity metadata.

Source code in src/nemo_safe_synthesizer/evaluation/components/pii_replay.py
@staticmethod
def from_evaluation_dataset(evaluation_dataset, config: SafeSynthesizerParameters | None = None) -> PIIReplay:
    """Compute PII replay counts from classified entity metadata."""
    if evaluation_dataset.column_statistics is None or len(evaluation_dataset.column_statistics) == 0:
        logger.warning("No classified entities, skipping PII Replay.")
        return PIIReplay(score=EvaluationScore())

    pii_replay_entities = config.get("pii_replay_entities") if config else None
    pii_replay_columns = config.get("pii_replay_columns") if config else None

    # Build up a list of (col_name, entity_name, assigned_type) tuples
    classified_entities = []
    for col, column_statistics in evaluation_dataset.column_statistics.items():
        entity_names = column_statistics.detected_entity_counts.keys()
        entity_assigned_type = column_statistics.assigned_type
        # Scope down to user supplied set of entities if there is one
        if pii_replay_entities:
            entity_names = set(entity_names).intersection(set(pii_replay_entities))
        classified_entities += [(col, entity_name, entity_assigned_type) for entity_name in entity_names]
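    # Add any user-specified columns that have no column statistics. This runs once,
    # after the loop, so each column is added a single time.
    if pii_replay_columns:
        for user_specified_col in set(pii_replay_columns).difference(
            set(evaluation_dataset.column_statistics.keys())
        ):
            # Three-element tuple to match the unpacking below; no assigned type is known
            # for these columns, so None is used here as an assumption.
            classified_entities.append((user_specified_col, UNKNOWN_ENTITY, None))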

    pii_replay_data = []
    for col, entity_name, entity_assigned_type in classified_entities:
        # UNKNOWN_ENTITY case, use the entire column
        ref_entity_count = len(evaluation_dataset.reference[col])
        ref_entity_unique_values = evaluation_dataset.reference[col].unique()

        if entity_name != UNKNOWN_ENTITY:
            # Ideal case, use the count of detected entities in ref for that (col, entity_type).
            # Also get the set of unique values tagged with that entity type.
            ref_entity_count = evaluation_dataset.column_statistics[col].detected_entity_counts[entity_name]
            ref_entity_unique_values = evaluation_dataset.column_statistics[col].detected_entity_values[entity_name]

        # These are the same in both cases. We want the size of that set of ref[col] unique values.
        ref_entity_unique_count = len(ref_entity_unique_values)
        # We want the total number of rows in output[column] that contain some value from the ref[col] unique values.
        output_entity_values = (
            evaluation_dataset.output[col].to_frame().query(f"`{col}` in @ref_entity_unique_values")[col]
        )
        output_entity_count = len(output_entity_values)
        # With those query results, we also want to get the count of unique items in those filtered output results.
        output_entity_unique_count = len(output_entity_values.unique())

        pii_replay_data.append(
            PIIReplayData(
                column_name=col,
                column_assigned_type=entity_assigned_type,
                pii_type=entity_name,
                total_ref_data=ref_entity_count,
                unique_ref_data=ref_entity_unique_count,
                total_synth_data=output_entity_count,
                unique_synth_data=output_entity_unique_count,
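                # Guard against an empty reference set to avoid dividing by zero.
                unique_synth_data_percentage=(
                    math.ceil(output_entity_unique_count / ref_entity_unique_count * 100)
                    if ref_entity_unique_count
                    else 0
                ),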
            )
        )

    return PIIReplay(
        score=EvaluationScore(),
        reference_total_records=evaluation_dataset.reference.shape[0],
        output_total_records=evaluation_dataset.output.shape[0],
        pii_replay_data=pii_replay_data,
    )
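The key-building step can be sketched in isolation with plain dictionaries, which may make the scoping rules easier to follow. Everything here is hypothetical: the function `build_entity_keys`, its parameters, the `UNKNOWN` stand-in (the real value of UNKNOWN_ENTITY is not shown on this page), and the choice of `None` as the assigned type for columns that have no statistics.

```python
UNKNOWN = "unknown"  # Stand-in for UNKNOWN_ENTITY; the real value is an assumption.


def build_entity_keys(detected, assigned_types, entities=None, extra_columns=None):
    """Build (column, entity, assigned_type) keys, optionally scoped by the caller."""
    keys = []
    for col, names in detected.items():
        # Scope down to the caller-supplied entities, if any were given.
        if entities:
            names = [name for name in names if name in entities]
        keys += [(col, name, assigned_types[col]) for name in names]
    # Add requested columns that have no detected entities, once each.
    for col in sorted(set(extra_columns or []) - set(detected)):
        keys.append((col, UNKNOWN, None))
    return keys


detected = {"notes": ["email", "ssn"], "contact": ["email"]}
assigned_types = {"notes": "text", "contact": "email"}
keys = build_entity_keys(
    detected, assigned_types, entities={"email"}, extra_columns=["contact", "customer_id"]
)
print(keys)
```

Here the `ssn` entity is filtered out by the entity scope, and `customer_id` is added as a whole-column check because it has no detected entities, while `contact` is not duplicated.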