privacy_metric_utils

`privacy_metric_utils`

Functions:

Name	Description
`find_text_fields`	Identify columns in `df` whose content is free-form text.
`divide_tabular_text`	Split `df` into a tabular-only and a text-only DataFrame.
`embed_text`	Embed every text column in `df` and return a single averaged embedding per row.

`find_text_fields(df)`

Identify columns in `df` whose content is free-form text.

Each column is passed through `describe_field`; those classified as `"text"` are returned.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	DataFrame whose columns are inspected.	required

Returns:

Type	Description
`list[str]`	Column names classified as free-form text.

Source code in src/nemo_safe_synthesizer/evaluation/components/privacy_metric_utils.py

def find_text_fields(df: pd.DataFrame) -> list[str]:
    """Identify columns in ``df`` whose content is free-form text.

    Each column is passed through ``describe_field``; those classified
    as ``"text"`` are returned.

    Args:
        df: DataFrame whose columns are inspected.

    Returns:
        Column names classified as free-form text.
    """
    text_fields: list[str] = []
    for col in df.columns:
        field_info = describe_field(col, df[col])
        if field_info.type.value == "text":
            text_fields.append(col)
    return text_fields
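
The classify-then-collect loop above can be sketched without the library. In this minimal sketch, `toy_field_type` is a hypothetical stand-in for `describe_field`; its token-count heuristic is an assumption made for illustration and is not the library's classification logic.

```python
import pandas as pd


def toy_field_type(series: pd.Series) -> str:
    """Hypothetical stand-in for describe_field (NOT the library's logic).

    Labels a column "text" when every non-null value is a string and the
    values average more than three whitespace-separated tokens.
    """
    values = list(series.dropna())
    if not values or not all(isinstance(v, str) for v in values):
        return "other"
    mean_tokens = sum(len(v.split()) for v in values) / len(values)
    return "text" if mean_tokens > 3 else "other"


def find_text_fields_sketch(df: pd.DataFrame) -> list[str]:
    # Same shape as find_text_fields: classify each column, keep the "text" ones.
    return [col for col in df.columns if toy_field_type(df[col]) == "text"]


df = pd.DataFrame(
    {
        "age": [34, 51],
        "city": ["Oslo", "Lima"],
        "notes": [
            "Patient reported mild headaches after lunch",
            "No symptoms were observed during the follow-up visit",
        ],
    }
)
text_cols = find_text_fields_sketch(df)
print(text_cols)
```

Under this toy heuristic only `notes` is reported; the real `describe_field` may draw the line between short categorical strings and free text differently.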

`divide_tabular_text(df, text_fields)`

Split `df` into a tabular-only and a text-only DataFrame.

Columns present in `text_fields` go into the text DataFrame; the remaining columns go into the tabular DataFrame.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Source DataFrame to split.	required
`text_fields`	`list[str]`	Column names to treat as text.	required

Returns:

Type	Description
`tuple[DataFrame, DataFrame]`	A `(tabular_df, text_df)` tuple where `tabular_df` contains only the non-text columns and `text_df` contains only the text columns.

Source code in src/nemo_safe_synthesizer/evaluation/components/privacy_metric_utils.py

def divide_tabular_text(df: pd.DataFrame, text_fields: list[str]) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split ``df`` into a tabular-only and a text-only DataFrame.

    Columns present in ``text_fields`` go into the text DataFrame; the
    remaining columns go into the tabular DataFrame.

    Args:
        df: Source DataFrame to split.
        text_fields: Column names to treat as text.

    Returns:
        A ``(tabular_df, text_df)`` tuple where ``tabular_df`` contains only
        the non-text columns and ``text_df`` contains only the text columns.
    """
    tabular_fields = [col for col in df.columns if col not in text_fields]
    return df.filter(tabular_fields), df.filter(text_fields)
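
A short usage sketch of the split follows. The function body is copied from the listing above so the example runs on its own; the sample DataFrame and its column names are made up for illustration.

```python
import pandas as pd


def divide_tabular_text(df: pd.DataFrame, text_fields: list[str]) -> tuple[pd.DataFrame, pd.DataFrame]:
    # Copied from the source listing so this example is self-contained.
    tabular_fields = [col for col in df.columns if col not in text_fields]
    return df.filter(tabular_fields), df.filter(text_fields)


# Illustrative data only.
df = pd.DataFrame(
    {
        "age": [34, 51],
        "notes": ["mild headaches after lunch", "no symptoms at follow-up"],
        "city": ["Oslo", "Lima"],
        "review": ["would visit the clinic again", "long wait but friendly staff"],
    }
)

tabular_df, text_df = divide_tabular_text(df, ["notes", "review"])

print(list(tabular_df.columns))
print(list(text_df.columns))
```

Both results keep every row of `df`; only the columns are partitioned, so the two frames can be processed separately and still be aligned by index.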

`embed_text(df, embedder)`

Embed every text column in `df` and return a single averaged embedding per row.

For each column the `embedder` produces a `(n_rows, embed_dim)` matrix. The per-column matrices are stacked and averaged across columns so that every column contributes equally to the final embedding.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	DataFrame whose columns are all text to be embedded.	required
`embedder`	`SentenceTransformer`	Sentence-transformer model used to produce embeddings.	required

Returns:

Type	Description
`DataFrame`	Single-column DataFrame with column `"embedding"` whose values are 1-D tensors of shape `(embed_dim,)`.

Source code in src/nemo_safe_synthesizer/evaluation/components/privacy_metric_utils.py

def embed_text(df: pd.DataFrame, embedder: SentenceTransformer) -> pd.DataFrame:
    """Embed every text column in ``df`` and return a single averaged embedding per row.

    For each column the ``embedder`` produces a ``(n_rows, embed_dim)`` matrix.
    The per-column matrices are stacked and averaged across columns so that
    every column contributes equally to the final embedding.

    Args:
        df: DataFrame whose columns are all text to be embedded.
        embedder: Sentence-transformer model used to produce embeddings.

    Returns:
        Single-column DataFrame with column ``"embedding"`` whose values are
        1-D tensors of shape ``(embed_dim,)``.
    """
    embeddings = {}
    for col in df.columns:
        data = [str(r) for r in df[col].to_list()]
        embeddings[col] = torch.as_tensor(embedder.encode(data, show_progress_bar=False, convert_to_tensor=True))

    stacked = torch.stack([embeddings[col] for col in df.columns], dim=0)  # shape: (n_cols, n_rows, embed_dim)
    avg_embeddings = torch.mean(stacked, dim=0)  # shape: (n_rows, embed_dim)

    return pd.DataFrame({"embedding": list(avg_embeddings)})
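
The stack-and-average step can be checked by hand on a tiny input. The sketch below makes two assumptions that differ from the real function: NumPy arrays stand in for torch tensors, and `ToyEmbedder` is a hypothetical replacement for a `SentenceTransformer` that maps each string to a 2-D vector (character count, word count).

```python
import numpy as np
import pandas as pd


class ToyEmbedder:
    """Hypothetical stand-in for a SentenceTransformer (illustration only)."""

    def encode(self, sentences: list[str]) -> np.ndarray:
        # One row per sentence: [character count, word count].
        return np.array([[len(s), len(s.split())] for s in sentences], dtype=float)


def embed_text_sketch(df: pd.DataFrame, embedder: ToyEmbedder) -> pd.DataFrame:
    # One (n_rows, embed_dim) matrix per column, as in embed_text.
    per_column = [embedder.encode([str(v) for v in df[col].to_list()]) for col in df.columns]
    stacked = np.stack(per_column, axis=0)  # shape: (n_cols, n_rows, embed_dim)
    averaged = stacked.mean(axis=0)  # shape: (n_rows, embed_dim)
    return pd.DataFrame({"embedding": list(averaged)})


df = pd.DataFrame({"title": ["hi there", "ok"], "body": ["a b c d", "good day"]})
result = embed_text_sketch(df, ToyEmbedder())

# Row 0: title -> [8, 2], body -> [7, 4], so the mean is [7.5, 3.0].
# Row 1: title -> [2, 1], body -> [8, 2], so the mean is [5.0, 1.5].
print(result["embedding"].tolist())
```

Because the mean is taken over the column axis, a long column and a short column carry the same weight in each row's final vector, which matches the equal-contribution behavior described above.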
