privacy_metric_utils

Functions:

Name                 Description
find_text_fields     Identify columns in df whose content is free-form text.
divide_tabular_text  Split df into a tabular-only and a text-only DataFrame.
embed_text           Embed every text column in df and return a single averaged embedding per row.

find_text_fields(df)

Identify columns in df whose content is free-form text.

Each column is passed through describe_field; those classified as "text" are returned.

Parameters:

Name  Type       Description                             Default
df    DataFrame  DataFrame whose columns are inspected.  required

Returns:

Type       Description
list[str]  Column names classified as free-form text.

Source code in src/nemo_safe_synthesizer/evaluation/components/privacy_metric_utils.py
def find_text_fields(df: pd.DataFrame) -> list[str]:
    """Identify columns in ``df`` whose content is free-form text.

    Each column is passed through ``describe_field``; those classified
    as ``"text"`` are returned.

    Args:
        df: DataFrame whose columns are inspected.

    Returns:
        Column names classified as free-form text.
    """
    text_fields: list[str] = []
    for col in df.columns:
        field_info = describe_field(col, df[col])
        if field_info.type.value == "text":
            text_fields.append(col)
    return text_fields
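
As a rough illustration of the selection loop above, the sketch below runs it on a small DataFrame. The real describe_field is not shown on this page, so the FieldType, FieldInfo, and describe_field defined here are hypothetical stand-ins built on a toy word-count heuristic; only the loop itself mirrors the documented function.

```python
from dataclasses import dataclass
from enum import Enum

import pandas as pd


class FieldType(Enum):  # hypothetical stand-in for the library's field-type enum
    TEXT = "text"
    OTHER = "other"


@dataclass
class FieldInfo:  # hypothetical stand-in for describe_field's return type
    type: FieldType


def describe_field(name: str, series: pd.Series) -> FieldInfo:
    """Toy classifier (assumption): string columns averaging 3+ words are text."""
    values = series.dropna()
    if values.empty or not values.map(lambda v: isinstance(v, str)).all():
        return FieldInfo(FieldType.OTHER)
    avg_words = values.str.split().str.len().mean()
    return FieldInfo(FieldType.TEXT if avg_words >= 3 else FieldType.OTHER)


def find_text_fields(df: pd.DataFrame) -> list[str]:
    # Same selection loop as the documented function, run against the toy classifier.
    return [col for col in df.columns if describe_field(col, df[col]).type.value == "text"]


df = pd.DataFrame(
    {
        "age": [34, 51],
        "city": ["Oslo", "Lima"],
        "notes": ["patient reported mild headaches", "follow-up visit scheduled next week"],
    }
)
text_fields = find_text_fields(df)
print(text_fields)
```

With this toy heuristic only the notes column qualifies; the library's own classifier may draw the line differently.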

divide_tabular_text(df, text_fields)

Split df into a tabular-only and a text-only DataFrame.

Columns present in text_fields go into the text DataFrame; the remaining columns go into the tabular DataFrame.

Parameters:

Name         Type       Description                     Default
df           DataFrame  Source DataFrame to split.      required
text_fields  list[str]  Column names to treat as text.  required

Returns:

Type                         Description
tuple[DataFrame, DataFrame]  A (tabular_df, text_df) tuple where tabular_df contains only the non-text columns and text_df contains only the text columns.

Source code in src/nemo_safe_synthesizer/evaluation/components/privacy_metric_utils.py
def divide_tabular_text(df: pd.DataFrame, text_fields: list[str]) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split ``df`` into a tabular-only and a text-only DataFrame.

    Columns present in ``text_fields`` go into the text DataFrame; the
    remaining columns go into the tabular DataFrame.

    Args:
        df: Source DataFrame to split.
        text_fields: Column names to treat as text.

    Returns:
        A ``(tabular_df, text_df)`` tuple where ``tabular_df`` contains only
        the non-text columns and ``text_df`` contains only the text columns.
    """
    tabular_fields = [col for col in df.columns if col not in text_fields]
    return df.filter(tabular_fields), df.filter(text_fields)
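
A minimal usage sketch of the split, with the function body copied from the listing above so that the example is self-contained. The sample data and the missing_col name are made up for illustration. It also shows one consequence of using DataFrame.filter: names in text_fields that are not columns of df are ignored without raising an error.

```python
import pandas as pd


def divide_tabular_text(df, text_fields):
    # Copied from the source listing above.
    tabular_fields = [col for col in df.columns if col not in text_fields]
    return df.filter(tabular_fields), df.filter(text_fields)


df = pd.DataFrame(
    {
        "age": [34, 51],
        "notes": ["first note", "second note"],
        "city": ["Oslo", "Lima"],
    }
)

# "missing_col" is not a column of df; filter() skips it without raising.
tabular_df, text_df = divide_tabular_text(df, ["notes", "missing_col"])

print(list(tabular_df.columns))
print(list(text_df.columns))
```

Both returned frames keep the original row index, so they can be joined back together later if needed.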

embed_text(df, embedder)

Embed every text column in df and return a single averaged embedding per row.

For each column the embedder produces a (n_rows, embed_dim) matrix. The per-column matrices are stacked and averaged across columns so that every column contributes equally to the final embedding.

Parameters:

Name      Type                 Description                                             Default
df        DataFrame            DataFrame whose columns are all text to be embedded.    required
embedder  SentenceTransformer  Sentence-transformer model used to produce embeddings.  required

Returns:

Type       Description
DataFrame  Single-column DataFrame with column "embedding" whose values are 1-D tensors of shape (embed_dim,).

Source code in src/nemo_safe_synthesizer/evaluation/components/privacy_metric_utils.py
def embed_text(df: pd.DataFrame, embedder: SentenceTransformer) -> pd.DataFrame:
    """Embed every text column in ``df`` and return a single averaged embedding per row.

    For each column the ``embedder`` produces a ``(n_rows, embed_dim)`` matrix.
    The per-column matrices are stacked and averaged across columns so that
    every column contributes equally to the final embedding.

    Args:
        df: DataFrame whose columns are all text to be embedded.
        embedder: Sentence-transformer model used to produce embeddings.

    Returns:
        Single-column DataFrame with column ``"embedding"`` whose values are
        1-D tensors of shape ``(embed_dim,)``.
    """
    embeddings = {}
    for col in df.columns:
        data = [str(r) for r in df[col].to_list()]
        embeddings[col] = torch.as_tensor(embedder.encode(data, show_progress_bar=False, convert_to_tensor=True))

    stacked = torch.stack([embeddings[col] for col in df.columns], dim=0)  # shape: (n_cols, n_rows, embed_dim)
    avg_embeddings = torch.mean(stacked, dim=0)  # shape: (n_rows, embed_dim)

    return pd.DataFrame({"embedding": list(avg_embeddings)})
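
The stack-then-average step can be checked by hand with a small sketch. To avoid loading a model, it swaps torch for NumPy and replaces SentenceTransformer with ToyEmbedder, a hypothetical stand-in whose "embedding" is just (character count, word count). The function embed_text_numpy is likewise an illustrative name, not part of the library.

```python
import numpy as np
import pandas as pd


class ToyEmbedder:
    """Hypothetical stand-in for SentenceTransformer (assumption, not the real model)."""

    def encode(self, texts):
        # Embedding = (character count, word count): deterministic and easy to verify.
        return np.array([[len(t), len(t.split())] for t in texts], dtype=float)


def embed_text_numpy(df, embedder):
    # One (n_rows, embed_dim) matrix per column, as in the documented function.
    per_column = [embedder.encode([str(v) for v in df[col].to_list()]) for col in df.columns]
    stacked = np.stack(per_column, axis=0)  # shape: (n_cols, n_rows, embed_dim)
    averaged = stacked.mean(axis=0)  # shape: (n_rows, embed_dim)
    return pd.DataFrame({"embedding": list(averaged)})


df = pd.DataFrame({"title": ["hi", "good day"], "body": ["a b c d", "ok"]})
result = embed_text_numpy(df, ToyEmbedder())

# Row 0: mean of (2, 1) and (7, 4); row 1: mean of (8, 2) and (2, 1).
print(result["embedding"][0])
print(result["embedding"][1])
```

Because every value goes through str() before encoding, missing values are embedded as the literal strings "None" or "nan" rather than being skipped, both here and in the documented function.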