privacy_metric_utils
Functions:
| Name | Description |
|---|---|
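| find_text_fields | Identify columns in df whose content is free-form text. |
| divide_tabular_text | Split df into a tabular-only and a text-only DataFrame. |
| embed_text | Embed every text column in df and return a single averaged embedding per row. |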
find_text_fields(df)
Identify columns in df whose content is free-form text.
Each column is passed through describe_field; those classified
as "text" are returned.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | DataFrame whose columns are inspected. | required |
Returns:
| Type | Description |
|---|---|
| list[str] | Column names classified as free-form text. |
Source code in src/nemo_safe_synthesizer/evaluation/components/privacy_metric_utils.py
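The column-by-column classification described above can be sketched as follows. The behaviour of describe_field is not shown on this page, so this sketch swaps in a hypothetical word-count heuristic, looks_like_text, purely for illustration; the names find_text_fields_sketch, looks_like_text, and min_avg_words are assumptions and not part of the library.

```python
import pandas as pd


def looks_like_text(series: pd.Series, min_avg_words: float = 3.0) -> bool:
    """Hypothetical stand-in for describe_field (assumption, not the library's logic).

    Treats a column as free-form text when it holds strings that average
    at least `min_avg_words` whitespace-separated words per value.
    """
    is_stringy = pd.api.types.is_object_dtype(series) or pd.api.types.is_string_dtype(series)
    if not is_stringy:
        return False
    values = series.dropna().astype(str)
    if values.empty:
        return False
    avg_words = values.str.split().str.len().mean()
    return bool(avg_words >= min_avg_words)


def find_text_fields_sketch(df: pd.DataFrame) -> list[str]:
    """Return the names of columns that the stand-in classifier labels as text."""
    return [col for col in df.columns if looks_like_text(df[col])]


# Small illustrative frame: one numeric, one short categorical, one free-text column.
df = pd.DataFrame(
    {
        "age": [34, 51, 27],
        "city": ["Oslo", "Lima", "Pune"],
        "notes": [
            "Patient reported mild headaches after lunch.",
            "No complaints during the follow-up visit.",
            "Requested a refill of the usual prescription.",
        ],
    }
)
text_cols = find_text_fields_sketch(df)
print(text_cols)
```

The real classifier may use very different signals (for example cardinality or value length), so the heuristic here should be read only as a placeholder for "classify each column, keep the text ones".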
divide_tabular_text(df, text_fields)
Split df into a tabular-only and a text-only DataFrame.
Columns present in text_fields go into the text DataFrame; the
remaining columns go into the tabular DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Source DataFrame to split. | required |
| text_fields | list[str] | Column names to treat as text. | required |
Returns:
| Type | Description |
|---|---|
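| DataFrame | Tabular DataFrame containing the non-text columns (returned first). |
| DataFrame | Text DataFrame containing the columns listed in text_fields (returned second). |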
Source code in src/nemo_safe_synthesizer/evaluation/components/privacy_metric_utils.py
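A minimal sketch of the split described above, using plain pandas column selection. The function name divide_tabular_text_sketch is hypothetical, and the return order (tabular first, text second) is an assumption inferred from the description rather than confirmed from the source.

```python
import pandas as pd


def divide_tabular_text_sketch(
    df: pd.DataFrame, text_fields: list[str]
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split df by column membership in text_fields.

    Assumption: the tabular frame is returned first and the text frame second.
    Column order within each frame follows the order in df.
    """
    wanted = set(text_fields)
    text_cols = [col for col in df.columns if col in wanted]
    tabular_cols = [col for col in df.columns if col not in wanted]
    return df[tabular_cols], df[text_cols]


df = pd.DataFrame(
    {
        "age": [34, 51],
        "notes": ["Mild headaches after lunch.", "No complaints at follow-up."],
        "city": ["Oslo", "Lima"],
    }
)
tabular_df, text_df = divide_tabular_text_sketch(df, ["notes"])
print(list(tabular_df.columns), list(text_df.columns))
```

Both outputs keep the original row index, so the two halves can be re-aligned row by row later.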
embed_text(df, embedder)
Embed every text column in df and return a single averaged embedding per row.
For each column, the embedder produces an (n_rows, embed_dim) matrix.
The per-column matrices are stacked and averaged across columns so that
every column contributes equally to the final embedding.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | DataFrame whose columns are all text to be embedded. | required |
| embedder | SentenceTransformer | Sentence-transformer model used to produce embeddings. | required |
Returns:
| Type | Description |
|---|---|
| DataFrame | Single-column DataFrame whose cells are 1-D tensors of shape (embed_dim,), one per row of df. |
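The stack-then-average step described for embed_text can be sketched as below. To keep the example runnable without downloading a model, it uses a hypothetical ToyEmbedder whose encode method mimics the list-of-strings-in, (n_rows, embed_dim)-array-out behaviour of a sentence-transformer. The names embed_text_sketch and ToyEmbedder, the output column name "embedding", and the use of NumPy arrays in place of tensors are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


class ToyEmbedder:
    """Hypothetical stand-in for a sentence-transformer model (assumption).

    encode() maps each string to 4 deterministic toy features.
    """

    def encode(self, sentences: list[str]) -> np.ndarray:
        rows = [
            [len(s), len(s.split()), sum(ch in "aeiou" for ch in s.lower()), 1.0]
            for s in sentences
        ]
        return np.asarray(rows, dtype=np.float32)


def embed_text_sketch(df: pd.DataFrame, embedder, column: str = "embedding") -> pd.DataFrame:
    """Average per-column embeddings into one vector per row.

    `column` is a placeholder name; the real output column name is not shown here.
    """
    # One (n_rows, embed_dim) matrix per text column.
    per_column = [
        np.asarray(embedder.encode(df[col].astype(str).tolist())) for col in df.columns
    ]
    stacked = np.stack(per_column, axis=0)  # (n_cols, n_rows, embed_dim)
    averaged = stacked.mean(axis=0)  # (n_rows, embed_dim), equal weight per column

    # Store each row vector as a single object cell.
    cells = np.empty(len(averaged), dtype=object)
    for i, row in enumerate(averaged):
        cells[i] = row
    return pd.DataFrame({column: cells}, index=df.index)


text_df = pd.DataFrame(
    {
        "review": ["good service", "slow delivery", "great value overall"],
        "comment": ["will return", "not again", "recommended to friends"],
    }
)
out = embed_text_sketch(text_df, ToyEmbedder())
print(out.shape, out["embedding"].iloc[0].shape)
```

Averaging over the column axis, rather than concatenating the text first, is what gives every column the same weight regardless of how long its strings are.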