stats

`stats` ¶

Statistical functions for synthetic data evaluation.

Provides distribution comparison (Jensen-Shannon), correlation matrix computation (Pearson, Theil's U, Correlation Ratio), PCA, and data-overlap detection used by the evaluation components.

Functions:

Name	Description
`count_memorized_lines`	Count exact row matches between training and synthetic data.
`get_categorical_field_distribution`	Compute the normalized value-count distribution of a categorical column.
`get_numeric_distribution_bins`	Compute shared histogram bin edges for two numeric series.
`get_numeric_field_distribution`	Compute the normalized distribution of a numeric column cut into bins.
`compute_distribution_distance`	Compute the Jensen-Shannon distance between two distributions.
`calculate_pearsons_r`	Compute the Pearson correlation coefficient for a column pair.
`calculate_correlation_ratio`	Compute the Correlation Ratio for a categorical-numeric column pair.
`calculate_theils_u`	Compute Theil's U (uncertainty coefficient) for two categorical columns.
`calculate_correlation`	Build a full correlation matrix using Pearson, Theil's U, and Correlation Ratio.
`normalize_dataset`	Normalize a dataframe for PCA: fill missing values, encode categoricals, and standardize.
`compute_pca`	Run PCA on a single dataframe after normalization.
`compute_joined_pcas`	Run joined PCA: fit on reference, transform both reference and output.
`count_missing`	Count total missing (NaN/null) values across all cells.
`percent_missing`	Compute the percentage of missing values in a dataframe.

`count_memorized_lines(df1, df2)` ¶

Count exact row matches between training and synthetic data.

Parameters:

Name	Type	Description	Default
`df1`	`DataFrame`	Training dataframe.	required
`df2`	`DataFrame`	Synthetic dataframe.	required

Returns:

Type	Description
`int`	Number of distinct rows present in both dataframes, matched on the columns they share.

Source code in src/nemo_safe_synthesizer/evaluation/statistics/stats.py

def count_memorized_lines(df1: pd.DataFrame, df2: pd.DataFrame) -> int:
    """Count exact row matches between training and synthetic data.

    Args:
        df1: Training dataframe.
        df2: Synthetic dataframe.

    Returns:
        Number of rows present in both dataframes after deduplication.
    """

    # Look for cases where col is numeric in one df, object in the other. Attempt to cast to float.
    def _uptype_object_to_float(
        l: pd.DataFrame,  # noqa: E741
        r: pd.DataFrame,  # noqa: E741
    ) -> tuple[pd.DataFrame, pd.DataFrame]:
        for col in set(l.columns).intersection(set(r.columns)):
            if is_numeric_dtype(l[col]) or is_numeric_dtype(r[col]):
                if not is_numeric_dtype(l[col]) or not is_numeric_dtype(r[col]):
                    try:
                        l = l.astype({col: "float"})  # noqa: E741
                        r = r.astype({col: "float"})  # noqa: E741
                    except Exception:
                        # In particular ValueErrors if the non-numeric is not convertible, but catch everything.
                        pass
        return l, r

    # Convert any numeric fields within a df to 'float' first.
    def _floatify(df: pd.DataFrame) -> pd.DataFrame:
        conversions = {}
        for col in df.columns:
            if is_numeric_dtype(df[col]):
                conversions[col] = "float"
        return df.astype(conversions)

    # If one col in a df is numeric and the corresponding one in the other df is NOT, cast to 'object'.
    def _objectify(
        l: pd.DataFrame,  # noqa: E741
        r: pd.DataFrame,  # noqa: E741
    ) -> tuple[pd.DataFrame, pd.DataFrame]:
        conversions = {}
        for col in l.columns:
            if col in r.columns:
                if not is_numeric_dtype(l[col]) or not is_numeric_dtype(r[col]):
                    conversions[col] = "object"
        return l.astype(conversions), r.astype(conversions)

    # Do the casts.
    l, r = _uptype_object_to_float(df1, df2)  # noqa: E741
    l, r = _objectify(_floatify(l), _floatify(r))  # noqa: E741

    # Do an inner join on the intersection of columns present in both dfs.
    inner_join = pd.merge(l.drop_duplicates(), r.drop_duplicates())

    return len(inner_join)
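
As a rough illustration of the counting step, the sketch below works on plain tuples instead of dataframes. The helper name `count_shared_rows` is hypothetical, and the sketch ignores the missing-value and object-dtype handling of the real function; it only shows that rows are deduplicated on each side and that integer and float representations of the same number are treated as equal.

```python
def count_shared_rows(train_rows, synth_rows):
    """Count distinct rows found in both inputs (hypothetical helper)."""

    def normalize(row):
        # Put ints and floats on equal footing, echoing the float cast in the listing.
        return tuple(
            float(v) if isinstance(v, (int, float)) and not isinstance(v, bool) else v
            for v in row
        )

    # Sets deduplicate each side; the intersection plays the role of the inner join.
    return len({normalize(r) for r in train_rows} & {normalize(r) for r in synth_rows})


train = [(1, "a"), (2, "b"), (2, "b"), (3, "c")]
synth = [(2.0, "b"), (3, "c"), (3, "c"), (4, "d")]
shared = count_shared_rows(train, synth)
print(shared)  # 2
```

Duplicated rows on either side count once, so the result is bounded by the number of distinct rows in the smaller input.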

`get_categorical_field_distribution(field)` ¶

Compute the normalized value-count distribution of a categorical column.

Parameters:

Name	Type	Description	Default
`field`	`Series`	Column series to analyze.	required

Returns:

Type	Description
`dict`	Mapping of `{value_str: percentage}` where percentages are in `[0, 100]`.

Source code in src/nemo_safe_synthesizer/evaluation/statistics/stats.py

def get_categorical_field_distribution(field: pd.Series) -> dict:
    """Compute the normalized value-count distribution of a categorical column.

    Args:
        field: Column series to analyze.

    Returns:
        Mapping of ``{value_str: percentage}`` where percentages are in ``[0, 100]``.
    """
    distribution = {}
    if len(field) > 0:
        for v in field:
            distribution[str(v)] = distribution.get(str(v), 0) + 1
        series_len = float(len(field))
        for k in distribution.keys():
            distribution[k] = distribution[k] * 100 / series_len
    return distribution
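
A minimal sketch of the same computation on a plain list, assuming a hypothetical helper `categorical_distribution`. It shows two consequences of the listing: values are keyed by their string form (so `1` and `"1"` share a bucket), and the result is expressed in percent rather than as a fraction.

```python
from collections import Counter


def categorical_distribution(values):
    """Percentage of each stringified value (hypothetical helper)."""
    if not values:
        return {}
    counts = Counter(str(v) for v in values)
    return {key: count * 100 / len(values) for key, count in counts.items()}


dist = categorical_distribution(["red", "blue", "red", 1, "1", None, "red", "blue"])
print(dist)
```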

`get_numeric_distribution_bins(training, synthetic)` ¶

Compute shared histogram bin edges for two numeric series.

Uses the "doane" strategy on the combined data, falling back to 500 fixed bins if the result is empty or has more than 500 edges.

Parameters:

Name	Type	Description	Default
`training`	`Series`	Numeric series from the training dataframe.	required
`synthetic`	`Series`	Numeric series from the synthetic dataframe.	required

Returns:

Type	Description
	Array of bin edges spanning both series.

Source code in src/nemo_safe_synthesizer/evaluation/statistics/stats.py

def get_numeric_distribution_bins(training: pd.Series, synthetic: pd.Series):
    """Compute shared histogram bin edges for two numeric series.

    Uses the ``"doane"`` strategy on the combined data, falling back to
    500 fixed bins if the result is empty or too large.

    Args:
        training: Numeric series from the training dataframe.
        synthetic: Numeric series from the synthetic dataframe.

    Returns:
        Array of bin edges spanning both series.
    """
    training = training.replace([np.inf, -np.inf], np.nan).dropna().astype("float64")
    synthetic = synthetic.replace([np.inf, -np.inf], np.nan).dropna().astype("float64")
    # Numeric data. Want the same bins between both df's. We bin based on scrubbed data.
    if len(training) == 0:
        min_value = np.nanmin(synthetic)
        max_value = np.nanmax(synthetic)
    elif len(synthetic) == 0:
        min_value = np.nanmin(training)
        max_value = np.nanmax(training)
    else:
        min_value = min(np.nanmin(training), np.nanmin(synthetic))
        max_value = max(np.nanmax(training), np.nanmax(synthetic))
    bins = np.array([], dtype=np.float64)

    # Use 'doane' to find bins. 'fd' causes too many OOM issues.
    # We also bin across the training and synthetic Series combined since we are binning across the combined range, otherwise we can see OOM's or sigkill's.
    try:
        bins = np.histogram_bin_edges(pd.concat([training, synthetic]), bins="doane", range=(min_value, max_value))
    except Exception:
        pass
    # If 'doane' still doesn't do the trick just force 500 bins.
    if len(bins) == 0 or len(bins) > 500:
        try:
            bins = np.histogram_bin_edges(pd.concat([training, synthetic]), bins=500, range=(min_value, max_value))
        except Exception:
            pass
    return bins
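
For readers unfamiliar with Doane's rule, the sketch below computes equal-width edges from the textbook formula using only the standard library. The helper `doane_bin_edges` is hypothetical, it assumes at least three finite values with non-zero spread, and NumPy's own rounding of the bin count may differ slightly from this version.

```python
import math


def doane_bin_edges(values):
    """Equal-width bin edges using Doane's bin count (hypothetical helper)."""
    n = len(values)
    lo, hi = min(values), max(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    g1 = m3 / m2 ** 1.5 if m2 > 0 else 0.0  # skewness
    sigma_g1 = math.sqrt(6.0 * (n - 2) / ((n + 1) * (n + 3)))
    # Doane: Sturges' count plus a correction that grows with skewness.
    k = 1 + math.log2(n) + math.log2(1 + abs(g1) / sigma_g1)
    num_bins = math.ceil(k)
    if num_bins + 1 > 500:
        num_bins = 500  # same cap as the fallback in the listing
    width = (hi - lo) / num_bins
    return [lo + i * width for i in range(num_bins + 1)]


training = [0.0, 1.0, 2.0, 3.0]
synthetic = [4.0, 5.0, 6.0, 7.0]
edges = doane_bin_edges(training + synthetic)  # bins span both series
print(edges)  # [0.0, 1.75, 3.5, 5.25, 7.0]
```

Computing the edges on the concatenated data is what lets the training and synthetic histograms be compared bin by bin.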

`get_numeric_field_distribution(field, bins)` ¶

Compute the normalized distribution of a numeric column cut into bins.

Parameters:

Name	Type	Description	Default
`field`	`Series`	Numeric column series.	required
`bins`		Bin edges (typically from `get_numeric_distribution_bins`).	required

Returns:

Type	Description
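`dict`	Mapping of `{bin_label: proportion}` where proportions are in `[0, 1]`. Proportions are relative to the full series length, so they sum to less than 1 when values are missing or fall outside the bins.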

Source code in src/nemo_safe_synthesizer/evaluation/statistics/stats.py

def get_numeric_field_distribution(field: pd.Series, bins) -> dict:
    """Compute the normalized distribution of a numeric column cut into bins.

    Args:
        field: Numeric column series.
        bins: Bin edges (typically from ``get_numeric_distribution_bins``).

    Returns:
        Mapping of ``{bin_label: proportion}`` where proportions are in ``[0, 1]``.
    """
    binned_data = pd.cut(field, bins, include_lowest=True)
    distribution = {}
    for d in binned_data:
        if str(d) != "nan":
            distribution[str(d)] = distribution.get(str(d), 0) + 1
    field_length = len(binned_data)
    for k in distribution.keys():
        distribution[k] = distribution[k] / field_length
    return distribution
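
The sketch below mimics the binning with the standard library, assuming a hypothetical helper `binned_proportions` and simplified bin labels (pandas formats the first interval differently). It illustrates that unbinned values are skipped in the counts but still contribute to the denominator.

```python
import bisect
import math


def binned_proportions(values, edges):
    """Share of values per right-closed bin (hypothetical helper)."""
    counts = {}
    for v in values:
        if math.isnan(v) or v < edges[0] or v > edges[-1]:
            continue  # not assigned to a bin, yet still part of len(values)
        # Right-closed intervals; the lowest edge belongs to the first bin.
        i = max(bisect.bisect_left(edges, v) - 1, 0)
        label = f"({edges[i]}, {edges[i + 1]}]"
        counts[label] = counts.get(label, 0) + 1
    return {label: count / len(values) for label, count in counts.items()}


props = binned_proportions([1.0, 2.0, 2.5, 4.0, float("nan")], [1.0, 2.0, 3.0, 4.0])
print(props)
```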

`compute_distribution_distance(d1, d2)` ¶

Compute the Jensen-Shannon distance between two distributions.

Parameters:

Name	Type	Description	Default
`d1`	`dict`	First distribution dict (values are a probability vector).	required
`d2`	`dict`	Second distribution dict.	required

Returns:

Type	Description
`float`	JS distance in `[0, 1]`. Returns `0.5887` if either distribution sums to zero, and `0.0` if both dicts are empty.

Source code in src/nemo_safe_synthesizer/evaluation/statistics/stats.py

def compute_distribution_distance(d1: dict, d2: dict) -> float:
    """Compute the Jensen-Shannon distance between two distributions.

    Args:
        d1: First distribution dict (values are a probability vector).
        d2: Second distribution dict.

    Returns:
        JS distance in ``[0, 1]``. Returns ``0.5887`` if either distribution
        sums to zero.
    """
    all_keys = set(d1.keys()).union(set(d2.keys()))
    if len(all_keys) == 0:
        return 0.0
    d1_values = []
    d2_values = []
    for k in all_keys:
        d1_values.append(d1.get(k, 0.0))
        d2_values.append(d2.get(k, 0.0))
    sd1 = sum(d1_values)
    sd2 = sum(d2_values)
    if sd1 == 0 or np.isnan(sd1) or sd2 == 0 or np.isnan(sd2):
        return 0.5887
    return float(jensenshannon(np.asarray(d1_values), np.asarray(d2_values), base=2))
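
To make the metric concrete, here is a standard-library sketch of the base-2 Jensen-Shannon distance over two dicts. The helper `js_distance` is hypothetical and assumes both inputs have a positive sum; like SciPy's `jensenshannon`, it normalizes each vector first, which is why percentage-valued and fraction-valued dicts can be compared.

```python
import math


def js_distance(d1, d2):
    """Base-2 Jensen-Shannon distance between two dicts (hypothetical helper)."""
    keys = sorted(set(d1) | set(d2))
    p = [d1.get(k, 0.0) for k in keys]
    q = [d2.get(k, 0.0) for k in keys]
    p = [v / sum(p) for v in p]  # normalize to probability vectors
    q = [v / sum(q) for v in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence in bits; zero-probability terms contribute 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return math.sqrt(max((kl(p, m) + kl(q, m)) / 2, 0.0))


same = js_distance({"a": 50, "b": 50}, {"a": 0.5, "b": 0.5})
disjoint = js_distance({"a": 1.0}, {"b": 1.0})
print(same, disjoint)  # 0.0 1.0
```

Keys missing from one dict are treated as zero mass, matching the `d.get(k, 0.0)` calls in the listing.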

`calculate_pearsons_r(x, y, opt)` ¶

Compute the Pearson correlation coefficient for a column pair.

Parameters:

Name	Type	Description	Default
`x`	`Series \| ndarray`	First input array.	required
`y`	`Series \| ndarray`	Second input array.	required
`opt`	`bool`	If `False`, drop rows where either value is NaN before computing. If `True`, assume NaNs have already been replaced.	required

Returns:

Type	Description
`tuple[float, float]`	Tuple of (Pearson r, two-tailed p-value).

Source code in src/nemo_safe_synthesizer/evaluation/statistics/stats.py

def calculate_pearsons_r(x: pd.Series | np.ndarray, y: pd.Series | np.ndarray, opt: bool) -> tuple[float, float]:
    """Compute the Pearson correlation coefficient for a column pair.

    Args:
        x: First input array.
        y: Second input array.
        opt: If ``False``, drop rows where either value is NaN before
            computing.  If ``True``, assume NaNs have already been replaced.

    Returns:
        Tuple of (Pearson r, two-tailed p-value).
    """
    if not opt:
        # drop missing values, when either the x or y value is null/nan
        arr = (
            pd.DataFrame(np.array([x, y]).transpose(), columns=["x", "y"])
            .replace([np.inf, -np.inf], np.nan)
            .dropna(axis="index", how="any")
        )
        conditions = [len(arr["x"]) <= 1, len(arr["y"]) <= 1, arr["x"].nunique() <= 1, arr["y"].nunique() <= 1]
        if any(conditions):
            return 0.0, 0.0
        return pearsonr(arr["x"], arr["y"])
    else:
        # else we've already replaced nan's with 0's for entire datafile
        return pearsonr(x, y)
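
The `opt=False` path can be sketched without SciPy as follows. The helper `pearson_r_pairwise` is hypothetical and returns only the coefficient, not the p-value; it drops any pair with a non-finite member and falls back to `0.0` for degenerate inputs, as the listing does.

```python
import math


def pearson_r_pairwise(xs, ys):
    """Pearson r after pairwise removal of NaN/inf (hypothetical helper)."""
    pairs = [(x, y) for x, y in zip(xs, ys) if math.isfinite(x) and math.isfinite(y)]
    if len(pairs) <= 1:
        return 0.0
    mean_x = sum(x for x, _ in pairs) / len(pairs)
    mean_y = sum(y for _, y in pairs) / len(pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column has no defined correlation
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    return sxy / math.sqrt(sxx * syy)


nan = float("nan")
r = pearson_r_pairwise([1.0, 2.0, nan, 4.0, 5.0], [2.0, 4.0, 6.0, nan, 10.0])
flat = pearson_r_pairwise([1.0, 1.0, 1.0], [1.0, 2.0, 3.0])
print(r, flat)
```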

`calculate_correlation_ratio(x, y, opt)` ¶

Compute the Correlation Ratio for a categorical-numeric column pair.

Parameters:

Name	Type	Description	Default
`x`	`Series`	Categorical input array.	required
`y`	`Series`	Numeric input array.	required
`opt`	`bool`	If `False`, treat infinities as NaN and drop rows where `x` or `y` is NaN before computing.	required

Returns:

Type	Description
`float`	Correlation ratio in `[0, 1]`.

Source code in src/nemo_safe_synthesizer/evaluation/statistics/stats.py

def calculate_correlation_ratio(x: pd.Series, y: pd.Series, opt: bool) -> float:
    """Compute the Correlation Ratio for a categorical-numeric column pair.

    Args:
        x: Categorical input array.
        y: Numeric input array.
        opt: If ``False``, drop rows where ``y`` is NaN before computing.

    Returns:
        Correlation ratio in ``[0, 1]``.
    """
    if not opt:
        # Drop missing values if y (the numeric column) is null/nan
        df = pd.DataFrame({"x": x, "y": y}).replace(to_replace=[np.inf, -np.inf], value=np.nan, inplace=False).dropna()
        x = df["x"]
        y = df["y"]
    if len(x) < 2 or len(y) < 2:
        return 0.0
    else:
        # Either way, we've dealt with missing values by now, so tell dython not to do anything
        return correlation_ratio(x, y, nan_strategy="none")
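
For intuition, the sketch below implements the textbook correlation ratio (the square root of between-group variance over total variance). The helper `correlation_ratio_sketch` is hypothetical; edge-case handling in the `correlation_ratio` function called by the listing may differ.

```python
import math
from collections import defaultdict


def correlation_ratio_sketch(categories, values):
    """Textbook correlation ratio (eta) for a categorical/numeric pair."""
    groups = defaultdict(list)
    for category, value in zip(categories, values):
        groups[category].append(value)
    overall = sum(values) / len(values)
    # Weighted squared distance of each group mean from the overall mean.
    between = sum(len(g) * (sum(g) / len(g) - overall) ** 2 for g in groups.values())
    total = sum((v - overall) ** 2 for v in values)
    return math.sqrt(between / total) if total > 0 else 0.0


cats = ["a", "a", "b", "b"]
separated = correlation_ratio_sketch(cats, [1.0, 1.0, 3.0, 3.0])
mixed = correlation_ratio_sketch(cats, [1.0, 3.0, 1.0, 3.0])
print(separated, mixed)  # 1.0 0.0
```

When almost every category holds a single value, each group mean equals its only member and the ratio drifts toward 1 regardless of any real association.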

`calculate_theils_u(x, y)` ¶

Compute Theil's U (uncertainty coefficient) for two categorical columns.

Parameters:

Name	Type	Description	Default
`x`		First categorical array.	required
`y`		Second categorical array.	required

Returns:

Type	Description
	Theil's U in `[0, 1]`.

Source code in src/nemo_safe_synthesizer/evaluation/statistics/stats.py

def calculate_theils_u(x, y):
    """Compute Theil's U (uncertainty coefficient) for two categorical columns.

    Args:
        x: First categorical array.
        y: Second categorical array.

    Returns:
        Theil's U in ``[0, 1]``.
    """
    # Drop missing values if x or y is null/nan
    df = pd.DataFrame({"x": x, "y": y})
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    df.dropna(inplace=True)
    x = df["x"]
    y = df["y"]
    if len(x) == 0 or len(y) == 0:
        return 0
    else:
        return theils_u(x, y, nan_strategy="none")
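
Because Theil's U is asymmetric, argument order matters. The standard-library sketch below computes U(x|y) from its entropy definition; `theils_u_sketch` is a hypothetical helper, and returning `1.0` when `x` has zero entropy is a convention assumed here.

```python
import math
from collections import Counter


def theils_u_sketch(x, y):
    """U(x|y): fraction of the entropy of x that knowing y removes."""
    n = len(x)
    x_counts, y_counts = Counter(x), Counter(y)
    h_x = -sum(c / n * math.log2(c / n) for c in x_counts.values())
    if h_x == 0:
        return 1.0  # assumed convention: a constant x is fully "explained"
    # H(x|y) = -sum p(x, y) * log(p(x, y) / p(y))
    h_x_given_y = -sum(
        c / n * math.log2(c / y_counts[y_val])
        for (_, y_val), c in Counter(zip(x, y)).items()
    )
    return (h_x - h_x_given_y) / h_x


x = ["a", "a", "b", "b"]
y = ["p", "q", "r", "s"]
u_x_given_y = theils_u_sketch(x, y)  # y pins down x completely
u_y_given_x = theils_u_sketch(y, x)  # x only halves the uncertainty about y
print(u_x_given_y, u_y_given_x)  # 1.0 0.5
```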

`calculate_correlation(df, nominal_columns=None, job_count=_DEFAULT_JOB_COUNT, opt=False)` ¶

Build a full correlation matrix using Pearson, Theil's U, and Correlation Ratio.

Numeric-numeric pairs use Pearson's r, categorical-categorical pairs use Theil's U, and categorical-numeric pairs use Correlation Ratio (or Theil's U for highly-unique categoricals).

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Input dataframe.	required
`nominal_columns`	`list[str] \| None`	Columns to treat as categorical.	`None`
`job_count`	`int`	Number of parallel jobs for pairwise computations.	`_DEFAULT_JOB_COUNT`
`opt`	`bool`	If `True`, globally replace NaNs with `0` for speed (slightly less accurate). Infinities are replaced as well, and `df` is modified in place.	`False`

Returns:

Type	Description
`DataFrame`	Square correlation dataframe indexed and columned by `df.columns`.

Source code in src/nemo_safe_synthesizer/evaluation/statistics/stats.py

def calculate_correlation(
    df: pd.DataFrame,
    nominal_columns: list[str] | None = None,
    job_count: int = _DEFAULT_JOB_COUNT,
    opt: bool = False,
) -> pd.DataFrame:
    """Build a full correlation matrix using Pearson, Theil's U, and Correlation Ratio.

    Numeric-numeric pairs use Pearson's r, categorical-categorical pairs use
    Theil's U, and categorical-numeric pairs use Correlation Ratio (or
    Theil's U for highly-unique categoricals).

    Args:
        df: Input dataframe.
        nominal_columns: Columns to treat as categorical.
        job_count: Number of parallel jobs for pairwise computations.
        opt: If ``True``, globally replace NaNs with ``0`` for speed
            (slightly less accurate).

    Returns:
        Square correlation dataframe indexed and columned by ``df.columns``.
    """
    # PLAT-1131 Ensure that all nominal columns are present in df.
    if nominal_columns is not None:
        _df_cols = df.columns
        nominal_columns = [col for col in nominal_columns if col in _df_cols]
    # If opt is True, then go the faster (just not quite as accurate) route of global replace missing with 0
    if opt:
        df.replace([np.inf, -np.inf], np.nan, inplace=True)
        df.fillna(_DEFAULT_REPLACE_VALUE, inplace=True)

    columns = df.columns
    if nominal_columns is None:
        nominal_columns = list()

    df_cp = df.copy()
    # Convert pd.Int64Dtype() and boolean columns to object dtype to handle NaN values with NaN proxies.
    pandas_dtypes = df_cp[nominal_columns].columns[
        df_cp[nominal_columns].dtypes.apply(lambda x: x.name in [pd.BooleanDtype(), pd.Int64Dtype()])
    ]

    df_cp[pandas_dtypes] = df_cp[pandas_dtypes].astype("object")
    # Replace NaNs with NaN proxy only for nominal columns. This helps with more consistency in FCS regardless of the generated row counts.
    df_cp[nominal_columns] = df_cp[nominal_columns].fillna(f"safe-synthesizer-{uuid.uuid4().hex}")

    corr = np.zeros((len(columns), len(columns)))
    single_value_columns = []
    numeric_columns = []

    column_to_index = {}

    # Set up all the column groupings needed for correlation
    for i, c in enumerate(columns):
        if df_cp[c].nunique() == 1:
            single_value_columns.append(c)
        elif c not in nominal_columns:
            if df_cp[c].dtype == "object":
                nominal_columns.append(c)
            else:
                numeric_columns.append(c)

        column_to_index[c] = i

    # Replace NaNs with NaN proxy one more time since the nominal column might have updated.
    df_cp[nominal_columns] = df_cp[nominal_columns].fillna(f"safe-synthesizer-{uuid.uuid4().hex}")
    nominal = [x for x in nominal_columns if x not in single_value_columns]
    df_rows = df_cp.shape[0]
    high_unique_nominal = []
    completely_unique_nominal = []
    not_high_unique_nominal = []
    uniqueness_ratios = df_cp.nunique() / df_rows
    for c in nominal:
        if uniqueness_ratios[c] == 1:
            completely_unique_nominal.append(c)
        elif uniqueness_ratios[c] > UNIQUENESS_THRESHOLD:
            high_unique_nominal.append(c)
        else:
            not_high_unique_nominal.append(c)

    notcompletely_unique_nominal = [x for x in nominal if x not in completely_unique_nominal]

    # Do Theil's U shortcut for 100% unique nominal (Amy invention that is 99.9% correct, and saves massive time)
    for x in completely_unique_nominal:
        x_index = column_to_index[x]

        corr[x_index, :] = 1.0
        for y in columns:
            y_index = column_to_index[y]
            if x == y:
                corr[y_index][x_index] = 1.0
            # Edge case, guard against ValueError in math.log when the other column is empty
            elif df_cp[y].nunique() == 0:
                corr[y_index][x_index] = 0.0
            else:
                corr[y_index][x_index] = math.log(df_cp[y].nunique()) / math.log(df_cp[x].nunique())

    for x in single_value_columns:
        x_index = column_to_index[x]
        corr[:, x_index] = 0.0
        corr[x_index, :] = 0.0
        corr[x_index, x_index] = 1.0

    # Do nominal-nominal excluding any that are 100% unique (Theil's U)
    scores = Parallel(n_jobs=job_count)(
        delayed(calculate_theils_u)(df_cp[field1], df_cp[field2])
        for field1 in notcompletely_unique_nominal
        for field2 in notcompletely_unique_nominal
    )
    i = 0
    for field1 in notcompletely_unique_nominal:
        field1_index = column_to_index[field1]
        for field2 in notcompletely_unique_nominal:
            field2_index = column_to_index[field2]
            if field1 == field2:
                corr[field1_index][field2_index] = 1.0
            else:
                # looks backward, but is correct
                corr[field2_index][field1_index] = scores[i]
            i += 1

    # Do "not_high_unique_nominal with numeric" (Correlation Ratio)
    scores = Parallel(n_jobs=job_count)(
        delayed(calculate_correlation_ratio)(df_cp[field1], df_cp[field2], opt)
        for field1 in not_high_unique_nominal
        for field2 in numeric_columns
    )
    i = 0
    for field1 in not_high_unique_nominal:
        field1_index = column_to_index[field1]
        for field2 in numeric_columns:
            field2_index = column_to_index[field2]
            corr[field1_index][field2_index] = scores[i]
            corr[field2_index][field1_index] = scores[i]
            i += 1

    # Do high_unique_nominal with numeric (Theil's U) (excluding 100% unique)
    # This fixes the problem of highly unique categorical causing mass instability when using
    # the normal approach of correlation ratio.  Because there are so many categorical buckets
    # many end up with just one number in them, which causes correlation ratio's approach of
    # comparing the mean within buckets to the mean overall to give unstable, over inflated
    # correlation values.  Using Theil's U instead gives a much more realistic score.
    # Because Theil's U is asymmetric, doing x-y and y-x correlation separately.
    # U(nominal|numeric)
    scores_xy = Parallel(n_jobs=job_count)(
        delayed(calculate_theils_u)(df_cp[field1], df_cp[field2])
        for field1 in high_unique_nominal
        for field2 in numeric_columns
    )
    # U(numeric|nominal)
    scores_yx = Parallel(n_jobs=job_count)(
        delayed(calculate_theils_u)(df_cp[field2], df_cp[field1])
        for field1 in high_unique_nominal
        for field2 in numeric_columns
    )
    i = 0
    for field1 in high_unique_nominal:
        field1_index = column_to_index[field1]
        for field2 in numeric_columns:
            field2_index = column_to_index[field2]
            corr[field2_index][field1_index] = scores_xy[i]
            corr[field1_index][field2_index] = scores_yx[i]
            i += 1

    # Do numeric numeric (Pearson's)
    num_len = len(numeric_columns)
    if num_len > 1:
        delayed_calls = []
        for i in range(num_len - 1):
            for j in range(i + 1, num_len):
                delayed_calls.append(
                    delayed(calculate_pearsons_r)(df_cp[numeric_columns[i]], df_cp[numeric_columns[j]], opt)
                )
        scores = Parallel(n_jobs=1)(delayed_calls)
        x = 0
        for i in range(num_len - 1):
            num_columns_index = column_to_index[numeric_columns[i]]
            for j in range(i + 1, num_len):
                num_columns_jindex = column_to_index[numeric_columns[j]]
                corr[num_columns_index][num_columns_jindex] = scores[x][0]
                corr[num_columns_jindex][num_columns_index] = scores[x][0]
                x += 1

    for x in numeric_columns:
        x_index = column_to_index[x]
        corr[x_index][x_index] = 1.0

    corr_final = pd.DataFrame(corr, index=columns, columns=columns)
    corr_final[corr_final == np.inf] = 0
    corr_final.fillna(value=np.nan, inplace=True)

    return corr_final
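
One way to read the shortcut for completely unique nominal columns: if every value of x is distinct, then H(x) = log n and H(x|y) = log n - H(y), so U(x|y) = H(y) / log n. The listing substitutes log(nunique(y)) for H(y), which is exact when the categories of y are equally frequent and an overestimate otherwise. This is an interpretation of the listing rather than a statement from its authors, and the helper names below are hypothetical.

```python
import math
from collections import Counter


def theils_u_sketch(x, y):
    """Exact U(x|y) from the entropy definition (assumes x is not constant)."""
    n = len(x)
    y_counts = Counter(y)
    h_x = -sum(c / n * math.log(c / n) for c in Counter(x).values())
    h_x_given_y = -sum(
        c / n * math.log(c / y_counts[y_val])
        for (_, y_val), c in Counter(zip(x, y)).items()
    )
    return (h_x - h_x_given_y) / h_x


def unique_column_shortcut(x_unique, y):
    """Ratio of log distinct counts, as used in the listing's shortcut."""
    return math.log(len(set(y))) / math.log(len(set(x_unique)))


ids = list(range(8))  # every value distinct
uniform = ["a", "a", "b", "b", "c", "c", "d", "d"]
skewed = ["a"] * 5 + ["b", "c", "d"]

exact_uniform = theils_u_sketch(ids, uniform)
exact_skewed = theils_u_sketch(ids, skewed)
shortcut_uniform = unique_column_shortcut(ids, uniform)
shortcut_skewed = unique_column_shortcut(ids, skewed)
print(exact_uniform, shortcut_uniform)  # both about 0.667
print(exact_skewed, shortcut_skewed)    # about 0.516 versus about 0.667
```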

`normalize_dataset(df)` ¶

Normalize a dataframe for PCA: fill missing values, encode categoricals, and standardize.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Raw dataframe to prepare.	required

Returns:

Type	Description
`DataFrame`	Standardized dataframe (mean 0, std 1) with all columns numeric.

Source code in src/nemo_safe_synthesizer/evaluation/statistics/stats.py

def normalize_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize a dataframe for PCA: fill missing values, encode categoricals, and standardize.

    Args:
        df: Raw dataframe to prepare.

    Returns:
        Standardized dataframe (mean 0, std 1) with all columns numeric.
    """
    df_cp = df.copy()
    # Divide the dataframe into numeric and categorical
    nominal_columns = list(df_cp.select_dtypes(include=["object", "category", "boolean"]).columns)
    int64_dtypes = df_cp.columns[df_cp.dtypes == pd.Int64Dtype()]
    # Convert pandas boolean dtype columns to object to replace possible NaNs with "Missing" values:
    boolean_dtypes = df_cp[nominal_columns].columns[df_cp[nominal_columns].dtypes == pd.BooleanDtype()]
    df_cp[boolean_dtypes] = df_cp[boolean_dtypes].astype("object")
    numeric_columns = []
    for c in df_cp.columns:
        if c not in nominal_columns:
            numeric_columns.append(c)
    df_cat = df_cp.reindex(columns=nominal_columns)
    df_num = df_cp.reindex(columns=numeric_columns)
    df_cat_labels = pd.DataFrame()
    # Fill missing values and encode categorical columns by the frequency of each value
    if len(numeric_columns) > 0:
        if len(int64_dtypes) > 0:
            # pd.Int64Dtype only accepts integers, hence convert any float median to an integer.
            df_num[int64_dtypes] = df_num[int64_dtypes].fillna(df_num[int64_dtypes].median().astype(int))

        df_num = df_num.fillna(df_num.median())

    if len(nominal_columns) > 0:
        df_cat = df_cat.fillna("Missing")
        encoder = CountEncoder()
        df_cat_labels = pd.DataFrame(encoder.fit_transform(df_cat))

    # Merge numeric and categorical back into one dataframe
    if len(nominal_columns) == 0:
        new_df = df_num
    elif len(numeric_columns) == 0:
        new_df = df_cat_labels
    else:
        new_df = pd.concat([df_num, df_cat_labels], axis=1, sort=False)

    # Finally, standardize all values
    all_columns = nominal_columns + numeric_columns
    new_df = pd.DataFrame(StandardScaler().fit_transform(new_df), columns=all_columns)

    return new_df
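
The three preparation steps (fill, encode, standardize) can be sketched per column with the standard library. All helper names are hypothetical; the sketch uses the population standard deviation, which is what `StandardScaler` uses, and represents missing values as `None`.

```python
import statistics
from collections import Counter


def standardize(column):
    """Scale to mean 0 and population std 1 (constant columns stay at 0)."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column) or 1.0
    return [(v - mean) / std for v in column]


def prepare_numeric(column):
    """Fill missing values with the median, then standardize."""
    median = statistics.median(v for v in column if v is not None)
    return standardize([median if v is None else v for v in column])


def prepare_categorical(column):
    """Fill with "Missing", replace each value by its count, then standardize."""
    filled = ["Missing" if v is None else v for v in column]
    counts = Counter(filled)
    return standardize([counts[v] for v in filled])


numeric = prepare_numeric([1.0, None, 3.0])
categorical = prepare_categorical(["a", "b", "a", None])
print(numeric)      # middle value is 0.0 after the median fill
print(categorical)  # [1.0, -1.0, 1.0, -1.0]
```

Count encoding maps categories with equal frequency to the same number, so `"b"` and `"Missing"` become indistinguishable in this example.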

`compute_pca(df, n_components=2)` ¶

Run PCA on a single dataframe after normalization.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Dataframe to decompose.	required
`n_components`	`int`	Number of principal components to keep.	`2`

Returns:

Type	Description
`DataFrame`	Dataframe with columns `pc1`, `pc2`, etc.

Source code in src/nemo_safe_synthesizer/evaluation/statistics/stats.py

def compute_pca(df: pd.DataFrame, n_components: int = 2) -> pd.DataFrame:
    """Run PCA on a single dataframe after normalization.

    Args:
        df: Dataframe to decompose.
        n_components: Number of principal components to keep.

    Returns:
        Dataframe with columns ``pc1``, ``pc2``, etc.
    """
    seed = 444
    df_ = df.replace([np.inf, -np.inf], np.nan, inplace=False).dropna(axis="columns", how="all")

    df_norm = normalize_dataset(df_)
    pca = PCA(n_components=n_components, random_state=seed)
    projected = pca.fit_transform(df_norm)
    columns = [f"pc{i + 1}" for i in range(n_components)]
    return pd.DataFrame(data=projected, columns=columns)

`compute_joined_pcas(reference_df, output_df, n_components=2, include_variance=False)` ¶

Run joined PCA: fit on reference, transform both reference and output.

Parameters:

Name	Type	Description	Default
`reference_df`	`DataFrame`	Training dataframe (used to fit the scaler and PCA).	required
`output_df`	`DataFrame`	Synthetic dataframe (transformed only).	required
`n_components`	`int`	Number of principal components to keep.	`2`
`include_variance`	`bool`	If `True`, column names include explained variance ratios.	`False`

Returns:

Type	Description
`tuple[DataFrame, DataFrame]`	Tuple of (reference PCA dataframe, output PCA dataframe).

Source code in src/nemo_safe_synthesizer/evaluation/statistics/stats.py

def compute_joined_pcas(
    reference_df: pd.DataFrame,
    output_df: pd.DataFrame,
    n_components: int = 2,
    include_variance: bool = False,
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Run joined PCA: fit on reference, transform both reference and output.

    Args:
        reference_df: Training dataframe (used to fit the scaler and PCA).
        output_df: Synthetic dataframe (transformed only).
        n_components: Number of principal components to keep.
        include_variance: If ``True``, column names include explained variance ratios.

    Returns:
        Tuple of (reference PCA dataframe, output PCA dataframe).
    """
    seed = 444

    # Normalize the train and synthetic dataframes to mean 0 and std 1
    sc = StandardScaler()
    reference_norm = sc.fit_transform(reference_df)
    output_norm = sc.transform(output_df)

    pca = PCA(n_components=n_components, random_state=seed)
    projected_reference = pca.fit_transform(reference_norm)
    projected_output = pca.transform(output_norm)

    if include_variance:
        eigenvalues = pca.explained_variance_ratio_
        columns = [f"pc{i + 1} - variance {eigenvalues[i]:.2f}" for i in range(n_components)]
    else:
        columns = [f"pc{i + 1}" for i in range(n_components)]

    reference_pca = pd.DataFrame(data=projected_reference, columns=columns)
    output_pca = pd.DataFrame(data=projected_output, columns=columns)

    return (reference_pca, output_pca)
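
The key point of the joined variant is that both the scaling statistics and the principal axes come from the reference data only. The NumPy sketch below illustrates this under a few assumptions: inputs are already all-numeric arrays (the listing passes the dataframes straight to `StandardScaler`), the helper `joined_pca` is hypothetical, and component signs may differ from scikit-learn's output.

```python
import numpy as np


def joined_pca(reference, output, n_components=2):
    """Fit scaling and principal axes on reference, project both arrays."""
    mean = reference.mean(axis=0)
    std = reference.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled
    ref_norm = (reference - mean) / std
    out_norm = (output - mean) / std  # reuse the reference statistics
    # Rows of vt are the principal axes of the centered reference data.
    _, _, vt = np.linalg.svd(ref_norm, full_matrices=False)
    components = vt[:n_components]
    return ref_norm @ components.T, out_norm @ components.T


rng = np.random.default_rng(0)
reference = rng.normal(size=(50, 3))
output = rng.normal(size=(20, 3))
ref_pca, out_pca = joined_pca(reference, output)
print(ref_pca.shape, out_pca.shape)  # (50, 2) (20, 2)
```

Projecting the synthetic data onto axes fitted on the training data keeps the two scatter plots in the same coordinate system, so differences in shape reflect the data rather than the fit.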

`count_missing(df)` ¶

Count total missing (NaN/null) values across all cells.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Dataframe to inspect.	required

Returns:

Type	Description
`int`	Total number of missing values.

Source code in src/nemo_safe_synthesizer/evaluation/statistics/stats.py

def count_missing(df: pd.DataFrame) -> int:
    """Count total missing (NaN/null) values across all cells.

    Args:
        df: Dataframe to inspect.

    Returns:
        Total number of missing values.
    """
    return int(df.isnull().sum().sum())

`percent_missing(df)` ¶

Compute the percentage of missing values in a dataframe.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Dataframe to inspect.	required

Returns:

Type	Description
`float`	Percentage of missing values in `[0, 100]`.

Source code in src/nemo_safe_synthesizer/evaluation/statistics/stats.py

def percent_missing(df: pd.DataFrame) -> float:
    """Compute the percentage of missing values in a dataframe.

    Args:
        df: Dataframe to inspect.

    Returns:
        Percentage of missing values in ``[0, 100]``.
    """
    r, c = df.shape
    total_cells = r * c
    if total_cells == 0:
        return 0.0
    total_missing = count_missing(df)
    return 100.0 * total_missing / total_cells
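
A small worked example of the percentage, using nested lists in place of a dataframe. The helper `percent_missing_sketch` is hypothetical and treats `None` and float NaN as missing.

```python
import math


def percent_missing_sketch(rows):
    """Percentage of cells that are None or NaN (hypothetical helper)."""
    cells = [value for row in rows for value in row]
    if not cells:
        return 0.0  # empty table: same guard as the listing
    missing = sum(
        1 for v in cells if v is None or (isinstance(v, float) and math.isnan(v))
    )
    return 100.0 * missing / len(cells)


table = [[1.0, None, "x"], [float("nan"), 2.0, "y"]]
pct = percent_missing_sketch(table)  # 2 of 6 cells are missing
print(round(pct, 2))  # 33.33
```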
