stats
Statistical functions for synthetic data evaluation.
Provides distribution comparison (Jensen-Shannon), correlation matrix computation (Pearson, Theil's U, Correlation Ratio), PCA, and data-overlap detection used by the evaluation components.
Functions:
| Name | Description |
|---|---|
| count_memorized_lines | Count exact row matches between training and synthetic data. |
| get_categorical_field_distribution | Compute the normalized value-count distribution of a categorical column. |
| get_numeric_distribution_bins | Compute shared histogram bin edges for two numeric series. |
| get_numeric_field_distribution | Compute the normalized distribution of a numeric column cut into bins. |
| compute_distribution_distance | Compute the Jensen-Shannon distance between two distributions. |
| calculate_pearsons_r | Compute the Pearson correlation coefficient for a column pair. |
| calculate_correlation_ratio | Compute the Correlation Ratio for a categorical-numeric column pair. |
| calculate_theils_u | Compute Theil's U (uncertainty coefficient) for two categorical columns. |
| calculate_correlation | Build a full correlation matrix using Pearson, Theil's U, and Correlation Ratio. |
| normalize_dataset | Normalize a dataframe for PCA: fill missing values, encode categoricals, and standardize. |
| compute_pca | Run PCA on a single dataframe after normalization. |
| compute_joined_pcas | Run joined PCA: fit on reference, transform both reference and output. |
| count_missing | Count total missing (NaN/null) values across all cells. |
| percent_missing | Compute the percentage of missing values in a dataframe. |
count_memorized_lines(df1, df2)
Count exact row matches between training and synthetic data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df1 | DataFrame | Training dataframe. | required |
| df2 | DataFrame | Synthetic dataframe. | required |
Returns:
| Type | Description |
|---|---|
| int | Number of rows present in both dataframes after deduplication. |
Source code in src/nemo_safe_synthesizer/evaluation/statistics/stats.py
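As a rough illustration of what "exact row matches after deduplication" means, the sketch below counts the distinct rows shared by two tables using plain Python tuples. It is a dependency-free stand-in rather than the library's implementation; `count_shared_rows` and the sample rows are hypothetical.

```python
def count_shared_rows(train_rows, synth_rows):
    """Count distinct rows that appear in both collections (hypothetical helper)."""
    # Building sets drops duplicate rows on each side before comparing.
    train_unique = {tuple(row) for row in train_rows}
    synth_unique = {tuple(row) for row in synth_rows}
    return len(train_unique & synth_unique)


train = [("alice", 30), ("bob", 25), ("bob", 25)]
synth = [("bob", 25), ("carol", 41), ("bob", 25)]
shared = count_shared_rows(train, synth)
print(shared)  # 1: ("bob", 25) is counted once despite repeating on both sides
```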
get_categorical_field_distribution(field)
Compute the normalized value-count distribution of a categorical column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| field | Series | Column series to analyze. | required |
Returns:
| Type | Description |
|---|---|
| dict | Mapping of |
Source code in src/nemo_safe_synthesizer/evaluation/statistics/stats.py
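A normalized value-count distribution can be sketched with the standard library alone. The helper name `categorical_distribution` is hypothetical, and how the real function treats missing values is not stated here, so this sketch simply counts every value it is given.

```python
from collections import Counter


def categorical_distribution(values):
    """Map each distinct value to its share of the column (hypothetical helper)."""
    counts = Counter(values)
    total = sum(counts.values())
    # Dividing by the total turns raw counts into a probability vector.
    return {value: count / total for value, count in counts.items()}


dist = categorical_distribution(["a", "b", "a", "a"])
print(dist)  # {'a': 0.75, 'b': 0.25}
```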
get_numeric_distribution_bins(training, synthetic)
Compute shared histogram bin edges for two numeric series.
Uses the "doane" strategy on the combined data, falling back to 500 fixed bins if the result is empty or too large.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| training | Series | Numeric series from the training dataframe. | required |
| synthetic | Series | Numeric series from the synthetic dataframe. | required |
Returns:
| Type | Description |
|---|---|
| | Array of bin edges spanning both series. |
Source code in src/nemo_safe_synthesizer/evaluation/statistics/stats.py
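The idea of shared bins can be sketched as follows: pool both series, pick a bin count with Doane's formula, and lay equal-width edges over the combined range. This is a pure-Python approximation under stated assumptions, not the library's code. The names `doane_bin_count` and `shared_bin_edges` are hypothetical, and the `max_bins` cutoff is an assumed stand-in for "too large", which the text above does not quantify.

```python
import math


def doane_bin_count(values):
    """Bin count from Doane's formula: 1 + log2(n) + log2(1 + |g1| / sigma_g1)."""
    n = len(values)
    if n <= 2:
        return 1
    mean = sum(values) / n
    sigma = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if sigma == 0:
        return 1
    skew = sum(((v - mean) / sigma) ** 3 for v in values) / n
    sigma_g1 = math.sqrt(6.0 * (n - 2) / ((n + 1) * (n + 3)))
    return math.ceil(1 + math.log2(n) + math.log2(1 + abs(skew) / sigma_g1))


def shared_bin_edges(training, synthetic, max_bins=500, fallback_bins=500):
    """Equal-width edges over the pooled range (max_bins is an assumed cutoff)."""
    combined = list(training) + list(synthetic)
    low, high = min(combined), max(combined)
    count = doane_bin_count(combined)
    if count < 1 or count > max_bins:
        count = fallback_bins  # fixed fallback, mirroring the 500 bins described above
    width = (high - low) / count
    return [low + i * width for i in range(count)] + [high]


training = [1.0, 2.0, 3.0, 4.0]
synthetic = [2.0, 3.0, 4.0, 10.0]
edges = shared_bin_edges(training, synthetic)
```

Because the edges come from the pooled data, both series can be binned against the same grid and compared bin by bin.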
get_numeric_field_distribution(field, bins)
Compute the normalized distribution of a numeric column cut into bins.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| field | Series | Numeric column series. | required |
| bins | | Bin edges (typically from | required |
Returns:
| Type | Description |
|---|---|
| dict | Mapping of |
Source code in src/nemo_safe_synthesizer/evaluation/statistics/stats.py
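Cutting a numeric column into bins and normalizing the counts might look like the sketch below. The helper `binned_distribution` is hypothetical, and keying the result by bin index is an assumption, since the return description above is truncated.

```python
from bisect import bisect_right


def binned_distribution(values, edges):
    """Share of values falling into each bin defined by sorted edges."""
    counts = [0] * (len(edges) - 1)
    for value in values:
        if value < edges[0] or value > edges[-1]:
            continue  # values outside the edges are ignored in this sketch
        # The last bin is closed on the right so the maximum edge is included.
        index = min(bisect_right(edges, value) - 1, len(counts) - 1)
        counts[index] += 1
    total = sum(counts)
    return {i: c / total for i, c in enumerate(counts)} if total else {}


binned = binned_distribution([1, 2, 2, 9], [0, 5, 10])
print(binned)  # {0: 0.75, 1: 0.25}
```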
compute_distribution_distance(d1, d2)
Compute the Jensen-Shannon distance between two distributions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| d1 | dict | First distribution dict (values are a probability vector). | required |
| d2 | dict | Second distribution dict. | required |
Returns:
| Type | Description |
|---|---|
| float | JS distance in |
| float | sums to zero. |
Source code in src/nemo_safe_synthesizer/evaluation/statistics/stats.py
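For reference, the Jensen-Shannon distance is the square root of the Jensen-Shannon divergence. The sketch below computes it in pure Python with base-2 logarithms, which bounds the result to [0, 1]; the base used by the library is an assumption here. The helper name `js_distance` is hypothetical, and raising on a zero-sum input is this sketch's own choice, because the documented zero-sum behavior is truncated above.

```python
import math


def js_distance(d1, d2):
    """Jensen-Shannon distance between two dict-based distributions (base 2)."""
    keys = list(set(d1) | set(d2))
    p = [d1.get(k, 0.0) for k in keys]
    q = [d2.get(k, 0.0) for k in keys]
    total_p, total_q = sum(p), sum(q)
    if total_p == 0 or total_q == 0:
        raise ValueError("each distribution needs positive total mass")  # sketch-only choice
    p = [v / total_p for v in p]
    q = [v / total_q for v in q]
    mixture = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, m):
        # Terms with zero probability contribute nothing to the divergence.
        return sum(x * math.log2(x / y) for x, y in zip(a, m) if x > 0)

    divergence = 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)
    return math.sqrt(max(divergence, 0.0))


same = js_distance({"a": 0.5, "b": 0.5}, {"a": 0.5, "b": 0.5})
disjoint = js_distance({"a": 1.0}, {"b": 1.0})
partial = js_distance({"a": 0.9, "b": 0.1}, {"a": 0.5, "b": 0.5})
```

Identical distributions give a distance of 0, and distributions with no shared support give the maximum of 1 under base 2.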
calculate_pearsons_r(x, y, opt)
Compute the Pearson correlation coefficient for a column pair.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| x | Series \| ndarray | First input array. | required |
| y | Series \| ndarray | Second input array. | required |
| opt | bool | If | required |
Returns:
| Type | Description |
|---|---|
| tuple[float, float] | Tuple of (Pearson r, two-tailed p-value). |
Source code in src/nemo_safe_synthesizer/evaluation/statistics/stats.py
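The coefficient itself is the covariance of the two columns divided by the product of their standard deviations. A minimal pure-Python sketch follows; the helper `pearson_r` is hypothetical, it omits the p-value that the real function also returns, and it does not guard against constant inputs.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(spread_x * spread_y)


rising = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
falling = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
```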
calculate_correlation_ratio(x, y, opt)
Compute the Correlation Ratio for a categorical-numeric column pair.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| x | Series | Categorical input array. | required |
| y | Series | Numeric input array. | required |
| opt | bool | If | required |
Returns:
| Type | Description |
|---|---|
| float | Correlation ratio in |
Source code in src/nemo_safe_synthesizer/evaluation/statistics/stats.py
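The correlation ratio compares the spread of per-category means with the total spread of the numeric column. The sketch below returns the square-root form (eta); whether the library returns eta or eta squared is not stated above, so treat that as an assumption. The helper name `correlation_ratio` is hypothetical.

```python
import math
from collections import defaultdict


def correlation_ratio(categories, values):
    """Eta: sqrt(between-group sum of squares / total sum of squares)."""
    groups = defaultdict(list)
    for category, value in zip(categories, values):
        groups[category].append(value)
    overall_mean = sum(values) / len(values)
    between = sum(
        len(group) * (sum(group) / len(group) - overall_mean) ** 2
        for group in groups.values()
    )
    total = sum((value - overall_mean) ** 2 for value in values)
    # A constant numeric column has no variance to explain.
    return math.sqrt(between / total) if total else 0.0


explained = correlation_ratio(["a", "a", "b", "b"], [1, 1, 3, 3])
unexplained = correlation_ratio(["a", "a", "b", "b"], [1, 3, 1, 3])
```

When the category fully determines the numeric value the ratio is 1, and when every category has the same mean it is 0.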
calculate_theils_u(x, y)
Compute Theil's U (uncertainty coefficient) for two categorical columns.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| x | | First categorical array. | required |
| y | | Second categorical array. | required |
Returns:
| Type | Description |
|---|---|
| | Theil's U in |
Source code in src/nemo_safe_synthesizer/evaluation/statistics/stats.py
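Theil's U measures how much knowing one column reduces the entropy of the other: U(x|y) = (H(x) - H(x|y)) / H(x). The pure-Python sketch below follows that definition; the helper names are hypothetical, and which argument the library conditions on is an assumption. Returning 1.0 when H(x) is zero is a common convention, not something the text above confirms.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (natural log) of a sequence of labels."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """H(x | y) computed from joint and marginal counts."""
    n = len(x)
    joint = Counter(zip(x, y))
    y_counts = Counter(y)
    return sum((c / n) * math.log(y_counts[y_value] / c) for (_, y_value), c in joint.items())


def theils_u(x, y):
    """Fraction of the uncertainty in x that is removed by knowing y."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0  # assumed convention: a constant column is fully "explained"
    return (h_x - conditional_entropy(x, y)) / h_x


fine = ["a", "b", "c", "d"]
coarse = ["p", "p", "q", "q"]
u_fine_given_coarse = theils_u(fine, coarse)
u_coarse_given_fine = theils_u(coarse, fine)
```

The two results differ, which shows that the measure is asymmetric: the fine-grained column fully determines the coarse one, but not the other way round.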
calculate_correlation(df, nominal_columns=None, job_count=_DEFAULT_JOB_COUNT, opt=False)
Build a full correlation matrix using Pearson, Theil's U, and Correlation Ratio.
Numeric-numeric pairs use Pearson's r, categorical-categorical pairs use Theil's U, and categorical-numeric pairs use Correlation Ratio (or Theil's U for highly-unique categoricals).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input dataframe. | required |
| nominal_columns | list[str] \| None | Columns to treat as categorical. | None |
| job_count | int | Number of parallel jobs for pairwise computations. | _DEFAULT_JOB_COUNT |
| opt | bool | If | False |
Returns:
| Type | Description |
|---|---|
| DataFrame | Square correlation dataframe indexed and columned by |
Source code in src/nemo_safe_synthesizer/evaluation/statistics/stats.py
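The per-pair choice of measure described above can be sketched as a small dispatch table. The helper `choose_measure` and the column names are hypothetical, and the fallback to Theil's U for highly-unique categoricals is left out because its cutoff is not given in the text.

```python
def choose_measure(left_is_categorical, right_is_categorical):
    """Pick the association measure for one column pair (hypothetical helper)."""
    if left_is_categorical and right_is_categorical:
        return "theils_u"
    if not left_is_categorical and not right_is_categorical:
        return "pearson"
    return "correlation_ratio"  # mixed categorical-numeric pair


# Column name -> whether it is treated as categorical (illustrative data).
columns = {"age": False, "income": False, "city": True, "plan": True}
measure_plan = {
    (left, right): choose_measure(columns[left], columns[right])
    for left in columns
    for right in columns
    if left != right
}
```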
normalize_dataset(df)
Normalize a dataframe for PCA: fill missing values, encode categoricals, and standardize.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Raw dataframe to prepare. | required |
Returns:
| Type | Description |
|---|---|
| DataFrame | Standardized dataframe (mean 0, std 1) with all columns numeric. |
Source code in src/nemo_safe_synthesizer/evaluation/statistics/stats.py
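The three steps (fill, encode, standardize) can be sketched on plain Python columns. The fill and encoding strategies of the real function are not stated above, so this sketch assumes mean imputation for numeric columns and integer codes for categoricals, with -1 for a missing category. The helper `normalize_columns` is hypothetical.

```python
import math


def normalize_columns(columns):
    """Return z-scored numeric columns from a dict of name -> list of values."""
    normalized = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        if all(isinstance(v, (int, float)) for v in present):
            # Assumption: numeric gaps are filled with the column mean.
            fill = sum(present) / len(present) if present else 0.0
            numeric = [fill if v is None else float(v) for v in values]
        else:
            # Assumption: categories become integer codes, missing becomes -1.
            codes = {cat: i for i, cat in enumerate(sorted(set(present)))}
            numeric = [float(codes.get(v, -1)) for v in values]
        mean = sum(numeric) / len(numeric)
        std = math.sqrt(sum((v - mean) ** 2 for v in numeric) / len(numeric))
        # Constant columns are mapped to zeros to avoid dividing by zero.
        normalized[name] = [(v - mean) / std if std else 0.0 for v in numeric]
    return normalized


normalized = normalize_columns({"age": [20.0, None, 40.0], "city": ["x", "y", "x"]})
```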
compute_pca(df, n_components=2)
Run PCA on a single dataframe after normalization.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Dataframe to decompose. | required |
| n_components | int | Number of principal components to keep. | 2 |
Returns:
| Type | Description |
|---|---|
| DataFrame | Dataframe with columns |
Source code in src/nemo_safe_synthesizer/evaluation/statistics/stats.py
compute_joined_pcas(reference_df, output_df, n_components=2, include_variance=False)
Run joined PCA: fit on reference, transform both reference and output.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| reference_df | DataFrame | Training dataframe (used to fit the scaler and PCA). | required |
| output_df | DataFrame | Synthetic dataframe (transformed only). | required |
| n_components | int | Number of principal components to keep. | 2 |
| include_variance | bool | If | False |
Returns:
| Type | Description |
|---|---|
| tuple[DataFrame, DataFrame] | Tuple of (reference PCA dataframe, output PCA dataframe). |
Source code in src/nemo_safe_synthesizer/evaluation/statistics/stats.py
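The "fit on reference, transform both" pattern can be illustrated without any dependencies. The sketch below fits a standardizer and a single principal component on the reference rows only, then projects both tables with those fitted values. It is a simplification under stated assumptions: it keeps one component (found by power iteration) instead of the default two, and all helper names and sample values are hypothetical.

```python
import math


def fit_standardizer(rows):
    """Per-column mean and population std, learned from the reference rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) or 1.0 for j in range(d)]
    return means, stds


def standardize(rows, means, stds):
    return [[(r[j] - means[j]) / stds[j] for j in range(len(means))] for r in rows]


def first_component(rows, iterations=200):
    """Leading eigenvector of the covariance matrix via power iteration."""
    n, d = len(rows), len(rows[0])
    cov = [[sum(r[i] * r[j] for r in rows) / n for j in range(d)] for i in range(d)]
    vec = [1.0 / (k + 1) for k in range(d)]  # fixed, non-symmetric start vector
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0:
            break
        vec = [v / norm for v in nxt]
    return vec


def project(rows, component):
    return [sum(a * b for a, b in zip(r, component)) for r in rows]


reference = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
output = [[1.5, 3.0], [3.5, 7.0]]

means, stds = fit_standardizer(reference)          # fitted on reference only
component = first_component(standardize(reference, means, stds))
reference_scores = project(standardize(reference, means, stds), component)
output_scores = project(standardize(output, means, stds), component)
```

Because the synthetic rows reuse the reference scaling and component, both sets of scores live in the same coordinate system and can be overlaid directly.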
count_missing(df)
Count total missing (NaN/null) values across all cells.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Dataframe to inspect. | required |
Returns:
| Type | Description |
|---|---|
| int | Total number of missing values. |
Source code in src/nemo_safe_synthesizer/evaluation/statistics/stats.py
percent_missing(df)
Compute the percentage of missing values in a dataframe.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Dataframe to inspect. | required |
Returns:
| Type | Description |
|---|---|
| float | Percentage of missing values in |
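Counting missing cells and turning the count into a percentage can be sketched on a list of rows. The helper names are hypothetical, and expressing the result on a 0 to 100 scale is an assumption, since the documented range is truncated above.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def count_missing_cells(rows):
    return sum(is_missing(value) for row in rows for value in row)


def percent_missing_cells(rows):
    total = sum(len(row) for row in rows)
    # An empty table has nothing missing; avoid dividing by zero.
    return 100.0 * count_missing_cells(rows) / total if total else 0.0


table = [[1, None], [float("nan"), 4]]
missing = count_missing_cells(table)
percent = percent_missing_cells(table)
print(missing, percent)  # 2 50.0
```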