dataset_statistics

`dataset_statistics`

Classes:

Name	Description
`DatasetStatistics`	Summary statistics for the training and synthetic datasets.

`DatasetStatistics` `pydantic-model`

Bases: Component

Summary statistics for the training and synthetic datasets.

Reports row and column counts and missing-value percentages for each dataset, plus the number of memorized (verbatim-repeated) rows. This component does not produce a numeric score (its `score` field is set to an empty `EvaluationScore()`); it only provides context for the HTML report.

Fields:

score (EvaluationScore)
name (str)
training_rows (int)
training_cols (int)
training_missing (int)
synthetic_rows (int)
synthetic_cols (int)
synthetic_missing (int)
memorized_lines (int)

`training_rows = 0` `pydantic-field`

Row count of the training dataframe used for evaluation.

`training_cols = 0` `pydantic-field`

Column count of the training dataframe used for evaluation.

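`training_missing = 0` `pydantic-field`

Percentage of missing values in the training dataframe, stored as an integer (the computed percentage is cast with `int(...)`, so any fractional part is dropped).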
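The helper that produces this value, `stats.percent_missing`, is not shown on this page. The following is only a minimal sketch of one plausible definition, assuming a pandas dataframe and counting missing cells over all cells; the name `percent_missing_sketch` is hypothetical and is not part of the library.

```python
import pandas as pd


def percent_missing_sketch(df: pd.DataFrame) -> float:
    """Hypothetical stand-in: share of missing cells, as a percentage."""
    if df.size == 0:
        # Avoid division by zero for an empty dataframe.
        return 0.0
    missing_cells = int(df.isna().to_numpy().sum())
    return 100.0 * missing_cells / df.size


# Toy data: 2 missing cells out of 8.
df = pd.DataFrame({"a": [1.0, None, 3.0, 4.0], "b": ["x", "y", None, "z"]})
pct = percent_missing_sketch(df)
stored = int(pct)  # the field is an int, so the value is cast before storing
```

Whether the real helper counts cells, rows, or columns is an assumption here; only the final `int(...)` cast is taken from the source.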

`synthetic_rows = 0` `pydantic-field`

Row count of the synthetic dataframe used for evaluation.

`synthetic_cols = 0` `pydantic-field`

Column count of the synthetic dataframe used for evaluation.

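`synthetic_missing = 0` `pydantic-field`

Percentage of missing values in the synthetic dataframe, stored as an integer (the computed percentage is cast with `int(...)`, so any fractional part is dropped).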

`memorized_lines = 0` `pydantic-field`

Number of exact row matches between the training and synthetic dataframes.
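How this count is computed is not shown here; the value is simply read from `evaluation_datasets.memorized_lines`. As a rough sketch of what counting exact row matches can look like, assuming pandas dataframes with identical columns and dtypes, an inner merge against the de-duplicated training rows is one option. The name `count_exact_matches` is hypothetical.

```python
import pandas as pd


def count_exact_matches(training: pd.DataFrame, synthetic: pd.DataFrame) -> int:
    """Hypothetical sketch: count synthetic rows that also occur in training."""
    # De-duplicate training so each synthetic row matches at most once.
    unique_training = training.drop_duplicates()
    # With no `on=` argument, merge joins on all shared column names.
    matches = synthetic.merge(unique_training, how="inner")
    return len(matches)


# Toy data: the first and third synthetic rows repeat training rows verbatim.
training = pd.DataFrame({"age": [31, 45, 52], "city": ["Oslo", "Lima", "Pune"]})
synthetic = pd.DataFrame({"age": [31, 40, 52], "city": ["Oslo", "Lima", "Pune"]})
n_memorized = count_exact_matches(training, synthetic)
```

This version counts each matching synthetic row once, including repeats within the synthetic data; the library may define the count differently.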

`jinja_context` `cached` `property`

Template context merging all dataset summary fields into the base context.
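The pattern described here, a cached property that layers the summary fields on top of a base context, can be sketched with the standard library alone. Everything below is a hypothetical stand-in: `ComponentSketch`, `DatasetStatisticsSketch`, `base_context`, and the default `name` are illustration-only names, and plain dataclasses are used instead of the real pydantic model.

```python
from dataclasses import asdict, dataclass
from functools import cached_property


@dataclass
class ComponentSketch:
    """Hypothetical stand-in for the base component."""

    name: str = "Dataset Statistics"

    def base_context(self) -> dict:
        return {"name": self.name}


@dataclass
class DatasetStatisticsSketch(ComponentSketch):
    training_rows: int = 0
    synthetic_rows: int = 0
    memorized_lines: int = 0

    @cached_property
    def jinja_context(self) -> dict:
        # Start from the base context, then layer the summary fields on top.
        context = self.base_context()
        context.update(asdict(self))
        return context


stats = DatasetStatisticsSketch(training_rows=1000, synthetic_rows=950, memorized_lines=3)
context = stats.jinja_context  # computed on first access, then reused
```

Because the property is cached, later changes to the fields would not be reflected in an already-computed context, which suits a report that is rendered once.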

`from_evaluation_datasets(evaluation_datasets, config=None)` `staticmethod`

Compute summary statistics from the evaluation datasets. The optional `config` argument is accepted but is not used in the source shown below.

Source code in src/nemo_safe_synthesizer/evaluation/components/dataset_statistics.py

```python
@staticmethod
def from_evaluation_datasets(
    evaluation_datasets: EvaluationDatasets, config: SafeSynthesizerParameters | None = None
) -> DatasetStatistics:
    """Compute summary statistics from the evaluation dataset."""
    return DatasetStatistics(
        score=EvaluationScore(),
        training_rows=evaluation_datasets.training_rows,
        training_cols=evaluation_datasets.training_cols,
        training_missing=int(stats.percent_missing(evaluation_datasets.training)),
        synthetic_rows=evaluation_datasets.synthetic_rows,
        synthetic_cols=evaluation_datasets.synthetic_cols,
        synthetic_missing=int(stats.percent_missing(evaluation_datasets.synthetic)),
        memorized_lines=evaluation_datasets.memorized_lines,
    )
```
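One detail in the code above is worth spelling out: both missing-value percentages pass through `int(...)`, which in Python truncates toward zero rather than rounding. A small illustration with made-up values, assuming `stats.percent_missing` returns a float on a 0 to 100 scale (the return type is not documented on this page):

```python
# Made-up percentages, only to illustrate the int() cast used above.
raw_percentages = [0.4, 12.9, 99.99]

truncated = [int(p) for p in raw_percentages]  # drops the fractional part
rounded = [round(p) for p in raw_percentages]  # shown for comparison only
```

Under that assumption, a dataset with fewer than 1% missing values would be reported as 0 in the summary.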
