dataset_statistics

Classes:

Name               Description
DatasetStatistics  Summary statistics for the training and synthetic datasets.

DatasetStatistics pydantic-model

Bases: Component

Summary statistics for the training and synthetic datasets.

Reports row and column counts, missing-value percentages, and the number of memorized rows (exact row matches between the training and synthetic data). This component does not produce a numeric score; it only provides context for the HTML report.

Fields:

training_rows = 0 pydantic-field

Row count of the training dataframe used for evaluation.

training_cols = 0 pydantic-field

Column count of the training dataframe used for evaluation.

training_missing = 0 pydantic-field

Percentage of missing values in the training dataframe, stored as an integer.

synthetic_rows = 0 pydantic-field

Row count of the synthetic dataframe used for evaluation.

synthetic_cols = 0 pydantic-field

Column count of the synthetic dataframe used for evaluation.

synthetic_missing = 0 pydantic-field

Percentage of missing values in the synthetic dataframe, stored as an integer.

memorized_lines = 0 pydantic-field

Number of exact row matches between training and synthetic.
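As a rough illustration of what the fields above hold, here is a minimal sketch that computes the same kinds of values for two small tables. It deliberately uses plain tuples instead of dataframes so it runs without third-party packages. The helper names `percent_missing` and `count_memorized` are hypothetical and are not the library's implementation. In particular, whether the library counts every matching synthetic row or only distinct matches is not stated on this page, and the sketch assumes the former.

```python
from typing import Sequence, Tuple

Row = Tuple[object, ...]  # one record; None marks a missing cell


def percent_missing(rows: Sequence[Row]) -> float:
    """Share of None cells across the whole table, as a percentage."""
    total_cells = sum(len(row) for row in rows)
    if total_cells == 0:
        return 0.0
    missing_cells = sum(1 for row in rows for cell in row if cell is None)
    return 100.0 * missing_cells / total_cells


def count_memorized(training: Sequence[Row], synthetic: Sequence[Row]) -> int:
    """Count synthetic rows that also appear verbatim in the training rows."""
    seen = set(training)
    return sum(1 for row in synthetic if row in seen)


# Toy data, invented for this example only.
training = [
    ("alice", 34, "NY"),
    ("bob", None, "LA"),
    ("carol", 29, None),
    ("dan", 41, "SF"),
]
synthetic = [
    ("alice", 34, "NY"),  # verbatim copy of a training row
    ("erin", 30, "TX"),
    ("frank", None, "LA"),
]

summary = {
    "training_rows": len(training),
    "training_cols": len(training[0]),
    "training_missing": percent_missing(training),
    "synthetic_rows": len(synthetic),
    "synthetic_cols": len(synthetic[0]),
    "synthetic_missing": percent_missing(synthetic),
    "memorized_lines": count_memorized(training, synthetic),
}
print(summary)
```

The sketch keeps the missing-value percentages as floats, whereas the fields above default to `0` and are populated with integers.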

jinja_context cached property

Template context merging all dataset summary fields into the base context.
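The merge described above might look roughly like the following sketch. It uses a standard-library dataclass and `functools.cached_property` in place of the pydantic model. The class `StatsSketch`, the method `_base_context`, and the `"component"` key are all hypothetical, since the real base context of `Component` is not shown on this page.

```python
from dataclasses import asdict, dataclass
from functools import cached_property


@dataclass
class StatsSketch:
    """Hypothetical stand-in for DatasetStatistics with a subset of its fields."""

    training_rows: int = 0
    synthetic_rows: int = 0
    memorized_lines: int = 0

    def _base_context(self) -> dict:
        # Hypothetical base context; the real keys are an assumption.
        return {"component": "dataset_statistics"}

    @cached_property
    def jinja_context(self) -> dict:
        # Start from the base context, then add every summary field on top.
        context = dict(self._base_context())
        context.update(asdict(self))
        return context


stats = StatsSketch(training_rows=1000, synthetic_rows=1000, memorized_lines=3)
context = stats.jinja_context
print(context)
```

With `functools.cached_property`, the dictionary is built once per instance and reused on later accesses, so field changes made after the first access would not show up in it.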

from_evaluation_datasets(evaluation_datasets, config=None) staticmethod

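Compute summary statistics from an EvaluationDatasets instance, which supplies both the training and synthetic data. The config argument is accepted, but the source listed here does not use it.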

Source code in src/nemo_safe_synthesizer/evaluation/components/dataset_statistics.py
@staticmethod
def from_evaluation_datasets(
    evaluation_datasets: EvaluationDatasets, config: SafeSynthesizerParameters | None = None
) -> DatasetStatistics:
    """Compute summary statistics from the evaluation dataset."""
    return DatasetStatistics(
        score=EvaluationScore(),
        training_rows=evaluation_datasets.training_rows,
        training_cols=evaluation_datasets.training_cols,
        training_missing=int(stats.percent_missing(evaluation_datasets.training)),
        synthetic_rows=evaluation_datasets.synthetic_rows,
        synthetic_cols=evaluation_datasets.synthetic_cols,
        synthetic_missing=int(stats.percent_missing(evaluation_datasets.synthetic)),
        memorized_lines=evaluation_datasets.memorized_lines,
    )
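One detail worth noting in the source above: both missing-value percentages are wrapped in `int(...)`. Assuming `stats.percent_missing` returns a float percentage (its return type is not shown on this page), Python's `int()` drops the fractional part rather than rounding. The small sketch below, with a hypothetical helper `whole_percent`, contrasts the two behaviours.

```python
def whole_percent(value: float) -> int:
    """Mirror the int(...) wrapping in the source: drop the fractional part."""
    return int(value)


examples = [0.4, 0.99, 16.67, 49.6]

truncated = [whole_percent(v) for v in examples]
rounded = [round(v) for v in examples]

print(truncated)  # [0, 0, 16, 49]
print(rounded)    # [0, 1, 17, 50]
```

Under that assumption, a dataset with fewer than 1% missing values would be reported as 0.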