dataset_statistics

`dataset_statistics` ¶

Classes:

Name	Description
`DatasetStatistics`	Summary statistics for the reference and output datasets.

`DatasetStatistics` `pydantic-model` ¶

Bases: Component

Summary statistics for the reference and output datasets.

Reports row and column counts, missing-value percentages, and the number of memorized (verbatim-repeated) rows. This component does not produce a numeric score; it provides context for the HTML report.

Fields:

score (EvaluationScore)
name (str)
reference_rows (int)
reference_cols (int)
reference_missing (int)
output_rows (int)
output_cols (int)
output_missing (int)
memorized_lines (int)

`reference_rows = 0` `pydantic-field` ¶

Row count of the reference dataframe used for evaluation.

`reference_cols = 0` `pydantic-field` ¶

Column count of the reference dataframe used for evaluation.

`reference_missing = 0` `pydantic-field` ¶

Percentage of missing values in the reference dataframe, stored as an integer (the factory method passes the computed value through `int()`).

`output_rows = 0` `pydantic-field` ¶

Row count of the output dataframe used for evaluation.

`output_cols = 0` `pydantic-field` ¶

Column count of the output dataframe used for evaluation.

`output_missing = 0` `pydantic-field` ¶

Percentage of missing values in the output dataframe, stored as an integer (the factory method passes the computed value through `int()`).
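
The `reference_missing` and `output_missing` fields are integers, and the factory method in the source listing wraps `stats.percent_missing(...)` in `int()`. The following is a minimal sketch of what that conversion implies, not the library's implementation: `percent_missing` here is a hypothetical stand-in, rows are plain dicts with `None` marking a missing cell, and a 0-100 scale is assumed.

```python
from typing import Any, Optional

Row = dict[str, Optional[Any]]


def percent_missing(rows: list[Row]) -> float:
    """Hypothetical stand-in: share of None cells, on an assumed 0-100 scale."""
    total_cells = sum(len(row) for row in rows)
    if total_cells == 0:
        return 0.0
    missing_cells = sum(1 for row in rows for value in row.values() if value is None)
    return 100.0 * missing_cells / total_cells


reference = [
    {"age": 34, "city": "Oslo"},
    {"age": None, "city": "Lima"},
    {"age": 51, "city": "Pune"},
]

raw = percent_missing(reference)  # 1 of 6 cells is missing
stored = int(raw)                 # int() truncates toward zero, it does not round
```

Under these assumptions the stored value is the truncated percentage, so a dataset with a small but nonzero share of missing cells (below 1 on this scale) would be reported as 0.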

`memorized_lines = 0` `pydantic-field` ¶

Number of exact row matches between reference and output.
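
One way to read "exact row matches" is sketched below. This is an assumption-laden illustration, not the library's code: `count_memorized_rows` is a hypothetical helper, rows are plain dicts with hashable values, and each output row that equals some reference row on every column is counted once. How the library itself handles duplicate rows or dtype differences is not stated here.

```python
def count_memorized_rows(reference: list[dict], output: list[dict]) -> int:
    """Hypothetical helper: count output rows that equal a reference row exactly."""
    # Freeze each reference row into a hashable tuple of (column, value) pairs,
    # sorted by column name so key order does not affect the comparison.
    seen = {tuple(sorted(row.items())) for row in reference}
    return sum(1 for row in output if tuple(sorted(row.items())) in seen)


reference = [{"name": "Ada", "age": 36}, {"name": "Linus", "age": 28}]
output = [
    {"name": "Ada", "age": 36},    # verbatim copy of a reference row
    {"name": "Ada", "age": 37},    # differs in one column, so not a match
    {"age": 28, "name": "Linus"},  # same values, different key order
]

memorized = count_memorized_rows(reference, output)
```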

`jinja_context` `cached` `property` ¶

Template context merging all dataset summary fields into the base context.
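
The merge-and-cache pattern described above can be sketched with standard-library tools. Everything in this block is a stand-in: `ComponentSketch`, `DatasetStatisticsSketch`, the default `name` value, and the contents of the base context are hypothetical, and the real `Component` base class may build its context differently.

```python
from dataclasses import asdict, dataclass
from functools import cached_property


@dataclass
class ComponentSketch:
    """Hypothetical stand-in for the Component base class."""
    name: str = "Dataset Statistics"

    @property
    def jinja_context(self) -> dict:
        # Assumed base context: only the component name.
        return {"name": self.name}


@dataclass
class DatasetStatisticsSketch(ComponentSketch):
    reference_rows: int = 0
    output_rows: int = 0
    memorized_lines: int = 0

    @cached_property
    def jinja_context(self) -> dict:
        # Start from the base context, then layer every summary field on top.
        context = dict(super().jinja_context)
        context.update(asdict(self))
        return context


stats = DatasetStatisticsSketch(reference_rows=100, output_rows=80, memorized_lines=3)
context = stats.jinja_context
```

Because the property is cached, a second access returns the same dict object instead of rebuilding it, which suits a value that is only read when the report template is rendered.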

`from_evaluation_dataset(evaluation_dataset, config=None)` `staticmethod` ¶

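Compute summary statistics from the evaluation dataset. In the source shown below, `score` is set to a default `EvaluationScore()` and the `config` argument is accepted but not used.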

Source code in src/nemo_safe_synthesizer/evaluation/components/dataset_statistics.py

@staticmethod
def from_evaluation_dataset(
    evaluation_dataset: EvaluationDataset, config: SafeSynthesizerParameters | None = None
) -> DatasetStatistics:
    """Compute summary statistics from the evaluation dataset."""
    return DatasetStatistics(
        score=EvaluationScore(),
        reference_rows=evaluation_dataset.reference_rows,
        reference_cols=evaluation_dataset.reference_cols,
        reference_missing=int(stats.percent_missing(evaluation_dataset.reference)),
        output_rows=evaluation_dataset.output_rows,
        output_cols=evaluation_dataset.output_cols,
        output_missing=int(stats.percent_missing(evaluation_dataset.output)),
        memorized_lines=evaluation_dataset.memorized_lines,
    )
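
To show how the factory's output might look on a small input, here is a self-contained sketch built from stand-ins. `EvaluationDatasetSketch`, `DatasetStatisticsSketch`, and `percent_missing` are hypothetical and exist only for illustration. The real method reads the row and column counts from attributes on `EvaluationDataset`, whereas this sketch derives them with `len()`, and it assumes a 0-100 percentage scale.

```python
from dataclasses import dataclass


@dataclass
class EvaluationDatasetSketch:
    """Hypothetical stand-in: rows are dicts, None marks a missing cell."""
    reference: list
    output: list
    memorized_lines: int = 0


def percent_missing(rows: list) -> float:
    """Hypothetical stand-in for stats.percent_missing (0-100 scale assumed)."""
    cells = [value for row in rows for value in row.values()]
    return 100.0 * sum(v is None for v in cells) / len(cells) if cells else 0.0


@dataclass
class DatasetStatisticsSketch:
    reference_rows: int = 0
    reference_cols: int = 0
    reference_missing: int = 0
    output_rows: int = 0
    output_cols: int = 0
    output_missing: int = 0
    memorized_lines: int = 0

    @staticmethod
    def from_evaluation_dataset(ds: EvaluationDatasetSketch) -> "DatasetStatisticsSketch":
        # Same shape as the factory above: copy counts, truncate percentages.
        return DatasetStatisticsSketch(
            reference_rows=len(ds.reference),
            reference_cols=len(ds.reference[0]) if ds.reference else 0,
            reference_missing=int(percent_missing(ds.reference)),
            output_rows=len(ds.output),
            output_cols=len(ds.output[0]) if ds.output else 0,
            output_missing=int(percent_missing(ds.output)),
            memorized_lines=ds.memorized_lines,
        )


dataset = EvaluationDatasetSketch(
    reference=[{"a": 1, "b": None}, {"a": 2, "b": 3}],
    output=[{"a": 1, "b": 4}],
)
summary = DatasetStatisticsSketch.from_evaluation_dataset(dataset)
```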
