dataset_statistics

Classes:

Name                 Description
DatasetStatistics    Summary statistics for the reference and output datasets.

DatasetStatistics pydantic-model

Bases: Component

Summary statistics for the reference and output datasets.

Reports row and column counts and missing-value percentages for the reference and output dataframes, plus the number of memorized (verbatim-repeated) rows. This component does not produce a numeric score; it provides context for the HTML report.
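As a rough illustration of the quantities described above, the following sketch computes a row count, a column count, and a missing-value percentage for a small in-memory table. It uses only the standard library. The `summarize` helper and its definition of "missing" (a `None` cell, counted over all cells) are assumptions made for illustration, not the library's `stats.percent_missing` implementation.

```python
from typing import Any, Optional


def summarize(rows: list[dict[str, Optional[Any]]]) -> dict[str, float]:
    """Hypothetical helper: row count, column count, percent of None cells."""
    n_rows = len(rows)
    # Assumption: every row shares the same keys, as in a dataframe.
    n_cols = len(rows[0]) if rows else 0
    total_cells = n_rows * n_cols
    missing_cells = sum(1 for row in rows for value in row.values() if value is None)
    # Guard against division by zero for an empty table.
    percent = 100.0 * missing_cells / total_cells if total_cells else 0.0
    return {"rows": n_rows, "cols": n_cols, "percent_missing": percent}


reference = [
    {"age": 34, "city": "Oslo"},
    {"age": None, "city": "Lima"},
    {"age": 51, "city": None},
    {"age": 29, "city": "Kyoto"},
]
summary = summarize(reference)
```

Here two of the eight cells are `None`, so the sketch reports 4 rows, 2 columns, and 25.0 percent missing.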

Fields:

reference_rows = 0 pydantic-field

Row count of the reference dataframe used for evaluation.

reference_cols = 0 pydantic-field

Column count of the reference dataframe used for evaluation.

reference_missing = 0 pydantic-field

Percentage of missing values in the reference dataframe, stored as a whole number.

output_rows = 0 pydantic-field

Row count of the output dataframe used for evaluation.

output_cols = 0 pydantic-field

Column count of the output dataframe used for evaluation.

output_missing = 0 pydantic-field

Percentage of missing values in the output dataframe, stored as a whole number.

memorized_lines = 0 pydantic-field

Number of exact row matches between reference and output.
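One way to read "exact row matches" is the number of output rows whose values also appear, in full, as a row of the reference. The sketch below counts that with a set lookup. The `count_memorized` name and this particular counting rule (every matching output row counts, duplicates included) are assumptions; the library may count differently.

```python
def count_memorized(reference: list[tuple], output: list[tuple]) -> int:
    """Hypothetical helper: output rows that exactly equal some reference row."""
    seen = set(reference)  # rows must be hashable, hence tuples
    return sum(1 for row in output if row in seen)


reference_rows = [(34, "Oslo"), (51, "Lima"), (29, "Kyoto")]
output_rows = [(34, "Oslo"), (40, "Oslo"), (29, "Kyoto"), (29, "Kyoto")]
memorized = count_memorized(reference_rows, output_rows)
```

With these inputs, three output rows match a reference row exactly, while `(40, "Oslo")` does not because only one of its values matches.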

jinja_context cached property

Template context merging all dataset summary fields into the base context.

from_evaluation_dataset(evaluation_dataset, config=None) staticmethod

Compute summary statistics from the evaluation dataset.

Source code in src/nemo_safe_synthesizer/evaluation/components/dataset_statistics.py
@staticmethod
def from_evaluation_dataset(
    evaluation_dataset: EvaluationDataset, config: SafeSynthesizerParameters | None = None
) -> DatasetStatistics:
    """Compute summary statistics from the evaluation dataset."""
    return DatasetStatistics(
        score=EvaluationScore(),
        reference_rows=evaluation_dataset.reference_rows,
        reference_cols=evaluation_dataset.reference_cols,
        reference_missing=int(stats.percent_missing(evaluation_dataset.reference)),
        output_rows=evaluation_dataset.output_rows,
        output_cols=evaluation_dataset.output_cols,
        output_missing=int(stats.percent_missing(evaluation_dataset.output)),
        memorized_lines=evaluation_dataset.memorized_lines,
    )
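
The source above passes each percentage through `int(...)`, which in Python truncates toward zero rather than rounding. Assuming `stats.percent_missing` returns a float on a 0 to 100 scale (an assumption, since its definition is not shown here), a short check of the effect might look like this:

```python
# Assumed example values standing in for stats.percent_missing results.
examples = [0.4, 12.9, 99.99]

# int() drops the fractional part; round() is shown only for comparison.
truncated = [int(value) for value in examples]
rounded = [round(value) for value in examples]

print(truncated)  # [0, 12, 99]
print(rounded)    # [0, 13, 100]
```

Under that assumption, a dataset with less than 1 percent missing values would be reported as 0.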