text_structure_similarity

Classes:

  • TextDataSetStatistics: Per-column text structure statistics (sentence count, word length, etc.).
  • TextStructureSimilarity: Text Structure Similarity metric.

TextDataSetStatistics pydantic-model

Bases: BaseModel

Per-column text structure statistics (sentence count, word length, etc.).

Config:

  • arbitrary_types_allowed: True

Fields:

row_count = 0 pydantic-field

Number of non-empty records analyzed (after dropping NAs and optional downsampling).

column_count = 0 pydantic-field

Always 1; each instance describes a single text column.

duplicate_lines = 0 pydantic-field

Number of text values appearing in both reference and synthetic series. Populated on the synthetic instance only; 0 on reference.

missing_values = 0 pydantic-field

Always 0; NAs are dropped during preprocessing before statistics are computed.

unique_values = 0 pydantic-field

Number of distinct values in the preprocessed text series.

per_record_statistics = pd.DataFrame() pydantic-field

DataFrame with per-record sentence_count, average_words_per_sentence, and average_characters_per_word.

average_sentence_count = 0 pydantic-field

Mean sentence count per record.

average_words_per_sentence = 0 pydantic-field

Mean per-record words-per-sentence ratio.

average_characters_per_word = 0 pydantic-field

Mean per-record characters-per-word ratio.

text_statistic_score = None pydantic-field

JS-divergence-based similarity score comparing reference and synthetic structure. Populated on the synthetic instance only.
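
The per-record and averaged fields above can be illustrated with a minimal sketch. It assumes a naive regex tokenizer (sentences end at `.`, `!`, or `?`; words are runs of word characters); the tokenization the library actually uses is not shown in this page and may differ. The helper names `record_statistics` and `column_averages` are hypothetical.

```python
import re
from statistics import mean

# Assumption: naive sentence and word boundaries, for illustration only.
_SENTENCE_END = re.compile(r"[.!?]+")
_WORD = re.compile(r"\w+")


def record_statistics(text: str) -> dict[str, float]:
    """Compute the three per-record structure statistics for one text value."""
    sentences = [s for s in _SENTENCE_END.split(text) if s.strip()]
    words = _WORD.findall(text)
    sentence_count = len(sentences)
    return {
        "sentence_count": sentence_count,
        "average_words_per_sentence": len(words) / sentence_count if sentence_count else 0.0,
        "average_characters_per_word": sum(len(w) for w in words) / len(words) if words else 0.0,
    }


def column_averages(texts: list[str]) -> dict[str, float]:
    """Average the per-record statistics over all non-empty records."""
    rows = [record_statistics(t) for t in texts if t and t.strip()]
    if not rows:
        return {
            "average_sentence_count": 0.0,
            "average_words_per_sentence": 0.0,
            "average_characters_per_word": 0.0,
        }
    return {
        "average_sentence_count": mean(r["sentence_count"] for r in rows),
        "average_words_per_sentence": mean(r["average_words_per_sentence"] for r in rows),
        "average_characters_per_word": mean(r["average_characters_per_word"] for r in rows),
    }
```

Note that the column-level values are means of per-record ratios, not ratios of column totals, which matches the field descriptions above.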

TextStructureSimilarity pydantic-model

Bases: Component

Text Structure Similarity metric.

Compares per-record sentence count, words-per-sentence, and characters-per-word distributions between reference and output text columns using Jensen-Shannon divergence.
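
As a rough illustration of the comparison described above, the sketch below bins two samples of one statistic (for example, per-record sentence counts) over shared equal-width bins and computes the Jensen-Shannon divergence between the resulting histograms. The bin count, the base-2 logarithm, and the mapping `1 - divergence` from divergence to similarity are assumptions made for this example; how the library bins values and converts divergence into a score is not shown in this page. All function names are hypothetical.

```python
import math


def binned_distribution(values, lo, hi, bins):
    """Return a normalized histogram of values over [lo, hi] with equal-width bins."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # Values equal to the upper edge fall into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    total = len(values)
    return [c / total for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def structure_similarity(reference, synthetic, bins=10):
    """Assumed similarity in [0, 1]: one minus the JS divergence of the histograms."""
    if not reference or not synthetic:
        raise ValueError("both samples must be non-empty")
    lo = min(min(reference), min(synthetic))
    hi = max(max(reference), max(synthetic))
    if lo == hi:
        # Every value is identical, so the distributions coincide.
        return 1.0
    p = binned_distribution(reference, lo, hi, bins)
    q = binned_distribution(synthetic, lo, hi, bins)
    return 1.0 - js_divergence(p, q)
```

Using shared bin edges for both samples matters here: histograms built over different ranges are not comparable bin by bin.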

Fields:

training_statistics = dict() pydantic-field

Maps each text column name to the TextDataSetStatistics computed on the reference data.

synthetic_statistics = dict() pydantic-field

Maps each text column name to the TextDataSetStatistics computed on the synthetic data.

jinja_context cached property

Template context with per-column text structure histogram figures.

from_evaluation_dataset(evaluation_dataset, config=None) staticmethod

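Compute text structure similarity across all text columns. The overall score is the mean of the per-column scores; a column whose statistics fail to compute is logged and skipped, and if no column succeeds an empty EvaluationScore is returned.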

Source code in src/nemo_safe_synthesizer/evaluation/components/text_structure_similarity.py
@staticmethod
def from_evaluation_dataset(
    evaluation_dataset: EvaluationDataset, config: SafeSynthesizerParameters | None = None
) -> TextStructureSimilarity:
    """Compute text structure similarity across all text columns."""
    text_fields = [
        f.name for f in evaluation_dataset.evaluation_fields if f.reference_field_features.type == FieldType.TEXT
    ]

    training = evaluation_dataset.reference
    synthetic = evaluation_dataset.output
    nrows = min(len(evaluation_dataset.reference), len(evaluation_dataset.output))

    # Per-column results; these stay empty if every column fails.
    training_statistics_dict = dict()
    synthetic_statistics_dict = dict()

    try:
        for field in text_fields:
            try:
                training = evaluation_dataset.reference[field]
                synthetic = evaluation_dataset.output[field]

                training = TextStructureSimilarity._preprocess_text_data(training, nrows)
                synthetic = TextStructureSimilarity._preprocess_text_data(synthetic, nrows)

                # Text statistics.
                training_statistics = TextStructureSimilarity._get_text_statistics(training)
                synthetic_statistics = TextStructureSimilarity._get_text_statistics(synthetic)
                synthetic_statistics.duplicate_lines = TextStructureSimilarity._count_duplicate_lines(
                    training, synthetic
                )

                synthetic_statistics.text_statistic_score = TextStructureSimilarity._get_text_statistics_score(
                    training_statistics=training_statistics, synthetic_statistics=synthetic_statistics
                )

                training_statistics_dict[field] = training_statistics
                synthetic_statistics_dict[field] = synthetic_statistics
            except Exception:
                logger.exception(f"Failed to calculate Text Structure stats for field {field}.")
                continue

        total_score = 0
        for val in synthetic_statistics_dict.values():
            total_score += val.text_statistic_score.score
        if len(synthetic_statistics_dict) > 0:
            score = total_score / len(synthetic_statistics_dict)
            text_statistic_score = EvaluationScore.finalize_grade(score, score)
        else:
            text_statistic_score = EvaluationScore()

        return TextStructureSimilarity(
            score=text_statistic_score,
            training_statistics=training_statistics_dict,
            synthetic_statistics=synthetic_statistics_dict,
        )

    except Exception as e:
        logger.exception("Failed to initialize Text Structure Similarity.")
        return TextStructureSimilarity(score=EvaluationScore(notes=str(e)))
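
The control flow of the listing above (skip columns that fail, then average whatever succeeded) can be sketched in isolation. This is a simplified stand-in and not the library's code: plain floats replace EvaluationScore, `None` replaces the empty score, and `score_columns`, `aggregate_scores`, and the `score_fn` callback are hypothetical names.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def score_columns(
    columns: dict[str, tuple[list[str], list[str]]],
    score_fn: Callable[[list[str], list[str]], float],
) -> dict[str, float]:
    """Score each (reference, synthetic) column pair, skipping columns that raise."""
    scores: dict[str, float] = {}
    for name, (reference, synthetic) in columns.items():
        try:
            scores[name] = score_fn(reference, synthetic)
        except Exception:
            # Mirror the listing: log the failure and move on to the next column.
            logger.exception("Failed to score column %s.", name)
            continue
    return scores


def aggregate_scores(scores: dict[str, float]) -> Optional[float]:
    """Return the mean per-column score, or None when no column was scored."""
    if not scores:
        return None
    return sum(scores.values()) / len(scores)
```

One consequence of this design is that a failing column does not lower the overall score; it is simply excluded from the mean, so the per-column dictionaries are worth checking when a result looks better than expected.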