text_structure_similarity
Classes:
| Name | Description |
|---|---|
| TextDataSetStatistics | Per-column text structure statistics (sentence count, word length, etc.). |
| TextStructureSimilarity | Text Structure Similarity metric. |
TextDataSetStatistics
pydantic-model
Bases: BaseModel
Per-column text structure statistics (sentence count, word length, etc.).
Config:
- arbitrary_types_allowed: True
Fields:
- row_count (int)
- column_count (int)
- duplicate_lines (int)
- missing_values (int)
- unique_values (int)
- per_record_statistics (DataFrame)
- average_sentence_count (float)
- average_words_per_sentence (float)
- average_characters_per_word (float)
- text_statistic_score (EvaluationScore | None)
row_count = 0
pydantic-field
Number of non-empty records analyzed (after dropping NAs and optional downsampling).
column_count = 0
pydantic-field
Always 1; each instance describes a single text column.
duplicate_lines = 0
pydantic-field
Number of text values appearing in both reference and synthetic series. Populated on the synthetic instance only; 0 on reference.
missing_values = 0
pydantic-field
Always 0; NAs are dropped during preprocessing before statistics are computed.
unique_values = 0
pydantic-field
Number of distinct values in the preprocessed text series.
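As a rough illustration of how the count fields above relate to each other, the sketch below derives `row_count`, `unique_values`, and `duplicate_lines` for a synthetic column from plain Python lists. The helper name `basic_counts` is hypothetical and not part of this module. The sketch assumes `duplicate_lines` counts distinct shared values (the library may count rows instead), and it omits the optional downsampling step.

```python
def basic_counts(reference, synthetic):
    """Hypothetical helper: counts for the synthetic column after preprocessing."""
    # Drop missing (None) and blank values, mirroring "non-empty records analyzed".
    ref = [t for t in reference if t is not None and t.strip()]
    syn = [t for t in synthetic if t is not None and t.strip()]
    return {
        "row_count": len(syn),
        "unique_values": len(set(syn)),
        # Assumption: distinct text values present in both series.
        "duplicate_lines": len(set(ref) & set(syn)),
    }


counts = basic_counts(
    ["The cat sat.", "Hello world.", None],
    ["Hello world.", "Hello world.", "A new line.", ""],
)
print(counts)
```

Because missing values are removed before anything is counted, `missing_values` has nothing left to report, which is consistent with it always being 0.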
per_record_statistics = pd.DataFrame()
pydantic-field
DataFrame with per-record sentence_count, average_words_per_sentence, and average_characters_per_word.
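The three per-record columns can be approximated with a few lines of standard-library Python. This is a minimal sketch, not the library's implementation: the function name `record_statistics` is hypothetical, and the regex-based sentence and word splitting is an assumption (the actual tokenizer is not documented here).

```python
import re


def record_statistics(text):
    """Hypothetical helper: structure statistics for one text record."""
    # Naive sentence split on runs of '.', '!' or '?'; blank fragments are ignored.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Naive word tokenization: runs of letters, digits, or apostrophes.
    words = re.findall(r"[A-Za-z0-9']+", text)
    if not sentences or not words:
        return {
            "sentence_count": 0,
            "average_words_per_sentence": 0.0,
            "average_characters_per_word": 0.0,
        }
    return {
        "sentence_count": len(sentences),
        "average_words_per_sentence": len(words) / len(sentences),
        "average_characters_per_word": sum(len(w) for w in words) / len(words),
    }


stats = record_statistics("Hello world. This is a short test!")
print(stats)
```

Collecting one such dictionary per record and passing the list to `pandas.DataFrame` would yield a table with the same three columns as `per_record_statistics`.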
average_sentence_count = 0
pydantic-field
Mean sentence count per record.
average_words_per_sentence = 0
pydantic-field
Mean per-record words-per-sentence ratio.
average_characters_per_word = 0
pydantic-field
Mean per-record characters-per-word ratio.
text_statistic_score = None
pydantic-field
JS-divergence-based similarity score comparing reference and synthetic structure. Populated on the synthetic instance only.
TextStructureSimilarity
pydantic-model
Bases: Component
Text Structure Similarity metric.
Compares the distributions of per-record sentence count, words per sentence, and characters per word between reference and synthetic text columns using Jensen-Shannon divergence.
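To make the comparison concrete, the sketch below bins one per-record statistic for each dataset and computes the Jensen-Shannon divergence between the two histograms. The helper names, the bin edges, and the base-2 logarithm are illustrative assumptions. How the library bins values and maps the divergence to an `EvaluationScore` is not specified here, so `1 - divergence` is shown only as one plausible similarity.

```python
import math


def histogram(values, bin_edges):
    """Hypothetical helper: normalized histogram over half-open bins [lo, hi)."""
    counts = [0.0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if bin_edges[i] <= v < bin_edges[i + 1]:
                counts[i] += 1.0
                break
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def js_divergence(p, q):
    """Jensen-Shannon divergence with log base 2, so the result lies in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Illustrative per-record sentence counts for each dataset.
edges = [0, 2, 4, 6]
reference_hist = histogram([1, 2, 2, 3], edges)
synthetic_hist = histogram([1, 1, 2, 5], edges)

divergence = js_divergence(reference_hist, synthetic_hist)
similarity = 1.0 - divergence  # assumption: higher means more similar
print(divergence, similarity)
```

Identical histograms give a divergence of 0 and histograms with no overlap give 1, which is why a bounded similarity can be derived from it. The same procedure would apply to each of the three statistics in turn.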
Fields:
- score (EvaluationScore)
- name (str)
- training_statistics (dict[str, TextDataSetStatistics])
- synthetic_statistics (dict[str, TextDataSetStatistics])
training_statistics = dict()
pydantic-field
Per-column text structure statistics for the reference data.
synthetic_statistics = dict()
pydantic-field
Per-column text structure statistics for the synthetic data.
jinja_context
cached
property
Template context with per-column text structure histogram figures.
from_evaluation_dataset(evaluation_dataset, config=None)
staticmethod
Compute text structure similarity across all text columns.