metadata
Classes:
| Name | Description |
|---|---|
| EntityMetadata | |
| TypeMetadata | |
| FieldMetadata | |
| EntitySummary | Contains entity summary data that is unique by label name. |
| FieldsMetadata | |
| MetadataService | Service that provides functionality to label records and also track model_metadata across the whole dataset. |
EntityMetadata(label, count, f_ratio, approx_cardinality, sources, field_label_f_ratio)
dataclass
Attributes:
| Name | Type | Description |
|---|---|---|
| label | str | Label of detected entity. |
| count | int | Number of times this entity was detected. |
| f_ratio | float | Equal to (number of values with this entity)/(total number of values for this field). |
| approx_cardinality | int | How many distinct values there were for this entity type. |
| sources | list[str] | A list of unique sources that contributed predictions to the entity summary. |
| field_label_f_ratio | float | The ratio of (column spanning entity matches)/(total number of field values). |
label
instance-attribute
Label of detected entity.
count
instance-attribute
Number of times this entity was detected.
f_ratio
instance-attribute
Equal to (number of values with this entity)/(total number of values for this field).
approx_cardinality
instance-attribute
How many distinct values there were for this entity type.
sources
instance-attribute
A list of unique sources that contributed predictions to the entity summary.
field_label_f_ratio
instance-attribute
The ratio of (column spanning entity matches)/(total number of field values). This field is used to determine if an entity should be applied as a field_label in transformation pipelines.
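The two ratios above can be illustrated with a small, self-contained sketch. It does not use the library: `EntityMetadataSketch`, `Match`, and `summarize_entity` are hypothetical names, the match tuple shape is invented for the example, and "column spanning" is assumed here to mean that a match covers the entire field value. The distinct count is exact in this sketch, whereas the documented attribute is approximate.

```python
from dataclasses import dataclass

# One detection: (label, start offset, end offset, source name).
# This shape is a hypothetical stand-in, not the library's match type.
Match = tuple[str, int, int, str]


@dataclass
class EntityMetadataSketch:
    """Hypothetical mirror of the documented EntityMetadata attributes."""
    label: str
    count: int
    f_ratio: float
    approx_cardinality: int
    sources: list[str]
    field_label_f_ratio: float


def summarize_entity(label: str, column: list[tuple[str, list[Match]]]) -> EntityMetadataSketch:
    """Summarize one entity label over (value, matches) pairs of a single field."""
    total = len(column)
    count = 0                 # total detections of this label
    values_with_entity = 0    # values containing at least one detection
    spanning = 0              # values fully covered by one detection (assumed meaning)
    distinct: set[str] = set()
    sources: set[str] = set()
    for value, matches in column:
        hits = [m for m in matches if m[0] == label]
        count += len(hits)
        if hits:
            values_with_entity += 1
        if any(start == 0 and end == len(value) for _, start, end, _ in hits):
            spanning += 1
        for _, start, end, source in hits:
            distinct.add(value[start:end])
            sources.add(source)
    return EntityMetadataSketch(
        label=label,
        count=count,
        f_ratio=values_with_entity / total if total else 0.0,
        approx_cardinality=len(distinct),  # exact here; approximate in the docs
        sources=sorted(sources),
        field_label_f_ratio=spanning / total if total else 0.0,
    )


column = [
    ("alice@example.com", [("email_address", 0, 17, "regex")]),
    ("mail bob@example.com", [("email_address", 5, 20, "regex")]),
    ("n/a", []),
    ("alice@example.com", [("email_address", 0, 17, "model")]),
]
summary = summarize_entity("email_address", column)
```

In this example three of four values contain an email address, so `f_ratio` is 0.75, but only two values consist of nothing but the address, so `field_label_f_ratio` is 0.5.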
TypeMetadata(type, count)
dataclass
FieldMetadata(field, count, approx_cardinality, missing, pct_missing, pct_total_unique, s_score, entities=list(), types=list(), field_labels=list(), field_attributes=list())
dataclass
Attributes:
| Name | Type | Description |
|---|---|---|
| count | int | Number of times this field appeared in the dataset. |
| approx_cardinality | int | How many distinct values this field has in the dataset (approximate). |
| missing | int | Number of records that didn't contain this field. |
| pct_missing | float | Percentage of records missing this field across the whole dataset [0-100]. |
| pct_total_unique | float | Percentage of unique values in the whole dataset [0-100]. |
| s_score | float | Sensitivity score [0-1]. |
| entities | list[EntityMetadata] | List of entities detected in values of this field. |
| types | list[TypeMetadata] | List of types detected in values of this field. |
| field_labels | list[str] | Labels detected for this field. |
| field_attributes | list[FieldAttribute] | Attributes detected for this field. |
count
instance-attribute
Number of times this field appeared in the dataset.
approx_cardinality
instance-attribute
How many distinct values this field has in the dataset (approximate).
missing
instance-attribute
Number of records that didn't contain this field.
pct_missing
instance-attribute
Percentage of records missing this field across the whole dataset [0-100].
pct_total_unique
instance-attribute
Percentage of unique values in the whole dataset [0-100]. This is equal to 100 when all values for this field are unique.
s_score
instance-attribute
Sensitivity score [0-1].
It's equal to:
- 1.0, when all values are unique and there are no values missing.
- moving toward 0.0 with missing values and/or many values that are repeated.
The score is intended to quickly highlight columns that may need special handling in either transforms or the synthesizer.
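To make the relationship between `missing`, `pct_missing`, `pct_total_unique`, and `s_score` concrete, here is a minimal sketch. The source does not give the actual `s_score` formula, so the one below is purely an assumption chosen to match the documented behavior (1.0 when everything is unique and nothing is missing, falling toward 0.0 otherwise). The function name `field_stats`, the use of `None` for a missing value, and the choice of denominators are also hypothetical.

```python
def field_stats(values: list) -> dict:
    """Compute illustrative per-field statistics.

    `values` holds one entry per record; None marks a record that did not
    contain the field (an assumption made for this sketch).
    """
    total = len(values)
    present = [v for v in values if v is not None]
    count = len(present)
    missing = total - count
    cardinality = len(set(present))  # exact here; the documented value is approximate

    pct_missing = 100.0 * missing / total if total else 0.0
    # Assumption: uniqueness is measured against the values that are present.
    pct_total_unique = 100.0 * cardinality / count if count else 0.0
    # Hypothetical score, NOT the library's formula: the share of unique
    # values, discounted by the share of missing values.
    s_score = (pct_total_unique / 100.0) * (1.0 - pct_missing / 100.0)

    return {
        "count": count,
        "approx_cardinality": cardinality,
        "missing": missing,
        "pct_missing": pct_missing,
        "pct_total_unique": pct_total_unique,
        "s_score": s_score,
    }


all_unique = field_stats(["a", "b", "c", "d"])
with_gaps = field_stats(["a", "b", "b", None])
```

Under these assumptions a column of distinct, fully populated values scores 1.0, while a column with one repeat and one missing value out of four records scores about 0.5.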
entities = Field(default_factory=list)
class-attribute
instance-attribute
List of entities detected in values of this field.
types = Field(default_factory=list)
class-attribute
instance-attribute
List of types detected in values of this field.
field_labels = Field(default_factory=list)
class-attribute
instance-attribute
Labels detected for this field.
field_attributes = Field(default_factory=list)
class-attribute
instance-attribute
Attributes detected for this field.
EntitySummary(label, fields, count, approx_distinct_count, sources)
dataclass
Contains entity summary data that is unique by label name.
Attributes:
| Name | Type | Description |
|---|---|---|
| label | str | Name of the entity or label. |
| fields | list[str] | Fields containing the entity or label. |
| count | int | Total number of entities found in the dataset. |
| approx_distinct_count | int | Approximate total number of unique entity values found in the dataset. |
| sources | list[str] | A list of unique sources that contributed predictions to the entity summary. |
label
instance-attribute
Name of the entity or label.
fields
instance-attribute
Fields containing the entity or label.
count
instance-attribute
Total number of entities found in the dataset.
approx_distinct_count
instance-attribute
Approximate total number of unique entity values found in the dataset. This value is collected using a HyperLogLog (HLL) data structure.
sources
instance-attribute
A list of unique sources that contributed predictions to the entity summary.
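For readers unfamiliar with HyperLogLog, the sketch below shows why a distinct count obtained this way is approximate. It is a deliberately tiny, textbook-style implementation written for illustration only. It is not the structure the library uses, and `TinyHLL`, its register count, and the choice of SHA-1 as the hash are all assumptions.

```python
import hashlib
import math


class TinyHLL:
    """Minimal HyperLogLog sketch (illustrative, not the library's HLL)."""

    def __init__(self, p: int = 10):
        self.p = p                    # number of index bits
        self.m = 1 << p               # number of registers (1024 by default)
        self.registers = [0] * self.m

    def add(self, value: str) -> None:
        # Take 64 bits of a stable hash of the value.
        digest = hashlib.sha1(value.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                 # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self) -> int:
        alpha = 0.7213 / (1 + 1.079 / self.m)    # bias correction for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return round(self.m * math.log(self.m / zeros))
        return round(raw)


hll = TinyHLL()
for _ in range(2):                    # adding every value twice changes nothing
    for i in range(5000):
        hll.add(f"user-{i}")
estimate = hll.estimate()
```

The estimate lands near 5000 rather than exactly on it, and repeated values do not inflate it. Memory stays fixed at the number of registers regardless of how many values are added, which is the usual reason for accepting an approximate count.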
FieldsMetadata(fields=list(), entities=list())
dataclass
Attributes:
| Name | Type | Description |
|---|---|---|
| fields | list[FieldMetadata] | List of fields in the dataset. |
| entities | list[EntitySummary] | List of entities in the dataset. Unique by entity label and score. |
fields = Field(default_factory=list)
class-attribute
instance-attribute
List of fields in the dataset. Note: This list preserves the field order of the original dataset.
entities = Field(default_factory=list)
class-attribute
instance-attribute
List of entities in the dataset. Unique by entity label and score.
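One way to picture how dataset-level entity summaries relate to per-field entity data is a simple group-by on the label. The sketch below is an assumption about the general shape of that aggregation, not the library's code: `EntitySummarySketch`, `summarize_by_label`, and the input dictionary layout are hypothetical. The approximate distinct count is left out because per-field cardinalities cannot simply be summed.

```python
from dataclasses import dataclass, field


@dataclass
class EntitySummarySketch:
    """Hypothetical subset of the documented EntitySummary attributes."""
    label: str
    fields: list[str] = field(default_factory=list)
    count: int = 0
    sources: list[str] = field(default_factory=list)


def summarize_by_label(
    per_field: dict[str, list[tuple[str, int, list[str]]]],
) -> list[EntitySummarySketch]:
    """Group per-field (label, count, sources) entries into one summary per label.

    `per_field` maps field name to its entity entries, in dataset column order.
    """
    summaries: dict[str, EntitySummarySketch] = {}
    for field_name, entities in per_field.items():
        for label, count, sources in entities:
            summary = summaries.setdefault(label, EntitySummarySketch(label=label))
            if field_name not in summary.fields:
                summary.fields.append(field_name)
            summary.count += count
            # Keep sources unique and in a stable order.
            summary.sources = sorted(set(summary.sources) | set(sources))
    return list(summaries.values())


per_field = {
    "email": [("email_address", 40, ["regex"])],
    "notes": [("email_address", 3, ["model"]), ("person_name", 12, ["model"])],
}
summaries = summarize_by_label(per_field)
```

Here `email_address` appears in two fields, so its summary lists both fields, adds the counts, and merges the sources, while `person_name` keeps a single field.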
MetadataService(ner, field_label_condition=None)
Service that provides functionality to label records and also track model_metadata across the whole dataset.
It uses NER for the labeling itself and tracks labels across fields.
Methods:
| Name | Description |
|---|---|
| add_field_names | Adds names of all fields that should be tracked. |
| get_metadata | Returns dataset model_metadata based on records that were labeled up to this point. |
Source code in src/nemo_safe_synthesizer/pii_replacer/ner/metadata.py
add_field_names(field_names)
Adds names of all fields that should be tracked. This is necessary to track fields that can be present in the dataset but have no values. For example, in a CSV file with a header "my_field" whose column is entirely empty, we still want to report model_metadata on that field.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| field_names | list[str] | Names of the fields to be initialized. These names should be in the same order as they appear in the dataset. | required |
Source code in src/nemo_safe_synthesizer/pii_replacer/ner/metadata.py
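The empty-column case described for add_field_names can be sketched with a toy tracker. This is not MetadataService and does not call it: `FieldTrackerSketch`, its `add_record` method, and the dictionary it returns are hypothetical, and the sketch assumes that empty strings and `None` count as "no value".

```python
import csv
import io


class FieldTrackerSketch:
    """Toy stand-in showing why field names are registered up front."""

    def __init__(self) -> None:
        self._counts: dict[str, int] = {}  # dicts keep insertion order
        self._records = 0

    def add_field_names(self, field_names: list[str]) -> None:
        # Register every field, even if no record ever carries a value for it.
        for name in field_names:
            self._counts.setdefault(name, 0)

    def add_record(self, record: dict) -> None:
        self._records += 1
        for name, value in record.items():
            if value not in (None, ""):
                self._counts[name] = self._counts.get(name, 0) + 1

    def get_metadata(self) -> list[dict]:
        return [
            {"field": name, "count": count, "missing": self._records - count}
            for name, count in self._counts.items()
        ]


csv_text = "id,my_field\n1,\n2,\n"  # "my_field" has a header but no values


def run(register_header: bool) -> list[dict]:
    reader = csv.DictReader(io.StringIO(csv_text))
    tracker = FieldTrackerSketch()
    if register_header:
        tracker.add_field_names(list(reader.fieldnames))
    for row in reader:
        tracker.add_record(row)
    return tracker.get_metadata()


with_header = run(register_header=True)
without_header = run(register_header=False)
```

With the header registered, "my_field" is reported with a count of 0 and both records missing. Without it, the toy tracker never learns the field exists and reports only "id".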
get_metadata()
Returns dataset model_metadata based on records that were labeled up to this point.