metadata

Classes:

Name Description
EntityMetadata
TypeMetadata
FieldMetadata
EntitySummary

Contains entity summary data that is unique by label name

FieldsMetadata
MetadataService

Service that labels records and tracks model_metadata across the whole dataset.

EntityMetadata(label, count, f_ratio, approx_cardinality, sources, field_label_f_ratio) dataclass

Attributes:

Name Type Description
label str

Label of detected entity.

count int

Number of times this entity was detected.

f_ratio float

Equal to (number of values with this entity)/(total number of values for this field).

approx_cardinality int

How many distinct values there were for this entity type.

sources list[str]

A list of unique sources that contributed predictions to the entity summary.

field_label_f_ratio float

The ratio of (column spanning entity matches)/(total number of field values).

label instance-attribute

Label of detected entity.

count instance-attribute

Number of times this entity was detected.

f_ratio instance-attribute

Equal to (number of values with this entity)/(total number of values for this field).

approx_cardinality instance-attribute

How many distinct values there were for this entity type.

sources instance-attribute

A list of unique sources that contributed predictions to the entity summary.

field_label_f_ratio instance-attribute

The ratio of (column spanning entity matches)/(total number of field values). This field is used to determine if an entity should be applied as a field_label in transformation pipelines.
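The two ratios above can be illustrated with a small sketch. It is not the library's implementation: the `EntityMetadata` class below is a stand-in that only mirrors the documented attributes, `summarize_entity` and its `detections` input are hypothetical, and "column spanning" is assumed here to mean that the detected text covers the entire field value. The cardinality is computed exactly with a set, whereas the documented value is approximate.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EntityMetadata:
    """Stand-in mirroring the documented attributes (not the real class)."""
    label: str
    count: int
    f_ratio: float
    approx_cardinality: int
    sources: list[str]
    field_label_f_ratio: float


def summarize_entity(
    label: str,
    detections: list[tuple[str, Optional[str]]],
    source: str,
) -> EntityMetadata:
    """Hypothetical helper: each item is (field value, detected text or None)."""
    total = len(detections)
    matched = [(value, text) for value, text in detections if text is not None]
    # Assumption: a "column spanning" match covers the whole field value.
    spanning = [value for value, text in matched if text == value]
    return EntityMetadata(
        label=label,
        count=len(matched),
        f_ratio=len(matched) / total if total else 0.0,
        approx_cardinality=len({text for _, text in matched}),  # exact in this sketch
        sources=[source],
        field_label_f_ratio=len(spanning) / total if total else 0.0,
    )


detections = [
    ("alice@example.com", "alice@example.com"),   # match spans the whole value
    ("mail bob@example.com", "bob@example.com"),  # match covers part of the value
    ("n/a", None),                                # no entity detected
    ("alice@example.com", "alice@example.com"),   # repeated value
]
email = summarize_entity("email_address", detections, source="regex")
print(email.f_ratio, email.field_label_f_ratio)
```

In this sketch three of four values contain the entity, but only two are spanned completely by it, so `field_label_f_ratio` is lower than `f_ratio`.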

TypeMetadata(type, count) dataclass

Attributes:

Name Type Description
type str

Type of the values in the dataset.

count int

Number of times this type appeared in the values of a field.

type instance-attribute

Type of the values in the dataset. See common.records.base.get_type_as_string for the list of types.

count instance-attribute

Number of times this type appeared in the values of a field.

FieldMetadata(field, count, approx_cardinality, missing, pct_missing, pct_total_unique, s_score, entities=list(), types=list(), field_labels=list(), field_attributes=list()) dataclass

Attributes:

Name Type Description
count int

Number of times this field appeared in the dataset.

approx_cardinality int

How many distinct values this field has in the dataset (approximate).

missing int

Number of records that didn't contain this field.

pct_missing float

Percentage of records in the whole dataset that are missing this field [0-100].

pct_total_unique float

Percent of unique values in the whole dataset [0-100].

s_score float

Sensitivity score [0-1].

entities list[EntityMetadata]

List of entities detected in values of this field.

types list[TypeMetadata]

List of types detected in values of this field.

field_labels list[str]

Labels detected for this field.

field_attributes list[FieldAttribute]

Attributes detected for this field.

count instance-attribute

Number of times this field appeared in the dataset.

approx_cardinality instance-attribute

How many distinct values this field has in the dataset (approximate).

missing instance-attribute

Number of records that didn't contain this field.

pct_missing instance-attribute

Percentage of records in the whole dataset that are missing this field [0-100].

pct_total_unique instance-attribute

Percent of unique values in the whole dataset [0-100]. This equals 100 when all values for this field are unique.
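A minimal sketch of how the two percentages could be derived for a single column. The function name `column_percentages` is hypothetical, and the denominators are assumptions: missing values are counted against all records, and unique values against the values that are present, which matches the documented behaviour of reaching 100 when all values are unique. The library itself reports an approximate cardinality, while this sketch counts exactly.

```python
from typing import Optional


def column_percentages(values: list[Optional[str]]) -> tuple[float, float]:
    """Return (pct_missing, pct_total_unique), both on a 0-100 scale."""
    total = len(values)
    present = [v for v in values if v is not None]
    missing = total - len(present)
    # Assumption: missing is measured against all records in the dataset.
    pct_missing = 100.0 * missing / total if total else 0.0
    # Assumption: uniqueness is measured against the values that are present.
    pct_total_unique = 100.0 * len(set(present)) / len(present) if present else 0.0
    return pct_missing, pct_total_unique


pct_missing, pct_total_unique = column_percentages(["a", "b", None, "a"])
print(pct_missing, round(pct_total_unique, 2))
```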

s_score instance-attribute

Sensitivity score [0-1].

It's equal to:
- 1.0, when all values are unique and there are no values missing.
- moving toward 0.0 with missing values and/or many values that are repeated.

The score is meant to quickly highlight columns that may need special handling in either transforms or the synthesizer.

entities = Field(default_factory=list) class-attribute instance-attribute

List of entities detected in values of this field.

types = Field(default_factory=list) class-attribute instance-attribute

List of types detected in values of this field.

field_labels = Field(default_factory=list) class-attribute instance-attribute

Labels detected for this field.

field_attributes = Field(default_factory=list) class-attribute instance-attribute

Attributes detected for this field.

EntitySummary(label, fields, count, approx_distinct_count, sources) dataclass

Contains entity summary data that is unique by label name

Attributes:

Name Type Description
label str

Name of the entity or label.

fields list[str]

Fields containing the entity or label.

count int

Total number of entities found in the dataset.

approx_distinct_count int

Approximate total number of unique entity values found in the dataset.

sources list[str]

A list of unique sources that contributed predictions to the entity summary.

label instance-attribute

Name of the entity or label.

fields instance-attribute

Fields containing the entity or label.

count instance-attribute

Total number of entities found in the dataset.

approx_distinct_count instance-attribute

Approximate total number of unique entity values found in the dataset. This value is collected using an HLL (HyperLogLog) data structure, so it is an estimate rather than an exact count.
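To show why the distinct count is approximate, here is a compact HyperLogLog sketch. It assumes "HLL" refers to HyperLogLog and is not the implementation this module uses; the class, its precision parameter `p`, and the choice of SHA-1 as the hash are illustrative only.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog with 2**p registers (not the library's code)."""

    def __init__(self, p: int = 12) -> None:
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value: str) -> None:
        digest = hashlib.sha1(value.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit hash of the value
        bits = 64 - self.p
        idx = h >> bits                         # top p bits select a register
        rest = h & ((1 << bits) - 1)            # remaining bits
        # Rank: 1-based position of the leftmost 1-bit within the remaining bits.
        rank = bits - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)   # bias constant, valid for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
for i in range(10_000):
    hll.add(f"user-{i}")
for i in range(10_000):
    hll.add(f"user-{i}")  # duplicates leave the registers unchanged
print(round(hll.estimate()))
```

The memory used is fixed by the number of registers regardless of how many values are added, which is the usual reason to accept an estimate instead of an exact set-based count.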

sources instance-attribute

A list of unique sources that contributed predictions to the entity summary.

FieldsMetadata(fields=list(), entities=list()) dataclass

Attributes:

Name Type Description
fields list[FieldMetadata]

List of fields in the dataset.

entities list[EntitySummary]

List of entities in the dataset. Unique by entity label and score.

fields = Field(default_factory=list) class-attribute instance-attribute

List of fields in the dataset. Note: This list preserves the field order of the original dataset.

entities = Field(default_factory=list) class-attribute instance-attribute

List of entities in the dataset. Unique by entity label and score.

MetadataService(ner, field_label_condition=None)

Service that labels records and tracks model_metadata across the whole dataset.

It uses NER for the labeling itself and tracks labels across fields.

Methods:

Name Description
add_field_names

Adds names of all fields that should be tracked.

get_metadata

Returns dataset model_metadata based on the records that have been labeled so far.

Source code in src/nemo_safe_synthesizer/pii_replacer/ner/metadata.py
def __init__(
    self,
    ner: NER | NERParallel,
    field_label_condition: FieldLabelCondition | None = None,
):
    self.ner = ner
    self.dataset_metadata_tracker = _DatasetMetadataTracker(field_label_condition=field_label_condition)  # noqa: F821

add_field_names(field_names)

Adds names of all fields that should be tracked. This is necessary to track fields that are present in the dataset but have no values. For example, if a CSV file has a header "my_field" but the whole column is empty, we still want to report model_metadata on that field.

Parameters:

Name Type Description Default
field_names list[str]

Names of the fields to be initialized. These names should be in the same order as they appear in the dataset.

required
Source code in src/nemo_safe_synthesizer/pii_replacer/ner/metadata.py
def add_field_names(self, field_names: list[str]):
    """
    Adds names of all fields that should be tracked.
    This is necessary to track fields that can be present in the dataset,
    but have no values.
    For example for a CSV file, where there is a header "my_field", but the whole
    column is empty, we still want to report model_metadata on that field.

    Args:
        field_names: Names of the fields to be initialized. These names should be in the
            same order as they appear in the dataset.
    """
    self.dataset_metadata_tracker.add_field_names(field_names)
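The reason for registering field names up front can be shown with a toy tracker. Everything in this sketch is hypothetical (`ToyTracker`, `ToyFieldStats`, and the `observe` method are not part of the module, and the real `_DatasetMetadataTracker` is not shown in this page); it only illustrates that a field seen in a header but never populated still appears in the snapshot, in dataset order.

```python
from dataclasses import dataclass


@dataclass
class ToyFieldStats:
    """Toy per-field counters (hypothetical, not FieldMetadata)."""
    field: str
    count: int = 0    # records in which the field had a value
    missing: int = 0  # records in which the field had no value


class ToyTracker:
    """Toy stand-in for a dataset metadata tracker."""

    def __init__(self) -> None:
        self._stats: dict[str, ToyFieldStats] = {}
        self._records = 0

    def add_field_names(self, field_names: list[str]) -> None:
        # Register every field first so empty columns are still reported.
        # dict insertion order keeps the fields in dataset order.
        for name in field_names:
            self._stats.setdefault(name, ToyFieldStats(field=name))

    def observe(self, record: dict[str, str]) -> None:
        self._records += 1
        for name, value in record.items():
            if value == "":
                continue  # an empty cell does not count as a value
            self._stats.setdefault(name, ToyFieldStats(field=name)).count += 1

    def get_snapshot(self) -> list[ToyFieldStats]:
        return [
            ToyFieldStats(s.field, s.count, self._records - s.count)
            for s in self._stats.values()
        ]


tracker = ToyTracker()
tracker.add_field_names(["id", "my_field"])
tracker.observe({"id": "1", "my_field": ""})
tracker.observe({"id": "2", "my_field": ""})
snapshot = tracker.get_snapshot()
print([(s.field, s.count, s.missing) for s in snapshot])
```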

get_metadata()

Returns dataset model_metadata based on the records that have been labeled so far.

Source code in src/nemo_safe_synthesizer/pii_replacer/ner/metadata.py
def get_metadata(self) -> DatasetMetadata:
    """Returns dataset model_metadata based on records that were labeled to this point."""
    return self.dataset_metadata_tracker.get_snapshot()