nlp

Classes:

| Name | Description |
| --- | --- |
| FieldStr | String-optimized field representation for NLP prediction pipelines. |

FieldStr(field, value_path, offset, text) dataclass

String-optimized field representation for NLP prediction pipelines.

Methods:

| Name | Description |
| --- | --- |
| from_kv_pair | Returns a string-optimized input for NLP predictions. |
| spacy_doc_to_ner_prediction | Given a prediction document, returns a list of NERPrediction objects. |

from_kv_pair(pair) classmethod

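Returns a string-optimized input for NLP predictions.

For example, given the key-value pair

{"location": "united states"}

this function merges the pair into the string

"location is united states"

These merged strings produce better prediction results from our NLP pipeline. The length of the field prefix (here, "location is ") is stored in offset so that entity positions can later be mapped back onto the original value.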

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| pair | KVPair | KVPair from a JSONRecord to merge. | required |

Returns:

| Type | Description |
| --- | --- |
| FieldStr | An instance of FieldStr. |

Source code in src/nemo_safe_synthesizer/pii_replacer/ner/nlp.py
@classmethod
def from_kv_pair(cls, pair: KVPair) -> FieldStr:
    """Returns a string-optimized input for NLP predictions.

    For example, given the key-value pair

        {"location": "united states"}

    this function merges the pair into the string

        "location is united states"

    These merged strings produce better prediction results from our NLP pipeline.

    Args:
        pair: ``KVPair`` from a ``JSONRecord`` to merge.

    Returns:
        An instance of ``FieldStr``.
    """
    # Build the "<field tokens><delimiter>" prefix; a pair without a field name gets no prefix.
    prefix = (" ".join(pair.field_tokens) + SPACY_DELIM) if pair.field else ""
    return cls(
        field=pair.field,
        value_path=pair.value_path,
        offset=len(prefix),
        text=prefix + str(pair.value),
    )
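
The merge step can be tried in isolation with stand-in types. This is a minimal sketch, not the library's code: `KVPairStub` and `FieldStrSketch` are hypothetical names that model only the attributes `from_kv_pair` reads, and the value `" is "` for `SPACY_DELIM` is an assumption inferred from the docstring example rather than taken from the source.

```python
from dataclasses import dataclass
from typing import Any

# Assumption: the real SPACY_DELIM is not shown here; " is " is inferred from
# the docstring example "location is united states".
SPACY_DELIM = " is "


@dataclass
class KVPairStub:
    """Hypothetical stand-in for KVPair (only the attributes used below)."""

    field: str
    value: Any
    value_path: str
    field_tokens: list[str]


@dataclass
class FieldStrSketch:
    """Hypothetical stand-in for FieldStr with the same four fields."""

    field: str
    value_path: str
    offset: int
    text: str

    @classmethod
    def from_kv_pair(cls, pair: KVPairStub) -> "FieldStrSketch":
        # Prefix is "<field tokens><delimiter>", or empty when there is no field.
        prefix = (" ".join(pair.field_tokens) + SPACY_DELIM) if pair.field else ""
        return cls(
            field=pair.field,
            value_path=pair.value_path,
            offset=len(prefix),
            text=prefix + str(pair.value),
        )


pair = KVPairStub(
    field="location",
    value="united states",
    value_path="location",
    field_tokens=["location"],
)
merged = FieldStrSketch.from_kv_pair(pair)
print(merged.text)                   # merged string passed to the NLP pipeline
print(merged.text[merged.offset:])   # slicing at offset recovers the original value
```

Slicing the merged text at `offset` always recovers the stringified value, which is what makes the later offset correction in `spacy_doc_to_ner_prediction` possible.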

spacy_doc_to_ner_prediction(doc, source, validator=None)

Given a prediction document, return a list of NERPrediction objects.

This function applies a set of rules to a spaCy doc and extracts predictions based on those rules. Certain predictions are filtered out based on score and entity type.

This function is also responsible for mapping predictions on the merged input string back onto its source KVPair. Because the merged text starts with a field prefix, entity character offsets are shifted by the prefix length (offset) before being reported, and entities that start inside the prefix are skipped.

Parameters:

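| Name | Type | Description | Default |
| --- | --- | --- | --- |
| doc | Doc | The spaCy doc to extract entities from. | required |
| source | str | The model used to create predictions. | required |
| validator | Optional[Callable] | Optional callable invoked on each entity; entities for which it returns a falsy value are skipped. | None |

Returns:

| Type | Description |
| --- | --- |
| list[NERPrediction] | One NERPrediction per retained entity. |

Source code in src/nemo_safe_synthesizer/pii_replacer/ner/nlp.py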
def spacy_doc_to_ner_prediction(
    self, doc: Doc, source: str, validator: Optional[Callable] = None
) -> list[NERPrediction]:
    """Given a prediction document, return a list of NERPrediction objects.

    This function applies a set of rules to a spaCy doc and extracts predictions
    based on those rules. Certain predictions are filtered out based on score
    and entity type.

    This function is also responsible for mapping predictions on the merged input
    string back onto its source KVPair. Because the merged text starts with a field
    prefix, entity character offsets are shifted by the prefix length before
    being reported.

    Args:
        doc: The spaCy doc to extract entities from.
        source: The model used to create predictions.
        validator: Optional callable invoked on each entity; entities for which
            it returns a falsy value are skipped.

    Returns:
        A list of ``NERPrediction`` objects, one per retained entity.
    """
    preds = []
    for ent in doc.ents:
        # Don't create predictions for entities that were found inside the tokenized prefix.
        if ent.start_char - self.offset >= 0:
            if validator is None or validator(ent):
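                label = NLP_ENTITY_MAP.get(ent.label_)
                # Shift character offsets so they are relative to the value, not the merged text.
                start = ent.start_char - self.offset
                end = ent.end_char - self.offset
                # True when the entity covers only part of the value rather than all of it.
                substring_match = end - start + self.offset != len(doc.text)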
                if label:
                    pred = NERPrediction(
                        text=ent.text.strip(),
                        start=start,
                        end=end,
                        field=self.field,
                        value_path=self.value_path,
                        label=label.tag,
                        score=doc._.ent_score(label),
                        source=source,  # todo: get source from model instead of cls
                        substring_match=substring_match,
                    )
                    preds.append(pred)
    return preds
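
The offset correction and the optional validator can be illustrated without spaCy. The sketch below is an assumption-laden simplification: `EntStub` and `remap_entities` are hypothetical names, the entity is reduced to its text and character offsets, and the label lookup (`NLP_ENTITY_MAP`) and scoring (`doc._.ent_score`) steps of the real method are deliberately left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EntStub:
    """Hypothetical stand-in for a spaCy entity span (text and char offsets only)."""

    text: str
    start_char: int
    end_char: int


def remap_entities(
    doc_text: str,
    offset: int,
    ents: list[EntStub],
    validator: Optional[Callable[[EntStub], bool]] = None,
) -> list[tuple[str, int, int, bool]]:
    """Map entity offsets in the merged text back onto the original value."""
    results = []
    for ent in ents:
        if ent.start_char < offset:
            continue  # entity starts inside the field prefix
        if validator is not None and not validator(ent):
            continue  # rejected by the caller-supplied check
        start = ent.start_char - offset
        end = ent.end_char - offset
        # Entity length plus prefix length equals the full text length only
        # when the entity spans the entire value.
        substring_match = (end - start) + offset != len(doc_text)
        results.append((ent.text, start, end, substring_match))
    return results


text = "location is united states"
offset = len("location is ")
ents = [
    EntStub("location", 0, 8),         # inside the prefix, skipped
    EntStub("united states", 12, 25),  # covers the whole value
    EntStub("united", 12, 18),         # covers part of the value
]

spans = remap_entities(text, offset, ents)
# Example validator (an assumption for illustration): keep multi-word entities only.
filtered = remap_entities(text, offset, ents, validator=lambda e: " " in e.text)
print(spans)
print(filtered)
```

Here the full-value entity is reported at positions 0 to 13 with `substring_match` false, while the partial entity is reported at 0 to 6 with `substring_match` true.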