person_name

Custom module for Person Name detection.

Classes:

Name Description
WordList

Holds the decompressed data loaded from S3 / FS.

PersonNamePredictor

WordList(word_list=frozenset(), headers=None, headers_neg=None, headers_pairs=frozenset(), parts=frozenset()) dataclass

Holds the decompressed data loaded from S3 / FS. The attribute names MUST match the ref names of the files in the manifest.

Attributes:

Name Type Description
word_list KeywordProcessor

The master list of actual names

headers Pattern

The list of partial header names that can trigger the prediction flow

headers_neg KeywordProcessor

A list of header tokens that should not be present to trigger prediction flow

headers_pairs frozenset[str]

A list of words that can be combined with the word 'name'; use it to build additional header pairs for analysis

parts frozenset[str]

Parts of a name that can be used to match, such as Mr., Mrs., etc.

word_list = field(default_factory=frozenset) class-attribute instance-attribute

The master list of actual names

headers = None class-attribute instance-attribute

The list of partial header names that can trigger the prediction flow

headers_neg = None class-attribute instance-attribute

A list of header tokens that should not be present to trigger prediction flow

headers_pairs = field(default_factory=frozenset) class-attribute instance-attribute

A list of words that can be combined with the word 'name'; use it to build additional header pairs for analysis

parts = field(default_factory=frozenset) class-attribute instance-attribute

Parts of a name that can be used to match, such as Mr., Mrs., etc.
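
The requirement that attribute names match the manifest ref names can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `WordListSketch`, `from_refs`, and the idea that the manifest resolves to a plain `{ref_name: data}` mapping are hypothetical, and plain `frozenset` values stand in for `KeywordProcessor`. The real `WordList.init_from_manifest` is not shown on this page and may work differently.

```python
import re
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class WordListSketch:
    """Illustrative stand-in for WordList (not the real class)."""

    word_list: frozenset = field(default_factory=frozenset)
    headers: Optional[re.Pattern] = None
    headers_neg: Optional[frozenset] = None
    headers_pairs: frozenset = field(default_factory=frozenset)
    parts: frozenset = field(default_factory=frozenset)


def from_refs(loaded: dict) -> WordListSketch:
    """Build the dataclass from a {ref_name: data} mapping.

    The mapping shape is an assumption; it only demonstrates why a ref
    name that matches no attribute has to be treated as an error.
    """
    known = {f.name for f in fields(WordListSketch)}
    unknown = set(loaded) - known
    if unknown:
        raise ValueError(f"refs with no matching attribute: {sorted(unknown)}")
    # Refs missing from the mapping fall back to the dataclass defaults.
    return WordListSketch(**loaded)
```

With this shape, a ref named `parts` fills `parts`, an absent ref keeps its default, and a misspelled ref fails loudly instead of being silently dropped.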

PersonNamePredictor()

Bases: Predictor

Methods:

Name Description
check_exact_name_header_data

Checks that every token is in one of three lists and at least one token is in the main name list.

Source code in src/nemo_safe_synthesizer/pii_replacer/ner/person_name.py
def __init__(self):
    super().__init__(self.default_name)
    self.word_list = WordList.init_from_manifest()

check_exact_name_header_data(field_value)

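Returns True only when both conditions hold (single-character tokens are skipped):

1) Every token must exist in one of three lists: the main name list, the header patterns, or the name parts.
2) At least one of the tokens must exist in the main name list.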

Source code in src/nemo_safe_synthesizer/pii_replacer/ner/person_name.py
def check_exact_name_header_data(self, field_value: str) -> bool:
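    """Return True if the value looks like a person name.

    1) Every token must exist in one of three lists
       (main name list, header patterns, name parts).
    2) At least one of the tokens must exist in the main name list.
    Single-character tokens are skipped.
    """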
    tokens = [token.lower() for token in re.findall(TOKEN_REGEX, field_value)]
    _found = False
    for token in tokens:
        if len(token) == 1:
            continue

        in_main_word_list = self.word_list.word_list.extract_keywords(token)

        # if the token is not in any of these lists, fail
        if (
            not in_main_word_list
            and not self.word_list.headers.match(token)
            # and token not in self.word_list.headers
            and token not in self.word_list.parts
        ):
            return False
        # the token is in one of these three lists; if we haven't
        # found a token that's also in the main word_list yet,
        # we check that here
        if not _found and in_main_word_list:
            _found = True

    return _found
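
The two rules can be tried out in isolation with a self-contained sketch. The names and data below are assumptions for illustration only: the real `TOKEN_REGEX` and the contents of the word lists are not shown on this page, and plain sets stand in for the `KeywordProcessor` lookup.

```python
import re

# Assumed values, chosen only to make the sketch runnable.
TOKEN_REGEX = r"[A-Za-z']+"
NAMES = frozenset({"john", "smith", "maria"})   # stand-in for word_list
HEADERS = re.compile(r"name|first|last")        # stand-in for headers
PARTS = frozenset({"mr", "mrs", "dr"})          # stand-in for parts


def looks_like_name(field_value: str) -> bool:
    """Mirror the two rules: all tokens known, at least one a real name."""
    tokens = [t.lower() for t in re.findall(TOKEN_REGEX, field_value)]
    found = False
    for token in tokens:
        if len(token) == 1:
            continue  # initials such as "Q" are ignored
        in_names = token in NAMES
        # Rule 1: a token outside all three lists rejects the value.
        if not in_names and not HEADERS.match(token) and token not in PARTS:
            return False
        # Rule 2: remember whether any token was an actual name.
        found = found or in_names
    return found


print(looks_like_name("Mr. John Q. Smith"))
print(looks_like_name("John bought milk"))
print(looks_like_name("Mr. Dr."))
```

Under these assumed lists, the first call passes both rules, the second fails rule 1 on "bought", and the third fails rule 2 because no token is an actual name. Note that `Pattern.match` anchors only at the start of the token, so a header pattern also accepts any token that merely begins with it.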