person_name

`person_name` ¶

Custom module for Person Name detection.

Classes:

Name	Description
`WordList`	Where we load the decompressed data from S3 / FS. These attr names MUST match the ref names of the files from the manifest
`PersonNamePredictor`

`WordList(word_list=frozenset(), headers=None, headers_neg=None, headers_pairs=frozenset(), parts=frozenset())` `dataclass` ¶

Container for the decompressed data loaded from S3 or the filesystem (FS). The attribute names MUST match the ref names of the files in the manifest.

Attributes:

Name	Type	Description
`word_list`	`KeywordProcessor`	The master list of actual names
`headers`	`Pattern`	The list of partial header names that can trigger the prediction flow
`headers_neg`	`KeywordProcessor`	Header tokens that must be absent for the prediction flow to trigger
`headers_pairs`	`frozenset[str]`	Words that can be combined with the word 'name'; used to build additional header pairs for analysis
`parts`	`frozenset[str]`	Parts of a name that can be used to match, such as Mr. and Mrs.

`word_list = field(default_factory=frozenset)` `class-attribute` `instance-attribute` ¶

The master list of actual names

`headers = None` `class-attribute` `instance-attribute` ¶

The list of partial header names that can trigger the prediction flow

`headers_neg = None` `class-attribute` `instance-attribute` ¶

Header tokens that must be absent for the prediction flow to trigger

`headers_pairs = field(default_factory=frozenset)` `class-attribute` `instance-attribute` ¶

Words that can be combined with the word 'name'. Use these to build additional header pairs for analysis.

`parts = field(default_factory=frozenset)` `class-attribute` `instance-attribute` ¶

Parts of a name that can be used to match, such as Mr. and Mrs.
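The requirement that attribute names match the manifest ref names can be illustrated with a small sketch. This is not the module's loader: `WordListSketch` and `build_from_refs` are hypothetical names, the sketch keeps only the set-valued fields, and it assumes each ref has already been decompressed into a plain collection of strings.

```python
from dataclasses import dataclass, field, fields


@dataclass
class WordListSketch:
    """Hypothetical stand-in for WordList; only the set-valued fields are kept."""

    word_list: frozenset = field(default_factory=frozenset)
    headers_pairs: frozenset = field(default_factory=frozenset)
    parts: frozenset = field(default_factory=frozenset)


def build_from_refs(loaded: dict) -> WordListSketch:
    """Fill each field from the ref with the same name; reject unknown refs."""
    known = {f.name for f in fields(WordListSketch)}
    unknown = set(loaded) - known
    if unknown:
        # A ref name with no matching attribute would otherwise be dropped silently.
        raise KeyError(f"refs without a matching attribute: {sorted(unknown)}")
    return WordListSketch(
        **{name: frozenset(words) for name, words in loaded.items()}
    )
```

Under this assumption, a ref named `parts` lands in the `parts` attribute, any ref that is not supplied keeps its empty default, and a misnamed ref fails loudly rather than being ignored.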

`PersonNamePredictor()` ¶

Bases: Predictor

Methods:

Name	Description
`check_exact_name_header_data`	Check that every token exists in one of three lists and that at least one token exists in the main name list

Source code in src/nemo_safe_synthesizer/pii_replacer/ner/person_name.py

def __init__(self):
    super().__init__(self.default_name)
    self.word_list = WordList.init_from_manifest()

`check_exact_name_header_data(field_value)` ¶

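Returns `True` only when both conditions hold: (1) every token exists in one of three lists (the main name list, the header patterns, or the name parts), and (2) at least one token exists in the main name list. Single-character tokens are skipped.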

Source code in src/nemo_safe_synthesizer/pii_replacer/ner/person_name.py

def check_exact_name_header_data(self, field_value: str) -> bool:
    """The logic here is that
    1) Every token must exist in one of three lists
    2) At least one of the tokens must exist in the main name list
    """
    tokens = [token.lower() for token in re.findall(TOKEN_REGEX, field_value)]
    _found = False
    for token in tokens:
        if len(token) == 1:
            continue

        in_main_word_list = self.word_list.word_list.extract_keywords(token)

        # if the token is not in any of these lists, fail
        if (
            not in_main_word_list
            and not self.word_list.headers.match(token)
            # and token not in self.word_list.headers
            and token not in self.word_list.parts
        ):
            return False
        # the token is one of these three lists, if we haven't
        # found a token that's also in the main word_list yet,
        # we check that here
        if not _found and in_main_word_list:
            _found = True

    return _found
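The two-condition rule can be tried out in isolation with the sketch below. Everything in it is a stand-in: the value of `TOKEN_REGEX` is not shown on this page, so `\w+` is an assumption, and plain sets plus a small compiled pattern replace the `KeywordProcessor` and `Pattern` objects that the real `WordList` holds. The sample names, header pattern, and parts are made up for illustration.

```python
import re

TOKEN_REGEX = r"\w+"  # assumption: the real pattern is defined elsewhere
NAMES = frozenset({"john", "smith", "maria"})  # stand-in for word_list
HEADERS = re.compile(r"name|first|last")  # stand-in for headers
PARTS = frozenset({"mr", "mrs", "dr"})  # stand-in for parts


def looks_like_exact_name(field_value: str) -> bool:
    """Mirror the rule: all tokens known, at least one a real name."""
    tokens = [t.lower() for t in re.findall(TOKEN_REGEX, field_value)]
    found_name = False
    for token in tokens:
        if len(token) == 1:
            continue  # initials and stray characters are ignored
        in_names = token in NAMES
        # Fail as soon as a token belongs to none of the three lists.
        if not in_names and not HEADERS.match(token) and token not in PARTS:
            return False
        found_name = found_name or in_names
    return found_name


print(looks_like_exact_name("Mr. John Q. Smith"))
print(looks_like_exact_name("John Acme"))
print(looks_like_exact_name("first name"))
```

With these stand-in lists, the first call passes because every multi-character token is known and two of them are names. The second fails on the unknown token `acme`, and the third fails because header words alone never satisfy the second condition.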
