person_name
Custom module for Person Name detection.
Classes:
| Name | Description |
|---|---|
| WordList | Where we load the decompressed data from S3 / FS. |
| PersonNamePredictor | |
WordList(word_list=frozenset(), headers=None, headers_neg=None, headers_pairs=frozenset(), parts=frozenset())
dataclass
Holds the decompressed data loaded from S3 / FS. The attribute names MUST match the ref names of the files from the manifest.
Attributes:
| Name | Type | Description |
|---|---|---|
| word_list | KeywordProcessor | The master list of actual names |
| headers | Pattern | The list of partial header names that can trigger the prediction flow |
| headers_neg | KeywordProcessor | A list of header tokens that should not be present to trigger prediction flow |
| headers_pairs | frozenset[str] | A list of words that can be combined with the word 'name'; used to build additional header pairs for analysis |
| parts | frozenset[str] | Parts of a name that can be used to match, such as Mr., Mrs., etc. |
word_list = field(default_factory=frozenset)
class-attribute
instance-attribute
The master list of actual names
headers = None
class-attribute
instance-attribute
The list of partial header names that can trigger the prediction flow
headers_neg = None
class-attribute
instance-attribute
A list of header tokens that should not be present to trigger prediction flow
headers_pairs = field(default_factory=frozenset)
class-attribute
instance-attribute
A list of words that can be combined with the word 'name'. It should be used to build additional header pairs for analysis.
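As an illustration of how pair words might be combined with the word 'name', here is a minimal sketch. The function `build_header_pairs` and its separator choices are hypothetical; this page does not describe how the module actually builds the pairs.

```python
def build_header_pairs(pair_words, base="name", separators=(" ", "_", "")):
    """Combine each pair word with the base word using each separator.

    Hypothetical helper: the separators are an assumption, not taken
    from the module.
    """
    pairs = set()
    for word in pair_words:
        for sep in separators:
            pairs.add(f"{word}{sep}{base}")
    return frozenset(pairs)


# Example pair words (illustrative values only).
headers_pairs = frozenset({"first", "last"})
pairs = build_header_pairs(headers_pairs)
```

With these inputs, `pairs` contains candidate headers such as `first_name` and `last name`.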
parts = field(default_factory=frozenset)
class-attribute
instance-attribute
These are parts of a name that can be used to match, such as Mr., Mrs., etc.
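The field layout above can be sketched as a standalone dataclass. This is a stand-in named `WordListSketch`, not the module's own class: it mirrors the documented defaults, but uses plain `frozenset` values where the attribute table lists `KeywordProcessor`, and the sample values are invented for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class WordListSketch:
    """Stand-in mirroring the documented WordList fields and defaults."""

    # Assumption: frozenset stands in for the documented KeywordProcessor.
    word_list: frozenset = field(default_factory=frozenset)
    headers: Optional[re.Pattern] = None
    # Assumption: frozenset stands in for the documented KeywordProcessor.
    headers_neg: Optional[frozenset] = None
    headers_pairs: frozenset = field(default_factory=frozenset)
    parts: frozenset = field(default_factory=frozenset)


# Illustrative values only; real data would come from the manifest files.
words = WordListSketch(
    word_list=frozenset({"alice", "bob"}),
    headers=re.compile(r"name", re.IGNORECASE),
    parts=frozenset({"mr.", "mrs."}),
)
```

Fields left out of the constructor call fall back to their defaults, so `words.headers_pairs` is an empty `frozenset` and `words.headers_neg` is `None`.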
PersonNamePredictor()
Bases: Predictor
Methods:
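| Name | Description |
|---|---|
| check_exact_name_header_data | The logic here is that every token must exist in one of three lists, and at least one token must exist in the main name list. |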
Source code in src/nemo_safe_synthesizer/pii_replacer/ner/person_name.py
check_exact_name_header_data(field_value)
The logic here is that two conditions must both hold:
1) Every token must exist in one of three lists.
2) At least one of the tokens must exist in the main name list.
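The two conditions can be sketched as a small standalone function. This is not the method's source: the function name `check_exact_name`, the lowercase whitespace tokenization, and the choice of three lists (`names`, `parts`, and a hypothetical `extras`) are all assumptions, since this page does not say which three lists are consulted.

```python
def check_exact_name(field_value, names, parts, extras):
    """Return True when every token is known and at least one is a name.

    Assumptions: tokens are lowercase whitespace-separated words, and the
    three lists are `names` (main list), `parts`, and `extras`.
    """
    tokens = field_value.lower().split()
    if not tokens:
        return False
    allowed = names | parts | extras
    # Condition 1: every token exists in one of the three lists.
    every_token_known = all(token in allowed for token in tokens)
    # Condition 2: at least one token exists in the main name list.
    has_real_name = any(token in names for token in tokens)
    return every_token_known and has_real_name


# Illustrative lists only.
NAMES = frozenset({"alice", "smith"})
PARTS = frozenset({"mr.", "mrs.", "dr."})
EXTRAS = frozenset({"jr.", "iii"})
```

Under these assumptions, a value made only of titles fails the second condition, and a value containing any unknown token fails the first.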