regex
Classes:
| Name | Description |
|---|---|
| Pattern | Represents a single regex pattern and its settings on how to be applied. |
| RegexPredictor | Base class that represents a single entity. |
| PhraseMatcherBuilder | Build specialized RegexPredictor objects that are designed to match on phrases. |
Functions:
| Name | Description |
|---|---|
| split_header_contexts | Split a list of strings and RePatterns into two distinct regexes. |
| phrase_predictors_from_entity_ruler | Given a list of spaCy EntityRuler patterns, create a phrase matcher predictor. |
| create_exact_field_matcher | Helper function that takes a full string that should be matched exactly in a field name and puts it into a regex. |
Pattern(pattern, context_score=Score.HIGH, raw_score=Score.LOW, ignore_raw_score=False, header_contexts=list(), neg_header_contexts=list(), header_context_source=Predictor.KEY, span_contexts=list())
dataclass
Represents a single regex pattern and its settings on how to be applied.
Attributes:
| Name | Type | Description |
|---|---|---|
| context_score | Optional[float] | This is the optimal score that you want to assign when context exists either in the header name or the surrounding text. |
| raw_score | Optional[float] | This is the score that gets applied if there are no matching contexts. |
| ignore_raw_score | bool | If set, do not emit a match if only the raw regex matches without any context. |
| header_contexts | Optional[list[str \| Pattern]] | A list of strings or regexes that should be used to check the name of the field / header for a match. |
| neg_header_contexts | Optional[list[str \| Pattern]] | A list of strings or regexes that can be used to disqualify a field from being analyzed. |
| header_context_source | int | If doing header context searching, this dictates where to search for the context. We default to only searching within the field name itself. |
| span_contexts | Optional[ContextSpan \| list[ContextSpan]] | A list of ContextSpan instances that will be used, if provided, to search surrounding text of a string match. |
context_score = Score.HIGH
class-attribute
instance-attribute
This is the optimal score that you want to assign when context exists either in the header name or the surrounding text. We default this to high.
raw_score = Score.LOW
class-attribute
instance-attribute
This is the score that gets applied if there are no matching contexts. We default this to low.
ignore_raw_score = False
class-attribute
instance-attribute
If set, do not emit a match if only the raw regex matches without any context.
header_contexts = field(default_factory=list)
class-attribute
instance-attribute
A list of strings or regexes that should be used to check the name of the field / header for a match. If there are any matches here, then the context_score value will be used as the matched score.
neg_header_contexts = field(default_factory=list)
class-attribute
instance-attribute
A list of strings or regexes that can be used to disqualify a field from being analyzed. If used, any matches here will short-circuit processing for a given key/value pair.
header_context_source = Predictor.KEY
class-attribute
instance-attribute
If doing header context searching, this dictates where to search for the context. We default to only searching within the field name itself, but we can also search the value or the concatenation of the field name and value.
span_contexts = field(default_factory=list)
class-attribute
instance-attribute
A list of ContextSpan instances that will be used, if provided, to search surrounding text of a string match for other discrete strings or matching regular expressions. See the ContextSpan usage for more details.
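The interaction between context_score, raw_score, ignore_raw_score and the two header context lists can be sketched without the library. The following is only an illustration under stated assumptions: `SketchPattern`, `score_field`, the numeric values of `HIGH` and `LOW`, and the use of plain substring checks for header contexts are all hypothetical and are not taken from the module.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

# Assumed numeric values; the real Score.HIGH / Score.LOW values are not documented here.
HIGH, LOW = 0.8, 0.2


@dataclass
class SketchPattern:
    """Hypothetical stand-in that mirrors a few Pattern attributes."""
    pattern: re.Pattern
    context_score: float = HIGH
    raw_score: float = LOW
    ignore_raw_score: bool = False
    header_contexts: list[str] = field(default_factory=list)
    neg_header_contexts: list[str] = field(default_factory=list)


def score_field(p: SketchPattern, key: str, value: str) -> Optional[float]:
    """Return a score for one key/value pair, or None when nothing is emitted."""
    key_lower = key.lower()
    # Negative header contexts short-circuit before the value is inspected.
    if any(ctx in key_lower for ctx in p.neg_header_contexts):
        return None
    if not p.pattern.search(value):
        return None
    # A header context hit upgrades the match to context_score.
    if any(ctx in key_lower for ctx in p.header_contexts):
        return p.context_score
    # Raw regex match only: either suppressed or scored with raw_score.
    return None if p.ignore_raw_score else p.raw_score


ssn = SketchPattern(
    pattern=re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    header_contexts=["ssn", "social"],
    neg_header_contexts=["order"],
)
print(score_field(ssn, "user_ssn", "123-45-6789"))  # header context hit
print(score_field(ssn, "notes", "123-45-6789"))     # raw match only
print(score_field(ssn, "order_id", "123-45-6789"))  # disqualified header
```

The sketch only searches the field name, which corresponds to the documented default for header_context_source; span_contexts are left out entirely.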
RegexPredictor(name=None, patterns=None, entity=None, namespace=None)
Bases: Predictor
Base class that represents a single entity.
Entities are matched based on a set of patterns with varying accuracy scores.
Methods:
| Name | Description |
|---|---|
| validate_match | A base method for regex rules to implement. |
| filter_by_range_by_score | Filter predictions by text range and take max score. |
| evaluate | Given a single record determine if any entities are represented. |
Source code in src/nemo_safe_synthesizer/pii_replacer/ner/regex.py
validate_match(matched_text, original_text)
A base method for regex rules to implement.
The validate function is used to confirm an entity match. If the return value is not None, the max score for that entity will be used.
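As an illustration of what a validating rule might look like, the sketch below confirms a candidate card number with the Luhn checksum. It is a free function rather than a RegexPredictor subclass, and returning the matched text on success is an assumption: the description above only states that a non-None return value confirms the match.

```python
from typing import Optional


def luhn_ok(digits: str) -> bool:
    """Standard Luhn checksum over a string of decimal digits."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def validate_match(matched_text: str, original_text: str) -> Optional[str]:
    """Hypothetical validator: non-None confirms the match, None rejects it."""
    digits = "".join(ch for ch in matched_text if ch.isdigit())
    if len(digits) >= 12 and luhn_ok(digits):
        return matched_text
    return None


print(validate_match("4111 1111 1111 1111", "card: 4111 1111 1111 1111"))
print(validate_match("4111 1111 1111 1112", "card: 4111 1111 1111 1112"))
```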
filter_by_range_by_score(field_matches)
Filter predictions by text range and take max score.
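One plausible reading of this step is sketched below: predictions that cover the same character range are collapsed to the single highest-scoring one. `Prediction` is a hypothetical stand-in for NERPrediction, and treating "text range" as identical (start, end) offsets is an assumption; the actual method may handle overlapping ranges differently.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    """Hypothetical stand-in for NERPrediction."""
    start: int
    end: int
    label: str
    score: float


def keep_best_per_range(preds: list[Prediction]) -> list[Prediction]:
    """Keep the highest-scoring prediction for each (start, end) range."""
    best: dict[tuple[int, int], Prediction] = {}
    for p in preds:
        key = (p.start, p.end)
        if key not in best or p.score > best[key].score:
            best[key] = p
    # Highest score first.
    return sorted(best.values(), key=lambda p: p.score, reverse=True)


preds = [
    Prediction(0, 5, "a", 0.2),
    Prediction(0, 5, "a", 0.8),
    Prediction(6, 9, "b", 0.5),
]
print(keep_best_per_range(preds))
```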
evaluate(in_record, res_by_field=False)
Given a single record determine if any entities are represented.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| in_record | JSONRecord | The record to match patterns against. | required |

Returns:
| Type | Description |
|---|---|
| list[NERPrediction] | A list of entity predictions sorted by score. Top score is first entry in list. |
PhraseMatcherBuilder(name, *, namespace='safe-synthesizer')
Build specialized RegexPredictor objects that are designed to match on phrases in a many-to-one relationship between phrases and entities. This uses RegexPredictors in a simple way: it constructs single regexes that logically "OR" together many phrases and sets the raw score to HIGH.
Once this is initialized, phrases can be added and the regex patterns will be mapped per-entity. A list of RegexPredictors can be exported at any time.
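The idea of OR-ing many phrases into a single regex can be sketched with the standard library alone. `phrases_to_regex` is a hypothetical helper, not part of this module; the word-boundary anchors and the longest-first ordering are assumptions made for the illustration.

```python
import re


def phrases_to_regex(phrases: list[str], case_sensitive: bool = False) -> re.Pattern:
    """Combine literal phrases into one alternation regex."""
    # Longest first so "New York City" wins over "New York" at the same position.
    ordered = sorted(set(phrases), key=len, reverse=True)
    # re.escape keeps characters such as "." literal.
    alternation = "|".join(re.escape(p) for p in ordered)
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.compile(rf"\b(?:{alternation})\b", flags)


cities = phrases_to_regex(["New York", "New York City", "St. Louis"])
print(cities.findall("Flights from new york city to St. Louis"))
```

Keeping separate case-sensitive and case-insensitive phrase lists per label, as the phrase_patterns attribute describes, would amount to calling such a helper once per list.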
Methods:
| Name | Description |
|---|---|
| add_phrase | Take a simple phrase and modify it to become a regex. |

Attributes:
| Name | Type | Description |
|---|---|---|
| phrase_patterns | dict[str \| Entity, PhrasePatterns] | Map each unique label to two lists, one for case insensitive and one for case sensitive matches. |
phrase_patterns = defaultdict(PhrasePatterns)
instance-attribute
Map each unique label to two lists, one for case insensitive and one for case sensitive matches.
add_phrase(label, phrase, case=False)
Take a simple phrase and modify it to become a regex.
split_header_contexts(contexts)
Split a list of strings and RePatterns into two distinct regexes.
Returns (regexes, tokens)
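A minimal sketch of the partitioning step follows. It only separates compiled patterns from plain strings and returns them as two lists in the documented (regexes, tokens) order; whether the real function then merges each group into a single regex is not stated here, so `split_contexts` should be read as a hypothetical illustration.

```python
import re
from typing import Union


def split_contexts(
    contexts: list[Union[str, re.Pattern]],
) -> tuple[list[re.Pattern], list[str]]:
    """Partition mixed contexts into (regexes, tokens)."""
    regexes = [c for c in contexts if isinstance(c, re.Pattern)]
    tokens = [c for c in contexts if isinstance(c, str)]
    return regexes, tokens


regexes, tokens = split_contexts(["ssn", re.compile(r"social.*number"), "tax_id"])
print(tokens)
print([r.pattern for r in regexes])
```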
phrase_predictors_from_entity_ruler(name, er_patterns, entity_map)
Given a list of spaCy EntityRuler patterns, create a phrase matcher predictor.
create_exact_field_matcher(match)
Helper function that takes a full string that should be matched exactly in a field name and puts it into a regex that supports finding that exact string in potentially flattened fields.
If we are looking for the word "foo" exactly, we want to support looking for it in the following header names:
"foo", "foo.bar", "bar.foo", "bar.foo.baz"
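One way to express this kind of match is to require the name to be a whole dot-delimited segment. The sketch below assumes "." is the only flattening delimiter and that matching is case sensitive; `exact_field_regex` is a hypothetical name and the regex the library actually builds may differ.

```python
import re


def exact_field_regex(match: str) -> re.Pattern:
    """Match `match` only as a complete segment of a dotted field name."""
    # Start of string or "." before the name, "." or end of string after it.
    return re.compile(rf"(?:^|\.){re.escape(match)}(?:\.|$)")


foo = exact_field_regex("foo")
for name in ["foo", "foo.bar", "bar.foo", "bar.foo.baz", "foobar", "bar.food"]:
    print(name, bool(foo.search(name)))
```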