base
Base record representation and field-level tokenization utilities.
Provides BaseRecord -- the abstract base for record types used by the
PII replacer -- along with KVPair for representing flattened key-value
entries, and helpers for tokenizing field names (tokenize_header,
tokenize_on_upper).
Classes:

| Name | Description |
|---|---|
| KVPair | A single flattened key-value entry from a record. |
| BaseRecord | Abstract base for structured record representations. |
Functions:

| Name | Description |
|---|---|
| tokenize_on_upper | Split a camelCase or PascalCase string into lowercase tokens. |
| tokenize_header | Tokenize a field/column name into lowercase word tokens. |
| get_type_as_string | Return the JSON schema type name for a Python scalar value. |
| normalize_labels | Normalize labels by converting them to lowercase. |
| normalize_label | Convert a single label to lowercase. |
KVPair(field, value, scalar_type, array_count, value_path)
A single flattened key-value entry from a record.
Stores the field name, value, scalar type, nesting depth (array count), and the structural path to the value in the original document.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| field | str | Dot-joined field name (array markers removed). | required |
| value | str \| Number | The scalar value. | required |
| scalar_type | str | JSON schema type string. | required |
| array_count | int | Number of array levels this value is nested within. | required |
| value_path | ValuePath | Structural path tuple identifying the value's location. | required |
Methods:

| Name | Description |
|---|---|
| as_dict | Serialize to a dictionary of field, value, scalar_type, and array_count. |
Attributes:

| Name | Type | Description |
|---|---|---|
| json_path | str | JSONPath string identifying the value's location in the original document. |
Source code in src/nemo_safe_synthesizer/data_processing/records/base.py
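As an illustration, a minimal stand-in with the documented constructor and `as_dict` behavior might look like the sketch below. This is not the library's implementation: the real `ValuePath` type is approximated with a plain tuple, and the `json_path` attribute is omitted.

```python
from dataclasses import dataclass

# Hypothetical alias: the real ValuePath is presumably a structural path
# tuple; a plain tuple stands in for it here.
ValuePath = tuple


@dataclass(frozen=True)
class KVPair:
    """Sketch of the documented KVPair; not the library implementation."""

    field: str             # dot-joined field name, array markers removed
    value: object          # the scalar value (str or number)
    scalar_type: str       # JSON schema type string
    array_count: int       # number of enclosing array levels
    value_path: ValuePath  # structural path to the value

    def as_dict(self) -> dict:
        # Per the docs, as_dict serializes field, value, scalar_type,
        # and array_count; value_path is not included.
        return {
            "field": self.field,
            "value": self.value,
            "scalar_type": self.scalar_type,
            "array_count": self.array_count,
        }
```

For example, `KVPair("user.name", "Ada", "string", 0, ("user", "name")).as_dict()` yields a four-key dictionary without `value_path`, matching the documented `as_dict` description.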
BaseRecord(original)
Bases: ABC
Abstract base for structured record representations.
Subclasses implement unpack to flatten the original record into a
list of KVPair entries and a set of field names.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| original | | The raw record data (typically a dict or string). | required |
Methods:

| Name | Description |
|---|---|
| unpack | Flatten self.original into self.kv_pairs and self.fields. |
| as_dict | Serialize the record to a dictionary with original data, kv_pairs, and fields. |
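To make the contract concrete, here is a hedged sketch of the base class and one subclass. `FlatDictRecord` and its flat-dict handling are hypothetical, and the real base class may differ in its attribute initialization and serialization details.

```python
from abc import ABC, abstractmethod


class BaseRecord(ABC):
    """Sketch of the documented abstract base; illustrative only."""

    def __init__(self, original):
        self.original = original
        self.kv_pairs = []   # populated by unpack()
        self.fields = set()  # populated by unpack()

    @abstractmethod
    def unpack(self):
        """Flatten self.original into self.kv_pairs and self.fields."""

    def as_dict(self):
        # Per the docs: original data, kv_pairs, and fields.
        # Sorting the field set is a choice made here for determinism.
        return {
            "original": self.original,
            "kv_pairs": self.kv_pairs,
            "fields": sorted(self.fields),
        }


class FlatDictRecord(BaseRecord):
    """Hypothetical subclass that unpacks a flat dict of scalars."""

    def unpack(self):
        for field, value in self.original.items():
            self.kv_pairs.append((field, value))
            self.fields.add(field)
```

A real subclass would build `KVPair` instances (with scalar types and value paths) rather than bare tuples; the sketch only shows where format-specific unpacking plugs in.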
unpack() abstractmethod
Flatten self.original into self.kv_pairs and self.fields.
Must be implemented by subclasses to handle format-specific unpacking (e.g., JSON objects, CSV rows).
as_dict()
Serialize the record to a dictionary with original data, kv_pairs, and fields.
tokenize_on_upper(data)
Split a camelCase or PascalCase string into lowercase tokens.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | str | String to tokenize. | required |
Returns:

| Type | Description |
|---|---|
| list[str] | List of lowercase token strings, or an empty list if data is empty. |
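A plausible re-implementation of the documented behavior is sketched below. The regex and edge-case handling (e.g., acronym runs, digits) are assumptions; the actual source may differ.

```python
import re


def tokenize_on_upper(data: str) -> list[str]:
    """Split a camelCase or PascalCase string into lowercase tokens.

    Hypothetical sketch of the documented function.
    """
    if not data:
        return []
    # Match acronym runs (e.g., "HTTP"), capitalized or lowercase word
    # runs (e.g., "Name", "first"), and digit runs, then lowercase each.
    tokens = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", data)
    return [t.lower() for t in tokens]
```

Under this sketch, `"firstName"` splits into `["first", "name"]` and an empty string yields `[]`.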
tokenize_header(field)
Tokenize a field/column name into lowercase word tokens.
Underscores are treated as separators, and camelCase boundaries are
split via tokenize_on_upper.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| field | str | Field name to tokenize. | required |
Returns:

| Type | Description |
|---|---|
| list[str] | List of lowercase word tokens extracted from the field name. |
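The documented combination of underscore splitting and camelCase splitting can be sketched as follows. The camelCase helper is inlined here so the example is self-contained; both bodies are assumptions, not the library's code.

```python
import re


def _tokenize_on_upper(data: str) -> list[str]:
    # camelCase/PascalCase splitter (hypothetical sketch).
    return [t.lower() for t in re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", data or "")]


def tokenize_header(field: str) -> list[str]:
    """Tokenize a field/column name into lowercase word tokens.

    Underscores act as separators; each underscore-delimited part is
    further split on camelCase boundaries.
    """
    tokens: list[str] = []
    for part in field.split("_"):
        tokens.extend(_tokenize_on_upper(part))
    return tokens
```

For instance, `"user_firstName"` would tokenize to `["user", "first", "name"]` under this sketch.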
get_type_as_string(value)
Return the JSON schema type name for a Python scalar value.
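The section is cut off before the parameter table, but the documented behavior can be sketched. The exact type names are an assumption, in particular whether Python ints map to "integer" rather than "number".

```python
def get_type_as_string(value) -> str:
    """Return the JSON schema type name for a Python scalar value.

    Hypothetical sketch; assumes JSON schema names ("string", "number",
    "integer", "boolean", "null").
    """
    if value is None:
        return "null"
    if isinstance(value, bool):  # check before int: bool subclasses int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"not a scalar: {type(value).__name__}")
```

Note the `bool` check before `int`: since `True` is an instance of `int` in Python, the order of the `isinstance` checks matters.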