
base

Base record representation and field-level tokenization utilities.

Provides BaseRecord -- the abstract base for record types used by the PII replacer -- along with KVPair for representing flattened key-value entries, and helpers for tokenizing field names (tokenize_header, tokenize_on_upper).

Classes:

- KVPair: A single flattened key-value entry from a record.
- BaseRecord: Abstract base for structured record representations.

Functions:

- tokenize_on_upper: Split a camelCase or PascalCase string into lowercase tokens.
- tokenize_header: Tokenize a field/column name into lowercase word tokens.
- get_type_as_string: Return the JSON schema type name for a Python scalar value.
- normalize_labels: Normalize labels by converting them to lowercase.
- normalize_label: Convert a single label to lowercase.

KVPair(field, value, scalar_type, array_count, value_path)

A single flattened key-value entry from a record.

Stores the field name, value, scalar type, nesting depth (array count), and the structural path to the value in the original document.

Parameters:

- field (str): Dot-joined field name (array markers removed). Required.
- value (str | Number): The scalar value. Required.
- scalar_type (str): JSON schema type string ("string", "number", etc.). Required.
- array_count (int): Number of array levels this value is nested within. Required.
- value_path (ValuePath): Structural path tuple identifying the value's location. Required.

Methods:

- as_dict: Serialize to a dictionary of field, value, scalar_type, and array_count.

Attributes:

- json_path: JSONPath string (e.g., $.user.emails[0].address).

Source code in src/nemo_safe_synthesizer/data_processing/records/base.py
def __init__(
    self,
    field: str,
    value: str | Number,
    scalar_type: str,
    array_count: int,
    value_path: ValuePath,
):
    self.field = field
    self.value = value
    self.scalar_type = scalar_type
    self.array_count = array_count
    self.value_path = value_path
    self.field_tokens = tokenize_header(field)

json_path property

JSONPath string (e.g., $.user.emails[0].address).

as_dict()

Serialize to a dictionary of field, value, scalar_type, and array_count.

Source code in src/nemo_safe_synthesizer/data_processing/records/base.py
def as_dict(self):
    """Serialize to a dictionary of field, value, scalar_type, and array_count."""
    return {
        FIELD: self.field,
        VALUE: self.value,
        SCALAR_TYPE: self.scalar_type,
        ARRAY_COUNT: self.array_count,
    }
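A minimal usage sketch. The class below is a stand-in that mirrors the documented constructor and `as_dict`, not the library class itself, and it assumes the `FIELD`/`VALUE`/`SCALAR_TYPE`/`ARRAY_COUNT` constants resolve to the lowercase key names shown:

```python
# Stand-in mirroring the documented KVPair; the real class lives in
# src/nemo_safe_synthesizer/data_processing/records/base.py.
class KVPair:
    def __init__(self, field: str, value, scalar_type: str,
                 array_count: int, value_path: tuple):
        self.field = field
        self.value = value
        self.scalar_type = scalar_type
        self.array_count = array_count
        self.value_path = value_path

    def as_dict(self):
        # Assumes FIELD/VALUE/SCALAR_TYPE/ARRAY_COUNT are these strings.
        return {
            "field": self.field,
            "value": self.value,
            "scalar_type": self.scalar_type,
            "array_count": self.array_count,
        }

pair = KVPair(
    field="user.emails.address",                  # array markers removed
    value="ada@example.com",
    scalar_type="string",
    array_count=1,                                # nested in one array level
    value_path=("user", "emails", 0, "address"),  # structural path
)
pair.as_dict()
# → {'field': 'user.emails.address', 'value': 'ada@example.com',
#    'scalar_type': 'string', 'array_count': 1}
```

Note that `value_path` is stored but deliberately excluded from the serialized dictionary; it backs the `json_path` property instead.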

BaseRecord(original)

Bases: ABC

Abstract base for structured record representations.

Subclasses implement unpack to flatten the original record into a list of KVPair entries and a set of field names.

Parameters:

- original: The raw record data (typically a dict or string). Required.

Methods:

- unpack: Flatten self.original into self.kv_pairs and self.fields.
- as_dict: Serialize the record to a dictionary with original data, kv_pairs, and fields.

Source code in src/nemo_safe_synthesizer/data_processing/records/base.py
def __init__(self, original):
    self.original = original
    self.kv_pairs = []
    self.fields = set()
    self.unpack()

unpack() abstractmethod

Flatten self.original into self.kv_pairs and self.fields.

Must be implemented by subclasses to handle format-specific unpacking (e.g., JSON objects, CSV rows).

Source code in src/nemo_safe_synthesizer/data_processing/records/base.py
@abstractmethod
def unpack(self):  # pragma: no cover
    """Flatten ``self.original`` into ``self.kv_pairs`` and ``self.fields``.

    Must be implemented by subclasses to handle format-specific unpacking
    (e.g., JSON objects, CSV rows).
    """
    pass

as_dict()

Serialize the record to a dictionary with original data, kv_pairs, and fields.

Source code in src/nemo_safe_synthesizer/data_processing/records/base.py
def as_dict(self):
    """Serialize the record to a dictionary with original data, kv_pairs, and fields."""
    out = {
        ORIGINAL: self.original,
        KV_PAIRS: [p.as_dict() for p in self.kv_pairs],
        FIELDS: list(self.fields),
    }
    return out
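A sketch of the subclassing contract: `__init__` stores the original and immediately calls `unpack`, so a concrete subclass only fills in the flattening step. `FlatDictRecord` is hypothetical, and it appends bare tuples for brevity where a real subclass would append `KVPair` instances:

```python
from abc import ABC, abstractmethod

# Minimal stand-in for the documented BaseRecord lifecycle.
class BaseRecord(ABC):
    def __init__(self, original):
        self.original = original
        self.kv_pairs = []
        self.fields = set()
        self.unpack()  # subclasses populate kv_pairs and fields here

    @abstractmethod
    def unpack(self): ...

# Hypothetical subclass handling a flat dict of scalars.
class FlatDictRecord(BaseRecord):
    def unpack(self):
        for field, value in self.original.items():
            self.kv_pairs.append((field, value))
            self.fields.add(field)

rec = FlatDictRecord({"name": "Ada", "age": 36})
sorted(rec.fields)  # → ['age', 'name']
```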

tokenize_on_upper(data)

Split a camelCase or PascalCase string into lowercase tokens.

Parameters:

- data (str): String to tokenize. Required.

Returns:

- list[str]: List of lowercase token strings, or an empty list if data is empty.

Source code in src/nemo_safe_synthesizer/data_processing/records/base.py
def tokenize_on_upper(data: str) -> list[str]:
    """Split a camelCase or PascalCase string into lowercase tokens.

    Args:
        data: String to tokenize.

    Returns:
        List of lowercase token strings, or an empty list if ``data`` is empty.
    """
    if not data:
        return []
    out = []
    curr = []
    curr.append(data[0])
    for i in range(1, len(data)):
        if data[i].isupper() and data[i - 1].islower():
            out.append("".join(curr).casefold())
            curr = []
        elif i < len(data) - 1 and data[i].isupper() and data[i + 1].islower():
            out.append("".join(curr).casefold())
            curr = []
        curr.append(data[i])
    out.append("".join(curr).casefold())
    return out
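The function above can be exercised directly; a boundary is emitted either at a lower-to-upper transition ("camelCase") or at the last uppercase letter before a lowercase one ("HTTPServer"):

```python
def tokenize_on_upper(data: str) -> list[str]:
    # Verbatim logic from the source above.
    if not data:
        return []
    out = []
    curr = [data[0]]
    for i in range(1, len(data)):
        if data[i].isupper() and data[i - 1].islower():
            out.append("".join(curr).casefold())
            curr = []
        elif i < len(data) - 1 and data[i].isupper() and data[i + 1].islower():
            out.append("".join(curr).casefold())
            curr = []
        curr.append(data[i])
    out.append("".join(curr).casefold())
    return out

tokenize_on_upper("camelCase")   # → ['camel', 'case']
tokenize_on_upper("HTTPServer")  # → ['http', 'server']
tokenize_on_upper("plain")       # → ['plain']
tokenize_on_upper("")            # → []
```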

tokenize_header(field)

Tokenize a field/column name into lowercase word tokens.

Underscores are treated as separators, and camelCase boundaries are split via tokenize_on_upper.

Parameters:

- field (str): Field name to tokenize. Required.

Returns:

- list[str]: List of lowercase word tokens extracted from the field name.

Source code in src/nemo_safe_synthesizer/data_processing/records/base.py
def tokenize_header(field: str) -> list[str]:
    """Tokenize a field/column name into lowercase word tokens.

    Underscores are treated as separators, and camelCase boundaries are
    split via ``tokenize_on_upper``.

    Args:
        field: Field name to tokenize.

    Returns:
        List of lowercase word tokens extracted from the field name.
    """
    out = []
    # For header tokenization we don't consider `_` a word character,
    # so replace with `-` to split on it.
    field = field.replace("_", "-")
    base_tokens = re.findall(WORD_TOKENIZER, field)
    for token in base_tokens:
        out.extend(tokenize_on_upper(token))
    return out
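A runnable sketch of the two-stage tokenization. The `WORD_TOKENIZER` pattern is not shown on this page, so a plausible stand-in that matches runs of alphanumerics is assumed here:

```python
import re

# Assumption: stand-in for the module's WORD_TOKENIZER constant.
WORD_TOKENIZER = r"[A-Za-z0-9]+"

def tokenize_on_upper(data: str) -> list[str]:
    # Same logic as the source above, with the two boundary cases merged.
    if not data:
        return []
    out, curr = [], [data[0]]
    for i in range(1, len(data)):
        if (data[i].isupper() and data[i - 1].islower()) or (
            i < len(data) - 1 and data[i].isupper() and data[i + 1].islower()
        ):
            out.append("".join(curr).casefold())
            curr = []
        curr.append(data[i])
    out.append("".join(curr).casefold())
    return out

def tokenize_header(field: str) -> list[str]:
    out = []
    field = field.replace("_", "-")  # treat underscores as separators
    for token in re.findall(WORD_TOKENIZER, field):
        out.extend(tokenize_on_upper(token))
    return out

tokenize_header("user_emailAddress")  # → ['user', 'email', 'address']
tokenize_header("FirstName")          # → ['first', 'name']
```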

get_type_as_string(value)

Return the JSON schema type name for a Python scalar value.

Source code in src/nemo_safe_synthesizer/data_processing/records/base.py
def get_type_as_string(value) -> str:
    """Return the JSON schema type name for a Python scalar value."""
    if isinstance(value, str):
        return STRING
    elif isinstance(value, bool):
        if str(value) in ("True", "False"):
            return BOOL
    elif isinstance(value, Number):
        return NUMBER
    elif value is None:
        return NULL

    return NULL
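A sketch of the type mapping. The `STRING`/`BOOL`/`NUMBER`/`NULL` constants are not shown on this page, so JSON-schema-style lowercase names are assumed, and the redundant inner bool check is dropped:

```python
from numbers import Number

# Assumption: stand-ins for the module's type-name constants.
STRING, BOOL, NUMBER, NULL = "string", "boolean", "number", "null"

def get_type_as_string(value) -> str:
    # Same branch order as the source: bool must be tested before
    # Number, since bool is a subclass of int.
    if isinstance(value, str):
        return STRING
    elif isinstance(value, bool):
        return BOOL
    elif isinstance(value, Number):
        return NUMBER
    return NULL  # None and any unrecognized type

get_type_as_string("x")    # → 'string'
get_type_as_string(True)   # → 'boolean'
get_type_as_string(3.14)   # → 'number'
get_type_as_string(None)   # → 'null'
```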

normalize_labels(labels)

Normalize labels by converting them to lowercase.

Source code in src/nemo_safe_synthesizer/data_processing/records/base.py
def normalize_labels(labels: Iterable[str]) -> set[str]:
    """Normalize labels by converting them to lowercase."""
    return {normalize_label(label) for label in labels}

normalize_label(label)

Convert a single label to lowercase.

Source code in src/nemo_safe_synthesizer/data_processing/records/base.py
def normalize_label(label: str) -> str:
    """Convert a single label to lowercase."""
    return label.lower()