
base

Base record representation and field-level tokenization utilities.

Provides BaseRecord -- the abstract base for record types used by the PII replacer -- along with KVPair for representing flattened key-value entries, and helpers for tokenizing field names (tokenize_header, tokenize_on_upper).

Classes:

- KVPair: A single flattened key-value entry from a record.
- BaseRecord: Abstract base for structured record representations.

Functions:

- tokenize_on_upper: Split a camelCase or PascalCase string into lowercase tokens.
- tokenize_header: Tokenize a field/column name into lowercase word tokens.
- get_type_as_string: Return the JSON schema type name for a Python scalar value.
- normalize_labels: Normalize labels by converting them to lowercase.
- normalize_label: Convert a single label to lowercase.

KVPair(field, value, scalar_type, array_count, value_path)

A single flattened key-value entry from a record.

Stores the field name, value, scalar type, nesting depth (array count), and the structural path to the value in the original document.

Parameters:

- field (str): Dot-joined field name (array markers removed). Required.
- value (str | Number): The scalar value. Required.
- scalar_type (str): JSON schema type string ("string", "number", etc.). Required.
- array_count (int): Number of array levels this value is nested within. Required.
- value_path (ValuePath): Structural path tuple identifying the value's location. Required.

Methods:

- as_dict: Serialize to a dictionary of field, value, scalar_type, and array_count.

Attributes:

- json_path: JSONPath string (e.g., $.user.emails[0].address).

Source code in src/nemo_safe_synthesizer/data_processing/records/base.py
def __init__(
    self,
    field: str,
    value: str | Number,
    scalar_type: str,
    array_count: int,
    value_path: ValuePath,
):
    self.field = field
    self.value = value
    self.scalar_type = scalar_type
    self.array_count = array_count
    self.value_path = value_path
    self.field_tokens = tokenize_header(field)

json_path property

JSONPath string (e.g., $.user.emails[0].address).

as_dict()

Serialize to a dictionary of field, value, scalar_type, and array_count.

Source code in src/nemo_safe_synthesizer/data_processing/records/base.py
def as_dict(self):
    """Serialize to a dictionary of field, value, scalar_type, and array_count."""
    return {
        FIELD: self.field,
        VALUE: self.value,
        SCALAR_TYPE: self.scalar_type,
        ARRAY_COUNT: self.array_count,
    }
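A minimal usage sketch. The class below is a stand-in that mirrors the documented constructor and `as_dict`, not the library class itself, and it assumes the `FIELD`/`VALUE`/`SCALAR_TYPE`/`ARRAY_COUNT` constants resolve to the lowercase key names shown:

```python
# Stand-in mirroring the documented KVPair; the real class lives in
# src/nemo_safe_synthesizer/data_processing/records/base.py.
class KVPair:
    def __init__(self, field: str, value, scalar_type: str,
                 array_count: int, value_path: tuple):
        self.field = field
        self.value = value
        self.scalar_type = scalar_type
        self.array_count = array_count
        self.value_path = value_path

    def as_dict(self):
        # Assumes FIELD/VALUE/SCALAR_TYPE/ARRAY_COUNT are these strings.
        return {
            "field": self.field,
            "value": self.value,
            "scalar_type": self.scalar_type,
            "array_count": self.array_count,
        }

pair = KVPair(
    field="user.emails.address",                  # array markers removed
    value="ada@example.com",
    scalar_type="string",
    array_count=1,                                # nested in one array level
    value_path=("user", "emails", 0, "address"),  # structural path
)
pair.as_dict()
# → {'field': 'user.emails.address', 'value': 'ada@example.com',
#    'scalar_type': 'string', 'array_count': 1}
```

Note that `value_path` is stored but deliberately excluded from the serialized dictionary; it backs the `json_path` property instead.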

BaseRecord(original)

Bases: ABC

Abstract base for structured record representations.

Subclasses implement unpack to flatten the original record into a list of KVPair entries and a set of field names.

Parameters:

- original: The raw record data (typically a dict or string). Required.

Methods:

- unpack: Flatten self.original into self.kv_pairs and self.fields.
- as_dict: Serialize the record to a dictionary with original data, kv_pairs, and fields.

Source code in src/nemo_safe_synthesizer/data_processing/records/base.py
def __init__(self, original):
    self.original = original
    self.kv_pairs = []
    self.fields = set()
    self.unpack()

unpack() abstractmethod

Flatten self.original into self.kv_pairs and self.fields.

Must be implemented by subclasses to handle format-specific unpacking (e.g., JSON objects, CSV rows).

Source code in src/nemo_safe_synthesizer/data_processing/records/base.py
@abstractmethod
def unpack(self):  # pragma: no cover
    """Flatten ``self.original`` into ``self.kv_pairs`` and ``self.fields``.

    Must be implemented by subclasses to handle format-specific unpacking
    (e.g., JSON objects, CSV rows).
    """
    pass

as_dict()

Serialize the record to a dictionary with original data, kv_pairs, and fields.

Source code in src/nemo_safe_synthesizer/data_processing/records/base.py
def as_dict(self):
    """Serialize the record to a dictionary with original data, kv_pairs, and fields."""
    out = {
        ORIGINAL: self.original,
        KV_PAIRS: [p.as_dict() for p in self.kv_pairs],
        FIELDS: list(self.fields),
    }
    return out
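A sketch of the subclassing contract: `__init__` stores the original and immediately calls `unpack`, so a concrete subclass only fills in the flattening step. `FlatDictRecord` is hypothetical, and it appends bare tuples for brevity where a real subclass would append `KVPair` instances:

```python
from abc import ABC, abstractmethod

# Minimal stand-in for the documented BaseRecord lifecycle.
class BaseRecord(ABC):
    def __init__(self, original):
        self.original = original
        self.kv_pairs = []
        self.fields = set()
        self.unpack()  # subclasses populate kv_pairs and fields here

    @abstractmethod
    def unpack(self): ...

# Hypothetical subclass handling a flat dict of scalars.
class FlatDictRecord(BaseRecord):
    def unpack(self):
        for field, value in self.original.items():
            self.kv_pairs.append((field, value))
            self.fields.add(field)

rec = FlatDictRecord({"name": "Ada", "age": 36})
sorted(rec.fields)  # → ['age', 'name']
```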

tokenize_on_upper(data)

Split a camelCase or PascalCase string into lowercase tokens.

Parameters:

- data (str): String to tokenize. Required.

Returns:

- list[str]: List of lowercase token strings, or an empty list if data is empty.

Source code in src/nemo_safe_synthesizer/data_processing/records/base.py
def tokenize_on_upper(data: str) -> list[str]:
    """Split a camelCase or PascalCase string into lowercase tokens.

    Args:
        data: String to tokenize.

    Returns:
        List of lowercase token strings, or an empty list if ``data`` is empty.
    """
    if not data:
        return []
    out = []
    curr = []
    curr.append(data[0])
    for i in range(1, len(data)):
        if data[i].isupper() and data[i - 1].islower():
            out.append("".join(curr).casefold())
            curr = []
        elif i < len(data) - 1 and data[i].isupper() and data[i + 1].islower():
            out.append("".join(curr).casefold())
            curr = []
        curr.append(data[i])
    out.append("".join(curr).casefold())
    return out
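The function above can be exercised directly; a boundary is emitted either at a lower-to-upper transition ("camelCase") or at the last uppercase letter before a lowercase one ("HTTPServer"):

```python
def tokenize_on_upper(data: str) -> list[str]:
    # Verbatim logic from the source above.
    if not data:
        return []
    out = []
    curr = [data[0]]
    for i in range(1, len(data)):
        if data[i].isupper() and data[i - 1].islower():
            out.append("".join(curr).casefold())
            curr = []
        elif i < len(data) - 1 and data[i].isupper() and data[i + 1].islower():
            out.append("".join(curr).casefold())
            curr = []
        curr.append(data[i])
    out.append("".join(curr).casefold())
    return out

tokenize_on_upper("camelCase")   # → ['camel', 'case']
tokenize_on_upper("HTTPServer")  # → ['http', 'server']
tokenize_on_upper("plain")       # → ['plain']
tokenize_on_upper("")            # → []
```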

tokenize_header(field)

Tokenize a field/column name into lowercase word tokens.

Underscores are treated as separators, and camelCase boundaries are split via tokenize_on_upper.

Parameters:

- field (str): Field name to tokenize. Required.

Returns:

- list[str]: List of lowercase word tokens extracted from the field name.

Source code in src/nemo_safe_synthesizer/data_processing/records/base.py
def tokenize_header(field: str) -> list[str]:
    """Tokenize a field/column name into lowercase word tokens.

    Underscores are treated as separators, and camelCase boundaries are
    split via ``tokenize_on_upper``.

    Args:
        field: Field name to tokenize.

    Returns:
        List of lowercase word tokens extracted from the field name.
    """
    out = []
    # For header tokenization we don't consider `_` a word character,
    # so replace with `-` to split on it.
    field = field.replace("_", "-")
    base_tokens = re.findall(WORD_TOKENIZER, field)
    for token in base_tokens:
        out.extend(tokenize_on_upper(token))
    return out
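A runnable sketch of the two-stage tokenization. The `WORD_TOKENIZER` pattern is not shown on this page, so a plausible stand-in that matches runs of alphanumerics is assumed here:

```python
import re

# Assumption: stand-in for the module's WORD_TOKENIZER constant.
WORD_TOKENIZER = r"[A-Za-z0-9]+"

def tokenize_on_upper(data: str) -> list[str]:
    # Same logic as the source above, with the two boundary cases merged.
    if not data:
        return []
    out, curr = [], [data[0]]
    for i in range(1, len(data)):
        if (data[i].isupper() and data[i - 1].islower()) or (
            i < len(data) - 1 and data[i].isupper() and data[i + 1].islower()
        ):
            out.append("".join(curr).casefold())
            curr = []
        curr.append(data[i])
    out.append("".join(curr).casefold())
    return out

def tokenize_header(field: str) -> list[str]:
    out = []
    field = field.replace("_", "-")  # treat underscores as separators
    for token in re.findall(WORD_TOKENIZER, field):
        out.extend(tokenize_on_upper(token))
    return out

tokenize_header("user_emailAddress")  # → ['user', 'email', 'address']
tokenize_header("FirstName")          # → ['first', 'name']
```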

get_type_as_string(value)

Return the JSON schema type name for a Python scalar value.

Source code in src/nemo_safe_synthesizer/data_processing/records/base.py
def get_type_as_string(value) -> str:
    """Return the JSON schema type name for a Python scalar value."""
    if isinstance(value, str):
        return STRING
    elif isinstance(value, bool):
        if str(value) in ("True", "False"):
            return BOOL
    elif isinstance(value, Number):
        return NUMBER
    elif value is None:
        return NULL

    return NULL
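A sketch of the type mapping. The `STRING`/`BOOL`/`NUMBER`/`NULL` constants are not shown on this page, so JSON-schema-style lowercase names are assumed, and the redundant inner bool check is dropped:

```python
from numbers import Number

# Assumption: stand-ins for the module's type-name constants.
STRING, BOOL, NUMBER, NULL = "string", "boolean", "number", "null"

def get_type_as_string(value) -> str:
    # Same branch order as the source: bool must be tested before
    # Number, since bool is a subclass of int.
    if isinstance(value, str):
        return STRING
    elif isinstance(value, bool):
        return BOOL
    elif isinstance(value, Number):
        return NUMBER
    return NULL  # None and any unrecognized type

get_type_as_string("x")    # → 'string'
get_type_as_string(True)   # → 'boolean'
get_type_as_string(3.14)   # → 'number'
get_type_as_string(None)   # → 'null'
```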

normalize_labels(labels)

Normalize labels by converting them to lowercase.

Source code in src/nemo_safe_synthesizer/data_processing/records/base.py
def normalize_labels(labels: Iterable[str]) -> set[str]:
    """Normalize labels by converting them to lowercase."""
    return {normalize_label(label) for label in labels}

normalize_label(label)

Convert a single label to lowercase.

Source code in src/nemo_safe_synthesizer/data_processing/records/base.py
def normalize_label(label: str) -> str:
    """Convert a single label to lowercase."""
    return label.lower()