
dataset

DataFrame normalization and JSON schema inference for training data.

Provides utilities for standardizing DataFrames (type coercion, missing-value handling) and deriving JSON schemas used for validating generated records.

Functions:

| Name | Description |
| --- | --- |
| `check_enum_type` | Return enum schema if the series is an enum, otherwise return None. |
| `make_json_schema` | Generate a JSON schema from the given DataFrame. |
| `normalize_dataframe` | Ensure DataFrame meets standards for use in Safe Synthesizer models. |
| `normalize_column` | Normalize the given pandas series. |

`check_enum_type(series, max_distinct=None, max_singletons=None)`

Return enum schema if the series is an enum, otherwise return None.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `series` | `Series` | Data object to check for enum type. | *required* |
| `max_distinct` | `int \| float \| None` | Maximum number of distinct values to be considered an enum. | `None` |
| `max_singletons` | `int \| float \| None` | Maximum number of values with a single occurrence to be considered an enum. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `dict \| None` | The enum schema if the series is an enum, otherwise `None`. |

Source code in `src/nemo_safe_synthesizer/data_processing/dataset.py`:

```python
def check_enum_type(
    series: pd.Series,
    max_distinct: int | float | None = None,
    max_singletons: int | float | None = None,
) -> dict | None:
    """Return enum schema if the series is an enum, otherwise return None.

    Args:
        series: Data object to check for enum type.
        max_distinct: Maximum number of distinct values to be considered an enum.
        max_singletons: Maximum number of values with a single occurrence to be
            considered an enum.

    Returns:
        The enum schema if the series is an enum, otherwise None.
    """
    if max_distinct is None:
        max_distinct = len(series) ** SCHEMA_ENUM_MAX_DISTINCT_EXP
    if max_singletons is None:
        max_singletons = len(series) ** SCHEMA_ENUM_MAX_SINGLETONS_EXP

    value_counts = series.value_counts(dropna=False)

    is_enum = value_counts.count() <= max_distinct and (value_counts == 1).sum() <= max_singletons

    return {"enum": [_handle_enum_value(v) for v in value_counts.index.sort_values()]} if is_enum else None
```

`make_json_schema(df, string_length_multiple=STRING_LENGTH_MULTIPLE)`

Generate a JSON schema from the given DataFrame.

Inspects each column to determine its JSON type, numeric range, string length bounds, or enum values. See https://json-schema.org for the schema specification.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | `DataFrame` | DataFrame to derive a JSON schema from. | *required* |
| `string_length_multiple` | `float` | Multiplier applied to observed string lengths to set `minLength` (divided) and `maxLength` (multiplied) bounds in the schema. | `STRING_LENGTH_MULTIPLE` |

Returns:

| Type | Description |
| --- | --- |
| `dict` | A dictionary representing the JSON schema with `type`, `properties`, and `required` keys. |

Source code in `src/nemo_safe_synthesizer/data_processing/dataset.py`:

```python
def make_json_schema(df: pd.DataFrame, string_length_multiple: float = STRING_LENGTH_MULTIPLE) -> dict:
    """Generate a JSON schema from the given DataFrame.

    Inspects each column to determine its JSON type, numeric range, string
    length bounds, or enum values. See https://json-schema.org for the
    schema specification.

    Args:
        df: DataFrame to derive a JSON schema from.
        string_length_multiple: Multiplier applied to observed string lengths
            to set ``minLength`` (divided) and ``maxLength`` (multiplied)
            bounds in the schema.

    Returns:
        A dictionary representing the JSON schema with ``type``, ``properties``,
        and ``required`` keys.
    """
    schema = {"type": "object", "properties": {}, "required": []}

    for col in df.columns:
        series = df[col]
        col_schema = check_enum_type(series) or {}

        if not col_schema:
            col_types = [JSON_TYPE_MAP.get(t, "string") for t in series.apply(lambda x: type(x).__name__).unique()]

            col_types = list(set(col_types))

            if series.isna().any():
                col_types.append("null")

            if set(col_types).issubset(["integer", "number"]):
                col_schema.update(
                    {
                        "type": col_types[0],  # actual element instead of list
                        "minimum": float(series.min()),
                        "maximum": float(series.max()),
                    }
                )
            elif col_types == ["string"]:
                str_length = series.astype(str).apply(len)
                col_schema.update(
                    {
                        "type": "string",
                        "minLength": round(str_length.min() / string_length_multiple),
                        "maxLength": round(string_length_multiple * str_length.max()),
                    }
                )
            else:
                col_schema.update({"type": col_types})

        schema["properties"][col] = col_schema

        if not series.isna().any():
            schema["required"].append(col)

    return schema
```
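The overall flow can be sketched with a simplified, self-contained version covering just the numeric and string branches. `JSON_TYPE_MAP` and `STRING_LENGTH_MULTIPLE` are library constants, so the mapping and multiplier below are assumed stand-ins (note that iterating a pandas integer column yields `numpy.int64` elements, hence the `"int64"` key):

```python
import pandas as pd

TYPE_MAP = {"int": "integer", "int64": "integer",
            "float": "number", "float64": "number",
            "str": "string"}  # assumed subset of JSON_TYPE_MAP
STRING_LENGTH_MULTIPLE = 2.0  # hypothetical default


def make_json_schema_sketch(df: pd.DataFrame) -> dict:
    schema = {"type": "object", "properties": {}, "required": []}
    for col in df.columns:
        series = df[col]
        types = sorted({TYPE_MAP.get(type(x).__name__, "string") for x in series})
        if set(types) <= {"integer", "number"}:
            # Numeric column: record the observed value range.
            col_schema = {"type": types[0],
                          "minimum": float(series.min()),
                          "maximum": float(series.max())}
        else:
            # String column: widen the observed length range by the multiplier
            # so slightly longer or shorter generated strings still validate.
            lengths = series.astype(str).apply(len)
            col_schema = {"type": "string",
                          "minLength": round(lengths.min() / STRING_LENGTH_MULTIPLE),
                          "maxLength": round(STRING_LENGTH_MULTIPLE * lengths.max())}
        schema["properties"][col] = col_schema
        if not series.isna().any():
            schema["required"].append(col)
    return schema


df = pd.DataFrame({"age": [25, 30, 41], "name": ["Ada", "Grace", "Alan"]})
schema = make_json_schema_sketch(df)
print(schema["properties"]["age"])   # {'type': 'integer', 'minimum': 25.0, 'maximum': 41.0}
print(schema["properties"]["name"])  # {'type': 'string', 'minLength': 2, 'maxLength': 10}
```

Widening the string-length bounds rather than using the observed extremes directly gives generated records some slack without admitting arbitrarily long strings.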

`normalize_dataframe(df)`

Ensure DataFrame meets standards for use in Safe Synthesizer models.

Pandas allows constructing many odd DataFrames with corner cases that can violate the assumptions of downstream code and other libraries. Such differences arise when a DataFrame is created from different sources (CSV vs. JSON vs. JSONL) or when a manually crafted DataFrame is provided, such as when using SafeSynthesizerDataset for testing. Rather than being defensive in every model that uses SafeSynthesizerDataset, we standardize all DataFrames here.

Enforced standards, i.e., assumptions models that use SafeSynthesizerDataset may make:

- Every column has a single datatype, e.g. all float, all str, or all int, with the exception of missing values in object columns, where we keep the pandas behavior of representing missing with a float numpy.nan for now.
- Date, time, datetime, and timedelta types are converted to string for downstream consistency between tokenization and schema serialization. Decimal types are converted to float.
Source code in `src/nemo_safe_synthesizer/data_processing/dataset.py`:

```python
def normalize_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Ensure DataFrame meets standards for use in Safe Synthesizer models.

    Pandas may be used to construct a lot of odd DataFrames with weird corner
    cases that can violate assumptions of downstream code and other libraries.
    This includes differences creating a DataFrame from different sources (like
    csv vs json vs jsonl) or when a manually crafted DataFrame is provided, such
    as using SafeSynthesizerDataset for testing. Rather than be defensive in
    every model where we use SafeSynthesizerDataset, we do some standardization
    of all DataFrames here.

    Enforced standards, i.e., assumptions models that use SafeSynthesizerDataset may make:
    - Every column has a single datatype, e.g. all float, all str, or all int,
      with the exception of missing values in object columns, where we keep the
      pandas behavior of representing missing with a float numpy.nan for now.

    - Date, time, datetime, and timedelta types are converted to string for
      downstream consistency between tokenization and schema serialization.
      Decimal types are converted to float.
    """
    column_series = {column_name: normalize_column(df[column_name]) for column_name in df.columns}
    return pd.DataFrame(column_series)
```

`normalize_column(series)`

Normalize the given pandas series.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `series` | `Series` | Series to normalize. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `Series` | Normalized series. |

Source code in `src/nemo_safe_synthesizer/data_processing/dataset.py`:

```python
def normalize_column(series: pd.Series) -> pd.Series:
    """Normalize the given pandas series.

    Args:
        series: Series to normalize.

    Returns:
        Normalized series.
    """
    series_type = pd.api.types.infer_dtype(series, skipna=True)
    if series_type in CONVERT_TO_STR_TYPES:
        return series.astype(str).mask(series.isna(), None)
    if series_type in CONVERT_TO_FLOAT_TYPES:
        return series.astype(float).mask(series.isna(), None)
    return series
```
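A sketch of the normalization path for a datetime column, using `pd.api.types.infer_dtype` exactly as the source does. The membership sets `CONVERT_TO_STR_TYPES` and `CONVERT_TO_FLOAT_TYPES` are library constants, so the contents below are assumed for illustration:

```python
import pandas as pd

# Assumed stand-ins; the real sets are defined in the library.
CONVERT_TO_STR_TYPES = {"datetime64", "datetime", "date", "time", "timedelta64"}
CONVERT_TO_FLOAT_TYPES = {"decimal"}


def normalize_column_sketch(series: pd.Series) -> pd.Series:
    # infer_dtype reports a dtype label such as "datetime64" or "integer";
    # skipna=True means missing values do not affect the inferred label.
    kind = pd.api.types.infer_dtype(series, skipna=True)
    if kind in CONVERT_TO_STR_TYPES:
        # Convert to string, then restore missing values as None
        # (astype(str) would otherwise turn NaT into the string "NaT").
        return series.astype(str).mask(series.isna(), None)
    if kind in CONVERT_TO_FLOAT_TYPES:
        return series.astype(float).mask(series.isna(), None)
    return series


dates = pd.Series(pd.to_datetime(["2024-01-01 12:30:00", None]))
out = normalize_column_sketch(dates)
print(out.iloc[0])  # '2024-01-01 12:30:00' as a plain string

# Columns whose inferred type matches neither set pass through unchanged.
print(normalize_column_sketch(pd.Series([1, 2, 3])).tolist())  # [1, 2, 3]
```

The `mask(series.isna(), None)` step is what preserves missingness through the string conversion, matching the documented standard that missing values stay missing rather than becoming literal `"NaT"` strings.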