dataset
DataFrame normalization and JSON schema inference for training data.
Provides utilities for standardizing DataFrames (type coercion, missing-value handling) and deriving JSON schemas used for validating generated records.
Functions:

| Name | Description |
|---|---|
| `check_enum_type` | Return enum schema if the series is an enum, otherwise return `None`. |
| `make_json_schema` | Generate a JSON schema from the given DataFrame. |
| `normalize_dataframe` | Ensure DataFrame meets standards for use in Safe Synthesizer models. |
| `normalize_column` | Normalize the given pandas series. |
check_enum_type(series, max_distinct=None, max_singletons=None)
Return enum schema if the series is an enum, otherwise return None.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `series` | `Series` | Data object to check for enum type. | required |
| `max_distinct` | `int \| float \| None` | Maximum number of distinct values to be considered an enum. | `None` |
| `max_singletons` | `int \| float \| None` | Maximum number of values with a single occurrence to be considered an enum. | `None` |

Returns:

| Type | Description |
|---|---|
| `dict \| None` | The enum schema if the series is an enum, otherwise `None`. |
Source code in src/nemo_safe_synthesizer/data_processing/dataset.py
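The enum heuristic described above can be sketched as follows. The threshold values and the shape of the returned schema dict are illustrative assumptions, not the library's actual defaults or output:

```python
import pandas as pd

def sketch_check_enum_type(series: pd.Series, max_distinct=10, max_singletons=2):
    """Return a JSON-schema-style enum dict if the series looks like an enum."""
    counts = series.dropna().value_counts()
    distinct = len(counts)
    singletons = int((counts == 1).sum())  # values that occur exactly once
    if distinct == 0 or distinct > max_distinct or singletons > max_singletons:
        return None
    return {"enum": sorted(counts.index.tolist())}

colors = pd.Series(["red", "blue", "red", "blue", "green", "red"])
sketch_check_enum_type(colors)  # {"enum": ["blue", "green", "red"]}
```

A high-cardinality column (many distinct values, most appearing once) fails both thresholds and yields `None`, which is the signal to fall back to a plain string or numeric schema.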
make_json_schema(df, string_length_multiple=STRING_LENGTH_MULTIPLE)
Generate a JSON schema from the given DataFrame.
Inspects each column to determine its JSON type, numeric range, string length bounds, or enum values. See https://json-schema.org for the schema specification.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | DataFrame to derive a JSON schema from. | required |
| `string_length_multiple` | `float` | Multiplier applied to observed string lengths to set string length bounds. | `STRING_LENGTH_MULTIPLE` |
Returns:

| Type | Description |
|---|---|
| `dict` | A dictionary representing the JSON schema. |
Source code in src/nemo_safe_synthesizer/data_processing/dataset.py
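A simplified version of this inference might look like the sketch below. The exact schema keys `make_json_schema` emits are not shown on this page, so the output shape and the `STRING_LENGTH_MULTIPLE` value here are assumptions:

```python
import pandas as pd

STRING_LENGTH_MULTIPLE = 1.5  # assumed value; the real constant lives in the library

def sketch_make_json_schema(df: pd.DataFrame, string_length_multiple=STRING_LENGTH_MULTIPLE):
    """Derive a minimal JSON schema (https://json-schema.org) from column dtypes."""
    properties = {}
    for name in df.columns:
        col = df[name]
        if pd.api.types.is_integer_dtype(col):
            properties[name] = {"type": "integer", "minimum": int(col.min()), "maximum": int(col.max())}
        elif pd.api.types.is_float_dtype(col):
            properties[name] = {"type": "number", "minimum": float(col.min()), "maximum": float(col.max())}
        else:
            lengths = col.astype(str).str.len()
            properties[name] = {"type": "string", "maxLength": int(lengths.max() * string_length_multiple)}
    return {"type": "object", "properties": properties, "required": list(df.columns)}

df = pd.DataFrame({"age": [21, 34, 45], "name": ["Ada", "Grace", "Edsger"]})
schema = sketch_make_json_schema(df)
# schema["properties"]["age"] -> {"type": "integer", "minimum": 21, "maximum": 45}
```

Multiplying observed string lengths by a factor above 1.0 leaves headroom so that generated records slightly longer than anything in the training data still validate.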
normalize_dataframe(df)
Ensure DataFrame meets standards for use in Safe Synthesizer models.
Pandas can construct DataFrames with odd corner cases that violate assumptions of downstream code and other libraries. These differences arise when creating a DataFrame from different sources (e.g., CSV vs. JSON vs. JSONL) or when a manually crafted DataFrame is provided, such as when using SafeSynthesizerDataset for testing. Rather than being defensive in every model that uses SafeSynthesizerDataset, we standardize all DataFrames here.
Enforced standards, i.e., assumptions that models using SafeSynthesizerDataset may make:

- Every column has a single datatype, e.g., all float, all str, or all int, with the exception of missing values in object columns, where we keep the pandas behavior of representing missing with a float `numpy.nan` for now.
- Date, time, datetime, and timedelta types are converted to string for downstream consistency between tokenization and schema serialization. Decimal types are converted to float.
Source code in src/nemo_safe_synthesizer/data_processing/dataset.py
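The conversions listed above (datetime-like columns to string, Decimal values to float) can be illustrated with a small sketch; this is not the library's implementation, just the documented rules applied column by column:

```python
from datetime import date, datetime
from decimal import Decimal

import pandas as pd

def sketch_normalize_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the documented type conversions column by column."""
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if pd.api.types.is_datetime64_any_dtype(col) or pd.api.types.is_timedelta64_dtype(col):
            out[name] = col.astype(str)  # datetime/timedelta -> string
        elif col.dtype == object:
            if col.map(lambda v: isinstance(v, Decimal)).any():
                out[name] = col.astype(float)  # Decimal -> float
            elif col.map(lambda v: isinstance(v, (date, datetime))).any():
                out[name] = col.astype(str)  # date/time objects -> string
    return out

df = pd.DataFrame({
    "when": pd.to_datetime(["2024-01-01", "2024-06-15"]),
    "price": [Decimal("1.50"), Decimal("2.25")],
})
normalized = sketch_normalize_dataframe(df)
# "when" becomes a string column; "price" becomes float64
```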
normalize_column(series)
Normalize the given pandas series.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `series` | `Series` | Series to normalize. | required |
Returns:

| Type | Description |
|---|---|
| `Series` | Normalized series. |
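This is presumably the per-column counterpart of normalize_dataframe. A sketch of the Decimal-to-float case from the documented standards; the function name and the all-Decimal check are assumptions for illustration:

```python
from decimal import Decimal

import pandas as pd

def sketch_normalize_column(series: pd.Series) -> pd.Series:
    """Coerce an object column of Decimals to float, per the documented standard."""
    if series.dtype == object and series.map(lambda v: isinstance(v, Decimal)).all():
        return series.astype(float)
    return series

prices = pd.Series([Decimal("9.99"), Decimal("19.99")])
sketch_normalize_column(prices).dtype  # float64
```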