record_utils
Utilities for extracting, validating, and converting JSONL records.
Provides regex-based JSONL extraction, JSON-schema validation (including time-series interval checks), DataFrame normalization, and JSONL serialization.
Functions:
| Name | Description |
|---|---|
| `is_safe_for_float_conversion` | Check if a value can be safely converted to float64 without overflow. |
| `check_record_for_large_numbers` | Check if a record contains any numbers that would cause float64 overflow. |
| `check_if_records_are_ordered` | Check if the records are in ascending order based on the given order_by column. |
| `extract_records_from_jsonl_string` | Extract and return tabular records from the given JSONL string. |
| `extract_groups_from_jsonl_string` | Extract groups of records from the given JSONL string. |
| `extract_and_validate_records` | Extract and validate records from the given JSONL string. |
| `extract_and_validate_timeseries_records` | Extract and validate sequential records with enforced time interval constraints. |
| `normalize_dataframe` | Normalize a DataFrame of generated records via a CSV round-trip. |
| `records_to_jsonl` | Convert list of records to a JSONL string. |
is_safe_for_float_conversion(value)
Check if a value can be safely converted to float64 without overflow.
Only int values can cause overflow; all other types are considered safe.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `value` | `str \| int \| float \| None \| list \| dict` | The value to check. | required |
Returns:
| Type | Description |
|---|---|
| `bool` | True if the value can be safely converted to float64, False otherwise. |
Source code in src/nemo_safe_synthesizer/data_processing/record_utils.py
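The rule above can be illustrated with a minimal sketch. This is not the library's implementation: the name `is_safe_sketch` is hypothetical, and the sketch relies only on the fact that Python's `float()` raises `OverflowError` when an int is too large for float64.

```python
def is_safe_sketch(value):
    """Hypothetical stand-in: return False only for ints too large for float64."""
    if isinstance(value, int):
        try:
            float(value)  # raises OverflowError for ints beyond the float64 range
        except OverflowError:
            return False
    # Assumption: every non-int type is treated as safe, as described above.
    return True


print(is_safe_sketch(10**400))  # an int far outside the float64 range
print(is_safe_sketch(42))
print(is_safe_sketch("1" * 400))  # strings are not inspected in this sketch
```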
check_record_for_large_numbers(record)
Check if a record contains any numbers that would cause float64 overflow.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `record` | `dict` | Dictionary of field names to values. | required |
Returns:
| Type | Description |
|---|---|
| `str \| None` | An error message describing the first unsafe value found, or None if all values are safe. |
Source code in src/nemo_safe_synthesizer/data_processing/record_utils.py
check_if_records_are_ordered(records, order_by)
Check if the records are in ascending order based on the given order_by column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `records` | `list[dict]` | List of JSONL records. | required |
| `order_by` | `str` | Column to check for ordering. | required |
Returns:
| Type | Description |
|---|---|
| `bool` | True if the records are ordered by the given column, otherwise False. |
Source code in src/nemo_safe_synthesizer/data_processing/record_utils.py
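An ascending-order check of this kind can be sketched as a pairwise comparison of neighbouring values. The name `records_are_ordered_sketch` is hypothetical, and treating equal neighbouring values as ordered is an assumption of this sketch, not something the description above states.

```python
def records_are_ordered_sketch(records, order_by):
    """Hypothetical stand-in: True if the values of `order_by` never decrease."""
    values = [record[order_by] for record in records]
    # Compare each value with its successor; ties count as ordered here (assumption).
    return all(a <= b for a, b in zip(values, values[1:]))


rows = [{"t": 1}, {"t": 2}, {"t": 2}, {"t": 5}]
print(records_are_ordered_sketch(rows, "t"))
print(records_are_ordered_sketch(rows[::-1], "t"))
```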
extract_records_from_jsonl_string(jsonl_string)
Extract and return tabular records from the given JSONL string.
extract_groups_from_jsonl_string(jsonl_string, bos, eos)
Extract groups of records from the given JSONL string.
This function assumes that the complete group of records is enclosed by the given beginning-of-sequence (bos) and end-of-sequence (eos) tokens.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `jsonl_string` | `str` | Single JSONL string containing grouped tabular records. | required |
| `bos` | `str` | Beginning-of-sequence token used to identify the start of a group. | required |
| `eos` | `str` | End-of-sequence token used to identify the end of a group. | required |
Returns:
| Type | Description |
|---|---|
| `list[str]` | Substrings matching complete bos/eos-delimited record groups. |
Source code in src/nemo_safe_synthesizer/data_processing/record_utils.py
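One way to picture bos/eos-delimited extraction is a non-greedy regular expression over the whole string. The sketch below is an illustration under assumptions, not the library's code: the name `extract_groups_sketch` and the `<s>`/`</s>` tokens are hypothetical, and keeping the tokens in the returned substrings is a choice made only for this example.

```python
import re


def extract_groups_sketch(jsonl_string, bos, eos):
    """Hypothetical stand-in: return every complete bos...eos span, tokens included."""
    # re.escape keeps tokens such as "<s>" or "[BOS]" from being read as regex syntax.
    pattern = re.compile(re.escape(bos) + r".*?" + re.escape(eos), re.DOTALL)
    return pattern.findall(jsonl_string)


text = '<s>{"a": 1}\n{"a": 2}</s><s>{"a": 3}</s><s>{"a": 4}'
groups = extract_groups_sketch(text, "<s>", "</s>")
print(groups)  # the trailing group has no eos token, so it is not matched
```

The non-greedy `.*?` stops at the first eos token after each bos token, so an unterminated group at the end of the string is left out rather than merged into an earlier one.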
extract_and_validate_records(jsonl_string, schema)
Extract and validate records from the given JSONL string.
The records are validated against the given schema using jsonschema.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `jsonl_string` | `str` | Single JSONL string containing tabular records. | required |
| `schema` | `dict` | JSON schema as a dictionary. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `valid_records` | `list[dict]` | List of valid records. |
| `invalid_records` | `list[str]` | List of invalid records. |
| `invalid_record_errors` | `list[tuple[str, str]]` | List of errors for invalid records, each a (message, validator) tuple. |
Source code in src/nemo_safe_synthesizer/data_processing/record_utils.py
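The three-part return value can be illustrated with a dependency-free sketch. Everything here is an assumption for illustration: the name `extract_and_validate_sketch` is hypothetical, records are split on newlines rather than extracted by regex, and a hand-written "required fields" check stands in for the jsonschema validation that the real function performs.

```python
import json


def extract_and_validate_sketch(jsonl_string, required_fields):
    """Hypothetical stand-in: split JSONL lines into valid and invalid records."""
    valid_records, invalid_records, errors = [], [], []
    for line in jsonl_string.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            invalid_records.append(line)
            errors.append((exc.msg, "json"))
            continue
        if not isinstance(record, dict):
            invalid_records.append(line)
            errors.append(("record is not a JSON object", "type"))
            continue
        # Stand-in for schema validation: only checks that required keys exist.
        missing = [name for name in required_fields if name not in record]
        if missing:
            invalid_records.append(line)
            errors.append((f"missing fields: {missing}", "required"))
        else:
            valid_records.append(record)
    return valid_records, invalid_records, errors


text = '{"id": 1, "name": "a"}\n{"id": 2}\nnot json'
valid, invalid, errors = extract_and_validate_sketch(text, ["id", "name"])
print(valid)
print(invalid)
print(errors)
```

Valid records come back parsed as dicts while invalid ones stay as raw strings, which mirrors the `list[dict]` and `list[str]` types in the table above.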
extract_and_validate_timeseries_records(jsonl_string, schema, time_column, interval_seconds, time_format)
Extract and validate sequential records with enforced time interval constraints.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `jsonl_string` | `str` | JSONL string containing series data. | required |
| `schema` | `dict` | JSON schema describing the records. | required |
| `time_column` | `str` | Column containing the timestamp used for interval validation. | required |
| `interval_seconds` | `int \| None` | (Optional) Expected interval in seconds between consecutive timestamps. If not provided, no time interval validation is performed. | required |
| `time_format` | `str` | Format of the timestamp column (required, should be set from config). | required |
Returns:
| Type | Description |
|---|---|
| `tuple[list[dict], list[str], list[tuple[str, str]]]` | Tuple of valid records, invalid record strings, and their associated errors. |
Source code in src/nemo_safe_synthesizer/data_processing/record_utils.py
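The interval constraint can be sketched on its own, separately from schema validation. The name `intervals_match_sketch` is hypothetical, and two points are assumptions of this sketch: `time_format` is taken to be a `datetime.strptime` format string, and passing `None` for `interval_seconds` is taken to mean the check is skipped.

```python
from datetime import datetime


def intervals_match_sketch(records, time_column, interval_seconds, time_format):
    """Hypothetical stand-in: True if consecutive timestamps are exactly interval_seconds apart."""
    if interval_seconds is None:
        return True  # assumption: no interval given means nothing to check
    stamps = [datetime.strptime(r[time_column], time_format) for r in records]
    return all(
        (later - earlier).total_seconds() == interval_seconds
        for earlier, later in zip(stamps, stamps[1:])
    )


rows = [
    {"ts": "2024-01-01 00:00:00"},
    {"ts": "2024-01-01 00:01:00"},
    {"ts": "2024-01-01 00:02:00"},
]
fmt = "%Y-%m-%d %H:%M:%S"
print(intervals_match_sketch(rows, "ts", 60, fmt))
print(intervals_match_sketch(rows, "ts", 30, fmt))
```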
normalize_dataframe(dataframe)
Normalize a DataFrame of generated records via a CSV round-trip.
Serializes to CSV and reads back to standardize missing-value representations (NaN/None/NA) across mixed-type columns. Falls back to ignoring encoding errors if the initial round-trip fails.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `dataframe` | `DataFrame` | DataFrame to normalize. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | DataFrame with missing values normalized and invalid UTF-8 characters dropped. |
Source code in src/nemo_safe_synthesizer/data_processing/record_utils.py
records_to_jsonl(records)
Convert list of records to a JSONL string.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `records` | `DataFrame \| list[dict] \| dict` | DataFrame, list of records, or dict. | required |
Returns:
| Type | Description |
|---|---|
| `str` | The JSONL string. |
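For the list and dict inputs, the conversion can be sketched with the standard library alone. The name `records_to_jsonl_sketch` is hypothetical; DataFrame input is not covered here, and whether the real function emits a trailing newline is not stated above, so this sketch simply omits one.

```python
import json


def records_to_jsonl_sketch(records):
    """Hypothetical stand-in: serialize a dict or list of dicts, one JSON object per line."""
    if isinstance(records, dict):
        records = [records]  # treat a single record as a one-element list
    return "\n".join(json.dumps(record) for record in records)


print(records_to_jsonl_sketch([{"a": 1}, {"a": 2}]))
print(records_to_jsonl_sketch({"a": 3}))
```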