record_utils
Utilities for extracting, validating, and converting JSONL records.
Provides regex-based JSONL extraction, JSON-schema validation (including time-series interval checks), DataFrame normalization, and JSONL serialization.
Classes:
| Name | Description |
|---|---|
| ParsedRecord | A single record extracted from an LLM completion. |
| ParsedResponse | Parsed result of a single LLM prompt response. |
Functions:
| Name | Description |
|---|---|
| is_safe_for_float_conversion | Check if a value can be safely converted to float64 without overflow. |
| check_record_for_large_numbers | Check if a record contains any numbers that would cause float64 overflow. |
| check_if_records_are_ordered | Check if the records are in ascending order based on the given order_by column. |
| extract_records_from_jsonl_string | Extract and return tabular records from the given JSONL string. |
| extract_groups_from_jsonl_string | Extract groups of records from the given JSONL string. |
| timed_encode | Wrap an encode callable with timing, or return a no-op. |
| extract_and_validate_records | Extract and validate records from the given JSONL string. |
| extract_and_validate_timeseries_records | Extract and validate sequential records with time-interval constraints. |
| normalize_dataframe | Normalize a DataFrame of generated records via a CSV round-trip. |
| records_to_jsonl | Convert list of records to a JSONL string. |
ParsedRecord(text, parsed=None, error=None, token_count=0)
dataclass
A single record extracted from an LLM completion.
Validity is tracked by the invariant that exactly one of parsed and error is non-None: a valid record has parsed set and error as None; an invalid record has error set and parsed as None. is_valid is the canonical accessor.
text and token_count are captured at extraction time and remain unchanged even if the record is later reclassified via invalidate (e.g. by group-level checks or data-fidelity filters).
Methods:
| Name | Description |
|---|---|
| invalidate | Reclassify this record as invalid. |
Attributes:
| Name | Type | Description |
|---|---|---|
| text | str | Original regex-matched JSON string (invariant under reclassification). |
| parsed | dict \| None | Parsed dict when validation succeeded, None when invalid. |
| error | tuple[str, str] \| None | (detailed_msg, validator) when invalid, None when valid. |
| token_count | int | Number of tokens in text; 0 when no tokenizer was provided. |
| is_valid | bool | Return True when this record passed validation. |
text
instance-attribute
Original regex-matched JSON string (invariant under reclassification).
parsed = None
class-attribute
instance-attribute
Parsed dict when validation succeeded, None when invalid.
error = None
class-attribute
instance-attribute
(detailed_msg, validator) when invalid, None when valid.
token_count = 0
class-attribute
instance-attribute
Number of tokens in text; 0 when no tokenizer was provided.
is_valid
property
Return True when this record passed validation.
invalidate(error)
Reclassify this record as invalid.
text and token_count are kept intact; parsed is cleared so downstream consumers don't accidentally use a stale dict.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| error | tuple[str, str] | (detailed_msg, validator) tuple to store as error. | required |
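The invariant and the effect of invalidate can be illustrated with a small stand-in dataclass. This is a minimal sketch that mirrors the documented fields, not the library's source; the class name ParsedRecordSketch and the validator label "group_check" are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ParsedRecordSketch:
    """Illustrative stand-in for ParsedRecord (hypothetical, not the real class)."""

    text: str
    parsed: Optional[dict] = None
    error: Optional[Tuple[str, str]] = None
    token_count: int = 0

    @property
    def is_valid(self) -> bool:
        # A valid record has parsed set and error as None.
        return self.parsed is not None and self.error is None

    def invalidate(self, error: Tuple[str, str]) -> None:
        # text and token_count stay as captured; parsed is cleared so a
        # stale dict cannot be used after reclassification.
        self.parsed = None
        self.error = error


record = ParsedRecordSketch(text='{"a": 1}', parsed={"a": 1}, token_count=5)
was_valid = record.is_valid

# "group_check" is an assumed validator label used only for illustration.
record.invalidate(("group failed an ordering check", "group_check"))
```

After the call, is_valid is False and parsed is None, while text and token_count still hold the values captured at extraction time.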
Source code in src/nemo_safe_synthesizer/data_processing/record_utils.py
ParsedResponse(records=list(), tokenization_time_sec=0.0, prompt_number=None)
dataclass
Parsed result of a single LLM prompt response.
Holds a flat list of ParsedRecord objects (in input order) plus aggregated tokenization timing.
valid_records / invalid_records / errors are convenience views that project the record list into the shapes expected by downstream aggregation code (parsed dicts, original text, and (msg, validator) tuples, respectively).
Attributes:
| Name | Type | Description |
|---|---|---|
| records | list[ParsedRecord] | Per-record extraction + validation outcomes, in input order. |
| tokenization_time_sec | float | Wall-clock seconds spent tokenizing records in this response. |
| prompt_number | int \| None | Index of the prompt within the batch (set by the processor call). |
| valid_records | list[dict] | Parsed dicts for records that passed validation. |
| invalid_records | list[str] | Original text for records that failed validation. |
| errors | list[tuple[str, str]] | (detailed_msg, validator) tuples for each invalid record. |
records = field(default_factory=list)
class-attribute
instance-attribute
Per-record extraction + validation outcomes, in input order.
tokenization_time_sec = 0.0
class-attribute
instance-attribute
Wall-clock seconds spent tokenizing records in this response.
prompt_number = None
class-attribute
instance-attribute
Index of the prompt within the batch (set by the processor call).
valid_records
property
Parsed dicts for records that passed validation.
invalid_records
property
Original text for records that failed validation.
errors
property
(detailed_msg, validator) tuples for each invalid record.
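The three views can be pictured as simple projections over the record list. The following is a minimal sketch under that reading, using hypothetical stand-in classes (RecordSketch, ParsedResponseSketch) and an assumed validator label "json"; the real properties may be implemented differently.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class RecordSketch:
    """Reduced stand-in for ParsedRecord (hypothetical)."""

    text: str
    parsed: Optional[dict] = None
    error: Optional[Tuple[str, str]] = None


@dataclass
class ParsedResponseSketch:
    """Illustrative stand-in for ParsedResponse (hypothetical)."""

    records: List[RecordSketch] = field(default_factory=list)
    tokenization_time_sec: float = 0.0
    prompt_number: Optional[int] = None

    @property
    def valid_records(self) -> List[dict]:
        # Parsed dicts for records without an error.
        return [r.parsed for r in self.records if r.error is None]

    @property
    def invalid_records(self) -> List[str]:
        # Original text for records that carry an error.
        return [r.text for r in self.records if r.error is not None]

    @property
    def errors(self) -> List[Tuple[str, str]]:
        # (detailed_msg, validator) tuples, one per invalid record.
        return [r.error for r in self.records if r.error is not None]


response = ParsedResponseSketch(
    records=[
        RecordSketch(text='{"a": 1}', parsed={"a": 1}),
        RecordSketch(text='{"a": ', error=("could not parse JSON", "json")),
    ]
)
```

Because the views are derived from records, they preserve input order within each projection.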
is_safe_for_float_conversion(value)
Check if a value can be safely converted to float64 without overflow.
Only int values can cause overflow; all other types are considered safe.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| value | str \| int \| float \| None \| list \| dict | The value to check. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if the value can be safely converted to float64, False otherwise. |
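Python integers have arbitrary precision, which is why only int values can overflow on conversion. A minimal sketch of such a check, assuming it relies on the OverflowError that float() raises for oversized integers (the function name below is hypothetical and the real implementation may differ):

```python
def is_safe_for_float_conversion_sketch(value) -> bool:
    """Return False only for ints too large to represent as a float64."""
    if isinstance(value, int):
        try:
            float(value)  # raises OverflowError for integers beyond float64 range
        except OverflowError:
            return False
    # All other types are treated as safe, as the description states.
    return True


small_ok = is_safe_for_float_conversion_sketch(42)
huge_ok = is_safe_for_float_conversion_sketch(10**400)
text_ok = is_safe_for_float_conversion_sketch("10" * 400)
```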
check_record_for_large_numbers(record)
Check if a record contains any numbers that would cause float64 overflow.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| record | dict | Dictionary of field names to values. | required |
Returns:
| Type | Description |
|---|---|
| str \| None | An error message describing the first unsafe value found, or None if all values are safe. |
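A record-level check of this kind can be sketched as a loop that stops at the first unsafe field. This is an illustration only: the function name and the wording of the message are hypothetical, and the sketch inspects top-level values only, which may not match the real function.

```python
from typing import Optional


def check_record_for_large_numbers_sketch(record: dict) -> Optional[str]:
    """Return a message for the first int that overflows float64, else None."""
    for name, value in record.items():
        if isinstance(value, int):
            try:
                float(value)
            except OverflowError:
                # Message wording is an assumption made for this sketch.
                return f"Field {name!r} holds an integer too large for float64"
    return None


clean = check_record_for_large_numbers_sketch({"id": 1, "name": "a"})
flagged = check_record_for_large_numbers_sketch({"id": 1, "big": 10**400})
```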
check_if_records_are_ordered(records, order_by)
Check if the records are in ascending order based on the given order_by column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| records | list[dict] | List of JSONL records. | required |
| order_by | str | Column to check for ordering. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if the records are ordered by the given column, otherwise False. |
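An ascending-order check can be sketched by comparing each value with its successor. The sketch below assumes that equal neighbouring values count as ordered (non-strict ascending); the documentation does not say whether the real function is strict, and the function name is hypothetical.

```python
from typing import List


def check_if_records_are_ordered_sketch(records: List[dict], order_by: str) -> bool:
    """Return True when values of order_by never decrease from one record to the next."""
    values = [record[order_by] for record in records]
    # Assumption: ties are allowed (non-strict ascending order).
    return all(a <= b for a, b in zip(values, values[1:]))


ordered = check_if_records_are_ordered_sketch([{"t": 1}, {"t": 2}, {"t": 2}], "t")
unordered = check_if_records_are_ordered_sketch([{"t": 3}, {"t": 1}], "t")
```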
extract_records_from_jsonl_string(jsonl_string)
Extract and return tabular records from the given JSONL string.
extract_groups_from_jsonl_string(jsonl_string, bos, eos)
Extract groups of records from the given JSONL string.
This function assumes that the complete group of records is enclosed by the given beginning-of-sequence (bos) and end-of-sequence (eos) tokens.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| jsonl_string | str | Single JSONL string containing grouped tabular records. | required |
| bos | str | Beginning-of-sequence token used to identify the start of a group. | required |
| eos | str | End-of-sequence token used to identify the end of a group. | required |
Returns:
| Type | Description |
|---|---|
| list[str] | Substrings matching complete bos/eos-delimited record groups. |
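Delimiter-based group extraction can be sketched with a non-greedy regular expression. The sketch assumes the returned substrings include the delimiters themselves and that an unterminated trailing group is dropped; the tokens "<s>" and "</s>" and the function name are hypothetical.

```python
import re
from typing import List


def extract_groups_sketch(jsonl_string: str, bos: str, eos: str) -> List[str]:
    """Return every substring that starts with bos and ends with the next eos."""
    # re.escape guards against tokens that contain regex metacharacters;
    # DOTALL lets a group span several JSONL lines.
    pattern = re.escape(bos) + r".*?" + re.escape(eos)
    return re.findall(pattern, jsonl_string, flags=re.DOTALL)


text = '<s>{"a": 1}\n{"a": 2}</s><s>{"a": 3}</s><s>{"a": 4}'
groups = extract_groups_sketch(text, "<s>", "</s>")
```

The last group in text has no closing token, so only the first two groups are returned.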
timed_encode(encode)
Wrap an encode callable with timing, or return a no-op.
Returns a function timed(text) that returns (n_tokens, elapsed_seconds). When encode is None, the returned function always returns (0, 0.0).
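The wrapper described above can be sketched as a closure. This is a minimal illustration, assuming time.perf_counter as the clock (the documentation does not name one); the function name is hypothetical and str.split stands in for a real tokenizer.

```python
import time
from typing import Callable, List, Optional, Tuple


def timed_encode_sketch(
    encode: Optional[Callable[[str], List]] = None,
) -> Callable[[str], Tuple[int, float]]:
    """Wrap encode so each call reports (n_tokens, elapsed_seconds)."""
    if encode is None:
        # No tokenizer: return a no-op that always reports zero tokens and zero time.
        def noop(text: str) -> Tuple[int, float]:
            return 0, 0.0

        return noop

    def timed(text: str) -> Tuple[int, float]:
        start = time.perf_counter()
        n_tokens = len(encode(text))
        return n_tokens, time.perf_counter() - start

    return timed


# str.split is a toy tokenizer used only for illustration.
count, elapsed = timed_encode_sketch(str.split)("a b c")
fallback = timed_encode_sketch(None)("a b c")
```

Returning a no-op instead of None lets callers invoke the result unconditionally, with no tokenizer check at each call site.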
extract_and_validate_records(jsonl_string, schema, encode=None)
Extract and validate records from the given JSONL string.
Each regex-matched JSON string is tokenized (when encode is provided) before validation so that exact token counts are available for every record regardless of later reclassification.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| jsonl_string | str | Single JSONL string containing tabular records. | required |
| schema | dict | JSON schema as a dictionary. | required |
| encode | Callable[[str], list[int]] \| None | Optional tokenizer encode callable. When provided, each matched record string is tokenized and its token count is stored on the corresponding ParsedRecord. | None |
Returns:
| Type | Description |
|---|---|
| ParsedResponse | A ParsedResponse whose records have parsed set for valid records and error set for invalid records. |
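The overall flow (match, tokenize, then validate) can be sketched with the standard library alone. Everything here is an assumption made for illustration: the regex only matches flat objects, the schema check is reduced to the "required" key rather than full JSON-schema validation, plain dicts stand in for ParsedRecord, and the validator labels "json" and "schema" are hypothetical.

```python
import json
import re
from typing import Callable, List, Optional


def extract_and_validate_sketch(
    jsonl_string: str, schema: dict, encode: Optional[Callable[[str], List]] = None
) -> List[dict]:
    """Return one outcome dict per regex-matched JSON string, in input order."""
    outcomes = []
    for text in re.findall(r"\{[^{}]*\}", jsonl_string):
        # Tokenize before validation so the count exists for every record.
        token_count = len(encode(text)) if encode is not None else 0
        parsed, error = None, None
        try:
            candidate = json.loads(text)
        except json.JSONDecodeError as exc:
            error = (str(exc), "json")
        else:
            missing = [k for k in schema.get("required", []) if k not in candidate]
            if missing:
                error = (f"missing required fields: {missing}", "schema")
            else:
                parsed = candidate
        outcomes.append(
            {"text": text, "parsed": parsed, "error": error, "token_count": token_count}
        )
    return outcomes


sample = '{"id": 1, "name": "a"}\n{"id": 2}\n{"id": 3, "name": }'
results = extract_and_validate_sketch(sample, {"required": ["id", "name"]}, encode=str.split)
```

Each outcome has exactly one of parsed and error set, matching the ParsedRecord invariant, and token_count is populated for valid and invalid records alike.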
extract_and_validate_timeseries_records(jsonl_string, schema, time_column, interval_seconds, time_format, encode=None)
Extract and validate sequential records with time-interval constraints.
Each regex-matched JSON string is tokenized (when encode is provided) before validation so that exact token counts are captured for both validated and cascade-invalidated records.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| jsonl_string | str | JSONL string containing series data. | required |
| schema | dict | JSON schema describing the records. | required |
| time_column | str | Column containing the timestamp used for interval validation. | required |
| interval_seconds | int \| None | Expected interval in seconds between consecutive timestamps. When … | required |
| time_format | str | Format of the timestamp column (required). | required |
| encode | Callable[[str], list[int]] \| None | Optional tokenizer encode callable. When provided, each matched record string is tokenized and its token count is stored on the corresponding ParsedRecord. | None |
Returns:
| Type | Description |
|---|---|
| ParsedResponse | A ParsedResponse containing the records in input order. Once a record fails, every subsequent record is marked invalid with a cascade error. |
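The cascade rule can be sketched on already-parsed records. This is a simplified illustration, not the library's logic: it assumes the interval must match exactly, that the interval check is skipped when interval_seconds is None (the documentation of that case is not available here), and the validator labels "time_format", "time_interval", and "cascade" are hypothetical.

```python
from datetime import datetime
from typing import List, Optional, Tuple


def validate_intervals_sketch(
    records: List[dict],
    time_column: str,
    interval_seconds: Optional[int],
    time_format: str,
) -> List[Optional[Tuple[str, str]]]:
    """Return None for each valid record, or a (detailed_msg, validator) tuple."""
    outcomes: List[Optional[Tuple[str, str]]] = []
    previous = None
    failed = False
    for record in records:
        if failed:
            # Once one record fails, every later record is invalidated too.
            outcomes.append(("an earlier record in the series failed", "cascade"))
            continue
        try:
            current = datetime.strptime(record[time_column], time_format)
        except (KeyError, TypeError, ValueError) as exc:
            outcomes.append((f"bad timestamp: {exc}", "time_format"))
            failed = True
            continue
        # Assumption: no interval check when interval_seconds is None.
        if previous is not None and interval_seconds is not None:
            gap = (current - previous).total_seconds()
            if gap != interval_seconds:
                outcomes.append((f"expected {interval_seconds}s, got {gap}s", "time_interval"))
                failed = True
                continue
        outcomes.append(None)
        previous = current
    return outcomes


series = [{"ts": "00:00:00"}, {"ts": "00:01:00"}, {"ts": "00:03:00"}, {"ts": "00:04:00"}]
outcomes = validate_intervals_sketch(series, "ts", 60, "%H:%M:%S")
```

In this example the third record breaks the 60-second interval, so the fourth is marked with a cascade error even though its own gap to the third record is 60 seconds.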
normalize_dataframe(dataframe)
Normalize a DataFrame of generated records via a CSV round-trip.
Serializes to CSV and reads back to standardize missing-value representations (NaN/None/NA) across mixed-type columns. Falls back to ignoring encoding errors if the initial round-trip fails.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataframe | DataFrame | DataFrame to normalize. | required |
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with missing values normalized and invalid UTF-8 characters dropped. |
records_to_jsonl(records)
Convert list of records to a JSONL string.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| records | DataFrame \| list[dict] \| dict | DataFrame, list of records, or dict. | required |
Returns:
| Type | Description |
|---|---|
| str | The JSONL string. |
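JSONL serialization amounts to one JSON document per line. The sketch below covers only the list and dict inputs and omits the DataFrame branch to stay within the standard library; it also assumes no trailing newline, which the documentation does not specify. The function name is hypothetical.

```python
import json
from typing import List, Union


def records_to_jsonl_sketch(records: Union[List[dict], dict]) -> str:
    """Serialize a dict or a list of dicts as newline-separated JSON."""
    if isinstance(records, dict):
        # A single record is treated as a one-element list.
        records = [records]
    return "\n".join(json.dumps(record) for record in records)


jsonl = records_to_jsonl_sketch([{"a": 1}, {"a": 2}])
single = records_to_jsonl_sketch({"a": 1})
```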