utils
Shared utilities for Safe Synthesizer.
Provides schema prompt creation, statistics logging, file I/O helpers, data loading, and general-purpose functions used across the pipeline.
Functions:
| Name | Description |
|---|---|
| `create_schema_prompt` | Create the schema prompt from column names and a template. |
| `get_random_number_generator` | Return a random number generator with the given seed. |
| `log_stats` | Log aggregated statistics as a structured table. |
| `log_training_example_stats` | Log training example statistics from the given dictionary. |
| `round_number_if_float` | Round the number to the given precision if it is a float. |
| `smart_read_table` | Load tabular data from a file path, or return an existing DataFrame. |
| `time_function` | Decorator to log the time taken by a function to execute. |
| `grouped_train_test_split` | Split a HuggingFace Dataset preserving group membership. |
| `debug_fmt` | Format DataFrames for debugging data actions. |
| `merge_dicts` | Deep-merge two dicts, preferring values from `new` on conflict. |
| `is_iterable` | Check whether `x` is iterable. |
| `flatten` | Flatten a possibly nested iterable. |
| `all_equal_type` | Check whether every element in an iterable is an instance of `type_`. |
| `write_json` | Write a dictionary to a JSON file, creating parent directories as needed. |
| `load_json` | Load a JSON file and return its content as a dict. |
create_schema_prompt(columns, instruction, prompt_template, prefill='', exclude_columns=None)
Create the schema prompt from column names and a template.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `columns` | `list[str]` | List of column names to include in the schema. | required |
| `instruction` | `str` | Instruction text placed before the schema. | required |
| `prompt_template` | `str` | Template string with placeholders. | required |
| `prefill` | `str` | Optional text appended after the schema. | `''` |
| `exclude_columns` | `list[str] \| None` | Column names to omit from the schema. | `None` |
Returns:
| Type | Description |
|---|---|
| `str` | The formatted prompt string. |
Source code in src/nemo_safe_synthesizer/utils.py
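How the arguments fit together can be illustrated with a minimal sketch. It is not the code in `utils.py`: the placeholder names `{instruction}`, `{schema}`, and `{prefill}` and the comma-separated schema layout are assumptions made for illustration, because the template's actual placeholders are not listed on this page.

```python
from __future__ import annotations


def create_schema_prompt_sketch(
    columns: list[str],
    instruction: str,
    prompt_template: str,
    prefill: str = "",
    exclude_columns: list[str] | None = None,
) -> str:
    """Hypothetical stand-in for create_schema_prompt (assumed behavior)."""
    excluded = set(exclude_columns or [])
    # Keep the original column order while dropping excluded names.
    kept = [col for col in columns if col not in excluded]
    schema = ",".join(kept)  # assumed schema layout
    # Assumed placeholder names; the real template may use different ones.
    return prompt_template.format(
        instruction=instruction, schema=schema, prefill=prefill
    )


template = "{instruction}\n{schema}\n{prefill}"  # hypothetical template
prompt = create_schema_prompt_sketch(
    columns=["id", "age", "city"],
    instruction="Generate rows with these columns:",
    prompt_template=template,
    exclude_columns=["id"],
)
print(prompt)
```

In this sketch `exclude_columns` filters the schema before formatting, so the excluded `id` column never reaches the prompt.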
get_random_number_generator(seed)
log_stats(stats, headers=None, title=None)
Log aggregated statistics as a structured table.
Console output is rendered as a Rich ASCII table by the structlog processor; JSON logs receive structured key/value pairs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `stats` | `Statistics \| list[Statistics]` | One or more `Statistics` objects to log. | required |
| `headers` | `list[str] \| None` | Column headers, one per entry in `stats`. | `None` |
| `title` | `str \| None` | Optional table title. | `None` |
log_training_example_stats(stats_dict, **kwargs)
Log training example statistics from the given dictionary.
round_number_if_float(number, precision=3)
Round the number to the given precision if it is a float.
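The described behavior amounts to a type check followed by a call to `round`. The sketch below is a hypothetical stand-in written from the one-line description above, not the library source; in particular, returning non-float values unchanged is an assumption.

```python
def round_number_if_float_sketch(number, precision=3):
    """Hypothetical stand-in: round floats, pass everything else through."""
    if isinstance(number, float):
        return round(number, precision)
    # Assumption: ints, strings, and other values are returned unchanged.
    return number


rounded = round_number_if_float_sketch(3.14159)      # float: rounded to 3 places
untouched_int = round_number_if_float_sketch(7)      # int: returned as-is
untouched_str = round_number_if_float_sketch("n/a")  # str: returned as-is
```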
smart_read_table(df_or_path)
Load tabular data from a file path, or return an existing DataFrame.
Supported formats: CSV, JSON, JSONL, and Parquet.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df_or_path` | `str \| Path \| DataFrame` | A file path (`str` or `Path`) or an existing `DataFrame`. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | The loaded (or passed-through) `DataFrame`. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If the file extension is not supported. |
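One plausible way to implement this kind of loader is to dispatch on the file suffix. The sketch below is an assumption-laden illustration, not the library's implementation: the suffix-to-reader mapping, the error message, and the choice of pandas readers are all guesses based on the supported formats listed above.

```python
from pathlib import Path


def smart_read_table_sketch(df_or_path):
    """Hypothetical stand-in for smart_read_table (assumed dispatch logic)."""
    # Assumption: anything that is not a path is already a DataFrame.
    if not isinstance(df_or_path, (str, Path)):
        return df_or_path

    path = Path(df_or_path)
    suffix = path.suffix.lower()
    if suffix not in {".csv", ".json", ".jsonl", ".parquet"}:
        raise ValueError(f"Unsupported file extension: {suffix!r}")

    # Imported lazily so the checks above do not require pandas.
    import pandas as pd

    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix == ".json":
        return pd.read_json(path)
    if suffix == ".jsonl":
        # JSON Lines: one JSON object per line.
        return pd.read_json(path, lines=True)
    return pd.read_parquet(path)
```

Validating the suffix before touching the file means an unsupported extension fails fast with `ValueError`, matching the documented error case.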
time_function(func)
Decorator to log the time taken by a function to execute.
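A timing decorator of this kind is commonly built from `functools.wraps` and `time.perf_counter`. The following is a minimal sketch under that assumption; the logger, the message format, and the name `time_function_sketch` are hypothetical and do not reflect the library's actual logging setup.

```python
import functools
import logging
import time

logger = logging.getLogger(__name__)


def time_function_sketch(func):
    """Hypothetical stand-in: log how long each call to func takes."""

    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs whether func returned or raised.
            elapsed = time.perf_counter() - start
            logger.info("%s took %.3f seconds", func.__name__, elapsed)

    return wrapper


@time_function_sketch
def add(a, b):
    return a + b


result = add(2, 3)
```

Logging in a `finally` block records the duration even when the wrapped function raises.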
grouped_train_test_split(dataset, test_size, group_by, seed=None)
Split a HuggingFace Dataset preserving group membership.
Currently unused. Converts the dataset to a pandas DataFrame and
delegates to holdout.grouped_train_test_split.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `dataset` | `Dataset` | The HuggingFace `Dataset` to split. | required |
| `test_size` | `float` | Fraction or absolute number of test rows. | required |
| `group_by` | `str \| list[str]` | Column name or list of column names defining groups. | required |
| `seed` | `int \| None` | Random state for reproducibility. | `None` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | First element of the returned tuple. |
| `DataFrame \| None` | Second element of the returned tuple; `None` on failure. |
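"Preserving group membership" means that all rows sharing a group key end up on the same side of the split. The sketch below shows the idea in plain Python over a list of dict rows. It is not `holdout.grouped_train_test_split`: the function name is hypothetical, and treating `test_size` as a fraction of groups (rather than of rows) is a simplifying assumption.

```python
import random


def grouped_split_sketch(rows, test_size, group_by, seed=None):
    """Hypothetical group-preserving split over a list of dict rows."""
    keys = [group_by] if isinstance(group_by, str) else list(group_by)

    # Collect rows per group, keyed by the tuple of group column values.
    groups = {}
    for row in rows:
        groups.setdefault(tuple(row[k] for k in keys), []).append(row)

    group_ids = sorted(groups)  # deterministic order before shuffling
    random.Random(seed).shuffle(group_ids)

    # Assumption: test_size is a fraction of the number of groups.
    n_test = max(1, round(len(group_ids) * test_size))
    test_ids = set(group_ids[:n_test])

    # Whole groups go to one side, so no group is ever divided.
    train = [r for gid in group_ids if gid not in test_ids for r in groups[gid]]
    test = [r for gid in group_ids if gid in test_ids for r in groups[gid]]
    return train, test


rows = [{"user": u, "value": i} for i, u in enumerate("aabbccdd")]
train, test = grouped_split_sketch(rows, test_size=0.25, group_by="user", seed=0)
```

With four groups of two rows each and `test_size=0.25`, one whole group lands in the test set and the other three in the training set.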
debug_fmt(df)
merge_dicts(base, new)
Deep-merge two dicts, preferring values from `new` on conflict.
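A deep merge recurses into keys whose values are dicts on both sides and otherwise lets the `new` value win. The sketch below illustrates that rule; it is a hypothetical stand-in, and returning a fresh dict instead of mutating `base` is an assumption about behavior this page does not specify.

```python
def merge_dicts_sketch(base, new):
    """Hypothetical stand-in: recursively merge new into a copy of base."""
    merged = dict(base)  # shallow copy so base itself is not modified
    for key, value in new.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge them key by key.
            merged[key] = merge_dicts_sketch(merged[key], value)
        else:
            # Conflict or new key: the value from `new` wins.
            merged[key] = value
    return merged


base = {"model": {"name": "small", "layers": 2}, "seed": 1}
new = {"model": {"layers": 4}, "debug": True}
merged = merge_dicts_sketch(base, new)
```

Here `merged["model"]` keeps `name` from `base` while `layers` takes the value from `new`, which a shallow `{**base, **new}` would not do.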
is_iterable(x)
flatten(iter)
Flatten a possibly nested iterable.
Strings are yielded as-is (not broken into characters). Dicts are yielded whole with a warning since flattening them is not meaningful.
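The rules above (recurse into nested iterables, keep strings intact, yield dicts whole with a warning) can be sketched as a recursive generator. This is an illustration rather than the library source: using `warnings.warn` for the dict warning and treating `bytes` like strings are assumptions.

```python
import warnings
from collections.abc import Iterable


def flatten_sketch(items):
    """Hypothetical stand-in: lazily flatten a possibly nested iterable."""
    for item in items:
        if isinstance(item, (str, bytes)):
            # Strings are yielded as-is, not split into characters.
            yield item
        elif isinstance(item, dict):
            # Assumption: the warning mechanism; the library may log instead.
            warnings.warn("dict encountered while flattening; yielding it whole")
            yield item
        elif isinstance(item, Iterable):
            yield from flatten_sketch(item)
        else:
            yield item


flat = list(flatten_sketch([1, [2, (3, "ab")], [[4]]]))
```

The string check comes first because strings are themselves iterable, and recursing into them would never terminate on single characters.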
all_equal_type(iter, type_, flatten_iter=True)
Check whether every element in an iterable is an instance of `type_`.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `iter` | | The iterable to check. | required |
| `type_` | | The type to check against. | required |
| `flatten_iter` | | If `True`, flatten `iter` before checking. | `True` |
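The effect of `flatten_iter` is easiest to see on nested input. The sketch below is a hypothetical stand-in that assumes flattening happens before the `isinstance` check when `flatten_iter` is true; the helper `_flatten` is invented here for self-containment and is not the library's `flatten`.

```python
from collections.abc import Iterable


def _flatten(items):
    """Hypothetical helper: flatten nested iterables, keeping str/bytes/dict whole."""
    for item in items:
        if isinstance(item, Iterable) and not isinstance(item, (str, bytes, dict)):
            yield from _flatten(item)
        else:
            yield item


def all_equal_type_sketch(items, type_, flatten_iter=True):
    """Hypothetical stand-in for all_equal_type (assumed semantics)."""
    elements = _flatten(items) if flatten_iter else items
    return all(isinstance(element, type_) for element in elements)


nested_ok = all_equal_type_sketch([1, [2, 3]], int)
nested_raw = all_equal_type_sketch([1, [2, 3]], int, flatten_iter=False)
mixed = all_equal_type_sketch([1, "a"], int)
```

Without flattening, the inner list `[2, 3]` is itself tested against `int` and fails, so `nested_raw` is `False` while `nested_ok` is `True`.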
write_json(data, path, encoding=None, indent=None)
Write a dictionary to a JSON file, creating parent directories as needed.
load_json(path, encoding=None)
Load a JSON file and return its content as a dict.
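`write_json` and `load_json` form a round trip. The sketch below mirrors the documented signatures with standard-library calls only; it is an approximation written from the descriptions on this page, and passing `encoding` and `indent` straight through to `open` and `json.dump` is an assumption.

```python
import json
import tempfile
from pathlib import Path


def write_json_sketch(data, path, encoding=None, indent=None):
    """Hypothetical stand-in: write data as JSON, creating parent directories."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding=encoding) as handle:
        json.dump(data, handle, indent=indent)


def load_json_sketch(path, encoding=None):
    """Hypothetical stand-in: read a JSON file back into Python objects."""
    with Path(path).open("r", encoding=encoding) as handle:
        return json.load(handle)


with tempfile.TemporaryDirectory() as tmp:
    # The nested directories do not exist yet; the writer creates them.
    target = Path(tmp) / "nested" / "dir" / "config.json"
    write_json_sketch({"seed": 7, "columns": ["a", "b"]}, target, indent=2)
    loaded = load_json_sketch(target)
```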