timeseries_preprocessing

Time series preprocessing utilities for Safe Synthesizer training.

Functions:

Name                     Description
process_timeseries_data  Process time series data and validate/infer timestamp parameters.

process_timeseries_data(df_all, config)

Process time series data and validate/infer timestamp parameters.

This function:

1. Creates a timestamp column if one doesn't exist
2. Validates the timestamp column exists and has no missing values
3. Sorts the data by timestamp
4. Infers timestamp_format from the data
5. Validates or infers timestamp_interval_seconds
6. Sets start_timestamp and stop_timestamp
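The first three steps can be sketched with plain pandas. This is a minimal illustration, not the library's implementation: the helper name `prepare_timestamps`, the `elapsed_seconds` column name, and the use of a per-group row counter as elapsed time are all assumptions made for the example, and built-in exceptions stand in for the library's own error types.

```python
from typing import Optional, Tuple

import pandas as pd


def prepare_timestamps(
    df: pd.DataFrame, group_col: str, ts_col: Optional[str] = None
) -> Tuple[pd.DataFrame, str]:
    """Create an elapsed-time column if needed, then validate and sort."""
    df = df.copy()
    if ts_col is None:
        # Assumption: rows are already in time order within each group,
        # so a per-group row counter can act as elapsed time.
        ts_col = "elapsed_seconds"  # hypothetical column name
        df[ts_col] = df.groupby(group_col).cumcount()
    if ts_col not in df.columns:
        raise KeyError(f"timestamp column {ts_col!r} not found")
    if df[ts_col].isna().any():
        raise ValueError(f"timestamp column {ts_col!r} has missing values")
    # Sort by group first, then by time within each group.
    df = df.sort_values([group_col, ts_col]).reset_index(drop=True)
    return df, ts_col


raw = pd.DataFrame(
    {
        "sensor": ["b", "a", "b", "a"],
        "reading": [10, 20, 11, 21],
    }
)
prepared, ts_name = prepare_timestamps(raw, group_col="sensor")
print(prepared)
```

Because no timestamp column is passed, the sketch numbers the rows within each `sensor` group and then sorts by group and counter, so each group's rows end up contiguous and in their original relative order.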

Parameters:

Name    Type                       Description                                         Default
df_all  DataFrame                  The input DataFrame                                 required
config  SafeSynthesizerParameters  The configuration object with time_series settings  required

Returns:

Type                                         Description
tuple[DataFrame, SafeSynthesizerParameters]  Tuple of (processed DataFrame, updated config)

Raises:

Type            Description
ParameterError  If timestamp column is not found
DataError       If timestamp column has missing values or intervals are inconsistent
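To make "intervals are inconsistent" concrete, the sketch below shows one way such a check could look. It is an assumption-laden illustration: the `DataError` class defined here is a local stand-in for the library's exception, `check_consistent_interval` is a hypothetical helper, and the real check may apply different rules or tolerances.

```python
import pandas as pd


class DataError(ValueError):
    """Local stand-in for the library's DataError (hypothetical)."""


def check_consistent_interval(df: pd.DataFrame, group_col: str, ts_col: str) -> float:
    """Return the single spacing (in seconds) shared by every group, or raise."""
    intervals = set()
    for _, group in df.groupby(group_col):
        # Spacing between consecutive timestamps within this group only.
        deltas = group[ts_col].sort_values().diff().dropna().dt.total_seconds()
        intervals.update(deltas.tolist())
    if not intervals:
        raise DataError("need at least two rows in some group to infer an interval")
    if len(intervals) != 1:
        raise DataError(f"inconsistent intervals (seconds): {sorted(intervals)}")
    return intervals.pop()


good = pd.DataFrame(
    {
        "sensor": ["a", "a", "a", "b", "b"],
        "ts": pd.to_datetime(
            [
                "2024-01-01 00:00:00",
                "2024-01-01 00:01:00",
                "2024-01-01 00:02:00",
                "2024-01-01 06:00:00",
                "2024-01-01 06:01:00",
            ]
        ),
    }
)
interval = check_consistent_interval(good, "sensor", "ts")
print(interval)
```

Differences are taken inside each group, so the six-hour gap between the last row of sensor `a` and the first row of sensor `b` does not count as an inconsistency.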

Source code in src/nemo_safe_synthesizer/training/timeseries_preprocessing.py
def process_timeseries_data(
    df_all: pd.DataFrame,
    config: SafeSynthesizerParameters,
) -> tuple[pd.DataFrame, SafeSynthesizerParameters]:
    """Process time series data and validate/infer timestamp parameters.

    This function:
    1. Creates a timestamp column if one doesn't exist
    2. Validates the timestamp column exists and has no missing values
    3. Sorts the data by timestamp
    4. Infers timestamp_format from the data
    5. Validates or infers timestamp_interval_seconds
    6. Sets start_timestamp and stop_timestamp

    Args:
        df_all: The input DataFrame
        config: The configuration object with time_series settings

    Returns:
        Tuple of (processed DataFrame, updated config)

    Raises:
        ParameterError: If timestamp column is not found
        DataError: If timestamp column has missing values or intervals are inconsistent
    """
    ts_config = config.time_series

    # Step 1: Add pseudo-group if needed
    df_all, group_by_col = _add_pseudo_group_if_needed(df_all, config)

    if group_by_col is None:
        raise RuntimeError("group_by_col should have been set by _add_pseudo_group_if_needed")

    # Step 2: Create elapsed time column if timestamp not provided
    df_all, is_elapsed_time = _create_elapsed_time_column(df_all, ts_config, group_by_col)

    # timestamp_column should be set by now
    if ts_config.timestamp_column is None:
        raise RuntimeError("timestamp_column should have been set by _create_elapsed_time_column")
    config.data.order_training_examples_by = ts_config.timestamp_column

    # Step 3: Validate timestamp column
    _validate_timestamp_column(df_all, ts_config.timestamp_column)

    # Step 4: Sort by group and timestamp
    df_all = _sort_by_group_and_timestamp(df_all, group_by_col, ts_config.timestamp_column)

    # Step 5: Infer format and convert to datetime (if not elapsed time)
    # Skip datetime conversion for elapsed_seconds format (either created or user-provided)
    if not is_elapsed_time and ts_config.timestamp_format != "elapsed_seconds":
        df_all = _infer_and_convert_timestamp_format(df_all, ts_config)

    # Step 6: Process groups and validate consistency
    ts_config = _process_grouped_timestamps(df_all, ts_config, group_by_col, is_elapsed_time)

    # Step 7: Convert timestamp back to string format
    # Skip string conversion for elapsed_seconds format (values are already numeric)
    if (
        not is_elapsed_time
        and ts_config.timestamp_format is not None
        and ts_config.timestamp_format != "elapsed_seconds"
    ):
        df_all[ts_config.timestamp_column] = df_all[ts_config.timestamp_column].dt.strftime(ts_config.timestamp_format)

    return df_all, config
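
Steps 5 and 7 in the source above convert the timestamp column to datetime for processing and then back to strings with the same format. The round trip can be sketched as follows. The column name `ts` and the format string are assumptions chosen for the example, and the format is given explicitly here rather than inferred as the library does.

```python
import pandas as pd

fmt = "%Y-%m-%d %H:%M:%S"  # assumed format; the library infers this from the data
df = pd.DataFrame({"ts": ["2024-03-01 12:00:00", "2024-03-01 12:00:30"]})
original = df["ts"].tolist()

# Parse strings into datetimes so that arithmetic on timestamps is possible.
df["ts"] = pd.to_datetime(df["ts"], format=fmt)
interval_seconds = (df["ts"].iloc[1] - df["ts"].iloc[0]).total_seconds()

# Format back to strings with the same format, as step 7 does via .dt.strftime.
df["ts"] = df["ts"].dt.strftime(fmt)
print(df["ts"].tolist(), interval_seconds)
```

The round trip reproduces the input only when the input strings already match what `strftime` emits for that format. A value such as `2024-3-1 12:00:00` would parse with `%m` and `%d` but come back zero-padded.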