data

Classes:

Name            Description
DataParameters  Configuration for grouping, ordering, and splitting input data for training and evaluation.

DataParameters pydantic-model

Bases: Parameters

Configuration for grouping, ordering, and splitting input data for training and evaluation.

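Fields: group_training_examples_by, order_training_examples_by, max_sequences_per_example, holdout, max_holdout, random_state

Validators: set_random_state_if_none (applies to random_state)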

group_training_examples_by pydantic-field

Column to group training examples by. This is useful when you want the model to learn inter-record correlations for a given grouping of records.

order_training_examples_by pydantic-field

Column to order training examples by. This is useful when you want the model to learn sequential relationships for a given ordering of records. If you provide this parameter, you must also provide group_training_examples_by.
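
To make the interaction between the two columns concrete, here is a minimal pure-Python sketch of grouping records and then ordering them within each group. It does not reproduce the library's implementation: the helper group_and_order and the sample columns (patient_id, visit, bp) are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def group_and_order(records, group_by, order_by=None):
    """Bucket records by `group_by`, optionally sorting each bucket by `order_by`.

    Hypothetical helper for illustration only; not part of the library.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record[group_by]].append(record)
    if order_by is not None:
        # Ordering is applied inside each group, which is why an ordering
        # column only makes sense together with a grouping column.
        for rows in groups.values():
            rows.sort(key=lambda row: row[order_by])
    return dict(groups)


records = [
    {"patient_id": "a", "visit": 2, "bp": 130},
    {"patient_id": "b", "visit": 1, "bp": 118},
    {"patient_id": "a", "visit": 1, "bp": 125},
]
examples = group_and_order(records, group_by="patient_id", order_by="visit")
```

Each value in examples then holds one group's records in visit order, which is the shape a model needs in order to see sequential relationships within a group.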

max_sequences_per_example pydantic-field

If specified, adds at most this number of sequences per example. Supports 'auto', which selects 1 if differential privacy (DP) is enabled and 10 otherwise. If not specified or set to 'auto', fills up context. Required for DP, where it limits the contribution of each example.

holdout pydantic-field

Number or fraction of records to hold out for evaluation. A float between 0 and 1 holds out that fraction of records. An integer greater than 1 holds out that number of records. A value of 0 disables the holdout. Must be >= 0.

max_holdout pydantic-field

Maximum number of records to hold out. Overrides any behavior set by holdout. Must be >= 0.
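
The way holdout and max_holdout combine can be sketched as a small function. This is an illustration under stated assumptions, not the library's code: the name resolve_holdout_size is hypothetical, fractional sizes are rounded down, max_holdout is read as an upper cap on the result, and any value that is not a float strictly between 0 and 1 is treated as a record count.

```python
def resolve_holdout_size(n_records, holdout, max_holdout=None):
    """Return how many records to hold out (hypothetical helper)."""
    if holdout < 0:
        raise ValueError("holdout must be >= 0")
    if max_holdout is not None and max_holdout < 0:
        raise ValueError("max_holdout must be >= 0")
    if holdout == 0:
        return 0  # a value of zero disables the holdout
    if isinstance(holdout, float) and 0 < holdout < 1:
        # Assumption: a fractional holdout is rounded down.
        size = int(n_records * holdout)
    else:
        # Assumption: anything else is an absolute record count.
        size = int(holdout)
    size = min(size, n_records)  # cannot hold out more than exists
    if max_holdout is not None:
        # Assumption: max_holdout caps whatever holdout asked for.
        size = min(size, max_holdout)
    return size


ratio_size = resolve_holdout_size(1000, 0.25)
count_size = resolve_holdout_size(1000, 200)
capped_size = resolve_holdout_size(1000, 0.25, max_holdout=100)
```

Under these assumptions the three calls yield 250, 200, and 100 records respectively.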

random_state pydantic-field

Random state for holdout split to ensure reproducibility.
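
As a rough illustration of why a fixed random state makes the split reproducible, the sketch below seeds a standard-library random.Random instance and uses it to pick the held-out rows. The function split_holdout and its parameters are hypothetical, and the library's actual splitting logic may differ.

```python
import random


def split_holdout(records, holdout_size, random_state):
    """Split records into (train, test) using a seeded shuffle (hypothetical helper)."""
    rng = random.Random(random_state)  # private generator, independent of global state
    indices = list(range(len(records)))
    rng.shuffle(indices)
    held_out = set(indices[:holdout_size])
    train = [r for i, r in enumerate(records) if i not in held_out]
    test = [r for i, r in enumerate(records) if i in held_out]
    return train, test


records = list(range(20))
first = split_holdout(records, 5, random_state=42)
second = split_holdout(records, 5, random_state=42)
```

Because both calls use the same seed, first and second contain identical train and test partitions. A different seed will generally select different rows.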

set_random_state_if_none(v) pydantic-validator

Generate a random state if none was provided.

Source code in src/nemo_safe_synthesizer/config/data.py
@field_validator("random_state", mode="after", check_fields=False)
@classmethod
def set_random_state_if_none(cls, v: int | None) -> int:
    """Generate a random state if none was provided."""
    import random

    if v is None:
        # No seed supplied: draw one from [0, 1000000], endpoints included.
        return random.randint(0, 1000000)
    return v