autoconfig

Resolve "auto" sentinel values in config parameters to concrete values.

Inspects dataset characteristics (token counts, record counts) to replace "auto" placeholders in SafeSynthesizerParameters with computed values for rope scaling factor, number of input records to sample, delta, and other training/privacy parameters.

Classes:

| Name | Description |
| --- | --- |
| AutoConfigResolver | Resolve all "auto" sentinel values in SafeSynthesizerParameters. |

Functions:

| Name | Description |
| --- | --- |
| choose_num_input_records_to_sample | Scale training records linearly with the rope scaling factor. |
| get_max_token_count | Estimate the maximum tokens per training example. |
| choose_rope_scaling_factor | Compute the RoPE scaling factor from the estimated max token count. |

AutoConfigResolver(data, config)

Resolve all "auto" sentinel values in SafeSynthesizerParameters.

Inspects the training dataset to compute concrete values for parameters left as "auto" (rope scaling, number of input records, unsloth, delta, max sequences per example). Resolution order matters: rope_scaling_factor is resolved first because num_input_records_to_sample depends on it.
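The order dependency can be sketched with plain values. This is a simplified illustration over a dict, not the resolver itself; the 4096-token base window is an assumed value here, while the ceiling division and the 25,000 multiplier mirror the formulas documented for choose_rope_scaling_factor and choose_num_input_records_to_sample.

```python
import math

# Hypothetical mini-resolution over a plain dict (real parameters live in
# SafeSynthesizerParameters, and the base window is a package constant).
config = {"rope_scaling_factor": "auto", "num_input_records_to_sample": "auto"}
max_token_count = 9_000  # e.g. from get_max_token_count(data, group_by=None)

# Step 1: rope_scaling_factor must be resolved first ...
if config["rope_scaling_factor"] == "auto":
    config["rope_scaling_factor"] = math.ceil(max_token_count / 4096)  # -> 3

# Step 2: ... because num_input_records_to_sample depends on it.
if config["num_input_records_to_sample"] == "auto":
    config["num_input_records_to_sample"] = config["rope_scaling_factor"] * 25_000  # -> 75_000
```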

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | DataFrame | Training dataframe used to derive auto parameters. | required |
| config | SafeSynthesizerParameters | Configuration containing "auto" sentinel values to resolve. | required |

Methods:

| Name | Description |
| --- | --- |
| resolve | Replace all "auto" parameters with concrete values. |

Source code in src/nemo_safe_synthesizer/config/autoconfig.py

```python
def __init__(self, data: pd.DataFrame, config: SafeSynthesizerParameters):
    self._data = data
    self._config = config
    self._record_count = data.shape[0]
    self._delta: float | str | None = config.get("delta")
    self._dp_enabled: bool | None = config.get("dp_enabled")
    self._rope_scaling_factor: int | None = None
```

resolve()

Replace all "auto" parameters with concrete values.

Resolution order matters: rope_scaling_factor is resolved before num_input_records_to_sample because the latter depends on it.

Returns:

| Type | Description |
| --- | --- |
| SafeSynthesizerParameters | A new SafeSynthesizerParameters with all "auto" values resolved. |

Source code in src/nemo_safe_synthesizer/config/autoconfig.py

```python
def resolve(self) -> SafeSynthesizerParameters:
    """Replace all ``"auto"`` parameters with concrete values.

    Resolution order matters: ``rope_scaling_factor`` is resolved before
    ``num_input_records_to_sample`` because the latter depends on it.

    Returns:
        A new ``SafeSynthesizerParameters`` with all ``"auto"`` values resolved.
    """
    # Determine training params (order matters: rope_scaling_factor first)
    training_params: dict[str, Any] = {}
    training_params.update(self._determine_rope_scaling_factor())
    training_params.update(self._determine_num_input_records_to_sample())
    training_params.update(self._determine_use_unsloth())
    training_params.update(self._determine_learning_rate())

    # Determine data params
    data_params: dict[str, Any] = {}
    data_params.update(self._determine_max_sequences_per_example())

    # Determine privacy params
    privacy_params: dict[str, Any] = {}
    privacy_params.update(self._determine_delta())

    return self._build_updated_params(training_params, data_params, privacy_params)
```

choose_num_input_records_to_sample(rope_scaling_factor)

Scale training records linearly with the rope scaling factor.

num_records = rope_scaling_factor * 25000

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| rope_scaling_factor | int | The RoPE scaling multiplier (1 means no scaling). | required |

Returns:

| Type | Description |
| --- | --- |
| int | Number of records to sample for training. |

Source code in src/nemo_safe_synthesizer/config/autoconfig.py

```python
def choose_num_input_records_to_sample(rope_scaling_factor: int) -> int:
    """Scale training records linearly with the rope scaling factor.

    ``num_records = rope_scaling_factor * 25000``

    Args:
        rope_scaling_factor: The RoPE scaling multiplier (1 means no scaling).

    Returns:
        Number of records to sample for training.
    """
    return rope_scaling_factor * 25_000
```

get_max_token_count(data, group_by)

Estimate the maximum tokens per training example.

Accounts for prompt overhead (~40 tokens), column names (repeated in JSON formatting), and content character counts. Digits are counted as one token each; other characters use a 4-chars-per-token heuristic (Llama-2 tokenizer). Samples up to 5,000 records from data for analysis.
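In isolation, the digit-vs-text heuristic looks like this. This is a simplified sketch (the function name is hypothetical); the real function, shown in full below, additionally accounts for prompt, column-name, and JSON-formatting overhead and operates on whole dataframe rows.

```python
import re

def estimate_tokens(text: str) -> float:
    """Rough Llama-2-style estimate: 1 token per digit, 4 chars per token otherwise."""
    non_digits = re.sub(r"\d", "", text)
    return len(non_digits) / 4 + (len(text) - len(non_digits))

estimate_tokens("price 2024")  # "price " is 6 chars -> 1.5 tokens, plus 4 digits -> 5.5
```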

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | DataFrame | Training dataframe to analyze. | required |
| group_by | list[str] \| str \| None | Column(s) used to group records into single training examples. When set, grouped records are concatenated before token estimation. | required |

Returns:

| Type | Description |
| --- | --- |
| int | Estimated maximum token count across all sampled training examples, or 1 if the dataframe is empty. |

Source code in src/nemo_safe_synthesizer/config/autoconfig.py

```python
def get_max_token_count(data: pd.DataFrame, group_by: list[str] | str | None) -> int:
    """Estimate the maximum tokens per training example.

    Accounts for prompt overhead (~40 tokens), column names (repeated in JSON
    formatting), and content character counts. Digits are counted as one token
    each; other characters use a 4-chars-per-token heuristic (Llama-2 tokenizer).
    Samples up to 5,000 records from ``data`` for analysis.

    Args:
        data: Training dataframe to analyze.
        group_by: Column(s) used to group records into single training examples.
            When set, grouped records are concatenated before token estimation.

    Returns:
        Estimated maximum token count across all sampled training examples,
        or 1 if the dataframe is empty.
    """
    if data.size == 0:
        return 1

    # Limit to 5k records to keep run time under 5 seconds
    if data.shape[0] > 5000:
        if group_by:
            # Sort by group_by so that we don't break up groups
            data = data.sort_values(group_by)
        data = data.head(5000)

    counts = pd.DataFrame()
    # Estimate the character count introduced by the column names,
    # counting the characters separately for digits and other
    title = " ".join(data.columns)
    title_text = re.sub(r"\d", "", title)
    title_text_char_count = len(title_text)
    title_num_char_count = len(title) - len(title_text)

    # Estimate the character count introduced by each record in the dataset
    counts["content"] = data.apply(lambda x: " ".join([str(x[col]) for col in data.columns]), axis=1)
    if group_by:
        # Concatenate the content of all records with the same group_by value,
        # and count the number of records in each group
        counts[group_by] = data[group_by]
        grouped_counts = counts.groupby(group_by)["content"].apply(lambda x: "\n".join(x)).to_frame()
        grouped_counts.reset_index(inplace=True)
        grouped_counts["num_rows"] = counts.groupby(group_by).size().values
        counts = grouped_counts
    else:
        counts["num_rows"] = 1

    counts["content_text"] = counts["content"].apply(lambda x: re.sub(r"\d.", "", x))
    counts["content_text_char_count"] = counts["content_text"].apply(lambda x: len(x))
    counts["content_num_char_count"] = counts.apply(lambda x: len(x["content"]) - len(x["content_text"]), axis=1)

    # Estimate the token count from the character count
    # For numbers, every digit is one token; for the rest, we estimate 4 characters per token
    # This is assuming we use TinyLlama, which uses the Llama-2 tokenizer
    counts["estimated_content_token_count"] = counts["content_text_char_count"] / 4 + counts["content_num_char_count"]
    estimated_title_token_count = title_text_char_count / 4 + title_num_char_count

    # Get the token count of the assembled example
    num_columns = data.shape[1]
    # These coefficients are estimated using a linear mixed effects model
    # based on a small number of real or simulated datasets
    counts["num_tokens"] = (
        40  # Roughly accounts for the prompt
        + counts["estimated_content_token_count"]
        # Column names are used twice in the json, plus some json formatting
        + (2 + 0.5 * counts["num_rows"]) * estimated_title_token_count
        # Roughly accounts for the json formatting
        + 4 * num_columns * counts["num_rows"]
    )

    max_token_count = counts.num_tokens.max()
    logger.info(
        f"Estimated max token count for examples in dataset - this is used to determine the rope scaling factor: {max_token_count}"
    )
    return max_token_count
```

choose_rope_scaling_factor(max_token_count, context_length=DEFAULT_MAX_SEQ_LENGTH)

Compute the RoPE scaling factor from the estimated max token count.

Divides max_token_count by context_length, rounds up, and caps the result at MAX_ROPE_SCALING_FACTOR.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| max_token_count | int | Estimated maximum tokens per training example. | required |
| context_length | int | Base context window size. | DEFAULT_MAX_SEQ_LENGTH |

Returns:

| Type | Description |
| --- | --- |
| int | Integer scaling factor in the range [1, MAX_ROPE_SCALING_FACTOR]. |

Source code in src/nemo_safe_synthesizer/config/autoconfig.py

```python
def choose_rope_scaling_factor(max_token_count: int, context_length: int = DEFAULT_MAX_SEQ_LENGTH) -> int:
    """Compute the RoPE scaling factor from the estimated max token count.

    Divides ``max_token_count`` by ``context_length``, rounds up, and
    caps the result at ``MAX_ROPE_SCALING_FACTOR``.

    Args:
        max_token_count: Estimated maximum tokens per training example.
        context_length: Base context window size (default ``DEFAULT_MAX_SEQ_LENGTH``).

    Returns:
        Integer scaling factor in the range [1, ``MAX_ROPE_SCALING_FACTOR``].
    """
    rope_scaling_factor = math.ceil(max_token_count / context_length)
    rope_scaling_factor = min(rope_scaling_factor, MAX_ROPE_SCALING_FACTOR)

    return rope_scaling_factor
```
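A few worked values, assuming a 4096-token base window and a cap of 6 for illustration (the actual DEFAULT_MAX_SEQ_LENGTH and MAX_ROPE_SCALING_FACTOR values are package constants and may differ):

```python
import math

DEFAULT_MAX_SEQ_LENGTH = 4096   # assumed base context window
MAX_ROPE_SCALING_FACTOR = 6     # assumed cap

def choose_rope_scaling_factor(max_token_count: int, context_length: int = DEFAULT_MAX_SEQ_LENGTH) -> int:
    # Ceiling division, then clamp to the maximum supported factor
    return min(math.ceil(max_token_count / context_length), MAX_ROPE_SCALING_FACTOR)

choose_rope_scaling_factor(3_000)    # fits in one window -> 1
choose_rope_scaling_factor(10_000)   # ceil(10000 / 4096) -> 3
choose_rope_scaling_factor(100_000)  # would be 25, capped -> 6
```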