autoconfig

Resolve "auto" sentinel values in config parameters to concrete values.

Inspects dataset characteristics (token counts, record counts) to replace "auto" placeholders in SafeSynthesizerParameters with computed values for rope scaling factor, number of input records to sample, delta, and other training/privacy parameters.

Classes:

| Name | Description |
| --- | --- |
| AutoConfigResolver | Resolve all "auto" sentinel values in SafeSynthesizerParameters. |

Functions:

| Name | Description |
| --- | --- |
| choose_num_input_records_to_sample | Scale training records linearly with the rope scaling factor. |
| get_max_token_count | Estimate the maximum tokens per training example. |
| choose_rope_scaling_factor | Compute the RoPE scaling factor from the estimated max token count. |

AutoConfigResolver(data, config)

Resolve all "auto" sentinel values in SafeSynthesizerParameters.

Inspects the training dataset to compute concrete values for parameters left as "auto" (rope scaling, number of input records, unsloth, delta, max sequences per example). Resolution order matters: rope_scaling_factor is resolved first because num_input_records_to_sample depends on it.
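The order dependency can be sketched with plain values. This is a simplified illustration over a dict, not the resolver itself; the 4096-token base window is an assumed value here, while the ceiling division and the 25,000 multiplier mirror the formulas documented for choose_rope_scaling_factor and choose_num_input_records_to_sample.

```python
import math

# Hypothetical mini-resolution over a plain dict (real parameters live in
# SafeSynthesizerParameters, and the base window is a package constant).
config = {"rope_scaling_factor": "auto", "num_input_records_to_sample": "auto"}
max_token_count = 9_000  # e.g. from get_max_token_count(data, group_by=None)

# Step 1: rope_scaling_factor must be resolved first ...
if config["rope_scaling_factor"] == "auto":
    config["rope_scaling_factor"] = math.ceil(max_token_count / 4096)  # -> 3

# Step 2: ... because num_input_records_to_sample depends on it.
if config["num_input_records_to_sample"] == "auto":
    config["num_input_records_to_sample"] = config["rope_scaling_factor"] * 25_000  # -> 75_000
```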

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | DataFrame | Training dataframe used to derive auto parameters. | required |
| config | SafeSynthesizerParameters | Configuration containing "auto" sentinel values to resolve. | required |

Methods:

| Name | Description |
| --- | --- |
| resolve | Replace all "auto" parameters with concrete values. |

Source code in src/nemo_safe_synthesizer/config/autoconfig.py

```python
def __init__(self, data: pd.DataFrame, config: SafeSynthesizerParameters):
    self._data = data
    self._config = config
    self._record_count = data.shape[0]
    self._delta: float | str | None = config.get("delta")
    self._dp_enabled: bool | None = config.get("dp_enabled")
    self._rope_scaling_factor: int | None = None
```

resolve()

Replace all "auto" parameters with concrete values.

Resolution order matters: rope_scaling_factor is resolved before num_input_records_to_sample because the latter depends on it.

Returns:

| Type | Description |
| --- | --- |
| SafeSynthesizerParameters | A new SafeSynthesizerParameters with all "auto" values resolved. |

Source code in src/nemo_safe_synthesizer/config/autoconfig.py

```python
def resolve(self) -> SafeSynthesizerParameters:
    """Replace all ``"auto"`` parameters with concrete values.

    Resolution order matters: ``rope_scaling_factor`` is resolved before
    ``num_input_records_to_sample`` because the latter depends on it.

    Returns:
        A new ``SafeSynthesizerParameters`` with all ``"auto"`` values resolved.
    """
    # Determine training params (order matters: rope_scaling_factor first)
    training_params: dict[str, Any] = {}
    training_params.update(self._determine_rope_scaling_factor())
    training_params.update(self._determine_num_input_records_to_sample())
    training_params.update(self._determine_use_unsloth())
    training_params.update(self._determine_learning_rate())

    # Determine data params
    data_params: dict[str, Any] = {}
    data_params.update(self._determine_max_sequences_per_example())

    # Determine privacy params
    privacy_params: dict[str, Any] = {}
    privacy_params.update(self._determine_delta())

    return self._build_updated_params(training_params, data_params, privacy_params)
```

choose_num_input_records_to_sample(rope_scaling_factor)

Scale training records linearly with the rope scaling factor.

num_records = rope_scaling_factor * 25000

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| rope_scaling_factor | int | The RoPE scaling multiplier (1 means no scaling). | required |

Returns:

| Type | Description |
| --- | --- |
| int | Number of records to sample for training. |

Source code in src/nemo_safe_synthesizer/config/autoconfig.py

```python
def choose_num_input_records_to_sample(rope_scaling_factor: int) -> int:
    """Scale training records linearly with the rope scaling factor.

    ``num_records = rope_scaling_factor * 25000``

    Args:
        rope_scaling_factor: The RoPE scaling multiplier (1 means no scaling).

    Returns:
        Number of records to sample for training.
    """
    return rope_scaling_factor * 25_000
```

get_max_token_count(data, group_by)

Estimate the maximum tokens per training example.

Accounts for prompt overhead (~40 tokens), column names (repeated in JSON formatting), and content character counts. Digits are counted as one token each; other characters use a 4-chars-per-token heuristic (Llama-2 tokenizer). Samples up to 5,000 records from data for analysis.
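In isolation, the digit-vs-text heuristic looks like this. This is a simplified sketch (the function name is hypothetical); the real function, shown in full below, additionally accounts for prompt, column-name, and JSON-formatting overhead and operates on whole dataframe rows.

```python
import re

def estimate_tokens(text: str) -> float:
    """Rough Llama-2-style estimate: 1 token per digit, 4 chars per token otherwise."""
    non_digits = re.sub(r"\d", "", text)
    return len(non_digits) / 4 + (len(text) - len(non_digits))

estimate_tokens("price 2024")  # "price " is 6 chars -> 1.5 tokens, plus 4 digits -> 5.5
```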

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | DataFrame | Training dataframe to analyze. | required |
| group_by | list[str] \| str \| None | Column(s) used to group records into single training examples. When set, grouped records are concatenated before token estimation. | required |

Returns:

| Type | Description |
| --- | --- |
| int | Estimated maximum token count across all sampled training examples, or 1 if the dataframe is empty. |

Source code in src/nemo_safe_synthesizer/config/autoconfig.py

```python
def get_max_token_count(data: pd.DataFrame, group_by: list[str] | str | None) -> int:
    """Estimate the maximum tokens per training example.

    Accounts for prompt overhead (~40 tokens), column names (repeated in JSON
    formatting), and content character counts. Digits are counted as one token
    each; other characters use a 4-chars-per-token heuristic (Llama-2 tokenizer).
    Samples up to 5,000 records from ``data`` for analysis.

    Args:
        data: Training dataframe to analyze.
        group_by: Column(s) used to group records into single training examples.
            When set, grouped records are concatenated before token estimation.

    Returns:
        Estimated maximum token count across all sampled training examples,
        or 1 if the dataframe is empty.
    """
    if data.size == 0:
        return 1

    # Limit to 5k records to keep run time under 5 seconds
    if data.shape[0] > 5000:
        if group_by:
            # Sort by group_by so that we don't break up groups
            data = data.sort_values(group_by)
        data = data.head(5000)

    counts = pd.DataFrame()
    # Estimate the character count introduced by the column names,
    # counting the characters separately for digits and other
    title = " ".join(data.columns)
    title_text = re.sub(r"\d", "", title)
    title_text_char_count = len(title_text)
    title_num_char_count = len(title) - len(title_text)

    # Estimate the character count introduced by each record in the dataset
    counts["content"] = data.apply(lambda x: " ".join([str(x[col]) for col in data.columns]), axis=1)
    if group_by:
        # Concatenate the content of all records with the same group_by value,
        # and count the number of records in each group
        counts[group_by] = data[group_by]
        grouped_counts = counts.groupby(group_by)["content"].apply(lambda x: "\n".join(x)).to_frame()
        grouped_counts.reset_index(inplace=True)
        grouped_counts["num_rows"] = counts.groupby(group_by).size().values
        counts = grouped_counts
    else:
        counts["num_rows"] = 1

    counts["content_text"] = counts["content"].apply(lambda x: re.sub(r"\d.", "", x))
    counts["content_text_char_count"] = counts["content_text"].apply(lambda x: len(x))
    counts["content_num_char_count"] = counts.apply(lambda x: len(x["content"]) - len(x["content_text"]), axis=1)

    # Estimate the token count from the character count
    # For numbers, every digit is one token; for the rest, we estimate 4 characters per token
    # This is assuming we use TinyLlama, which uses the Llama-2 tokenizer
    counts["estimated_content_token_count"] = counts["content_text_char_count"] / 4 + counts["content_num_char_count"]
    estimated_title_token_count = title_text_char_count / 4 + title_num_char_count

    # Get the token count of the assembled example
    num_columns = data.shape[1]
    # These coefficients are estimated using a linear mixed effects model
    # based on a small number of real or simulated datasets
    counts["num_tokens"] = (
        40  # Roughly accounts for the prompt
        + counts["estimated_content_token_count"]
        # Column names are used twice in the json, plus some json formatting
        + (2 + 0.5 * counts["num_rows"]) * estimated_title_token_count
        # Roughly accounts for the json formatting
        + 4 * num_columns * counts["num_rows"]
    )

    max_token_count = counts.num_tokens.max()
    logger.info(
        f"Estimated max token count for examples in dataset - this is used to determine the rope scaling factor: {max_token_count}"
    )
    return max_token_count
```

choose_rope_scaling_factor(max_token_count, context_length=DEFAULT_MAX_SEQ_LENGTH)

Compute the RoPE scaling factor from the estimated max token count.

Divides max_token_count by context_length, rounds up, and caps the result at MAX_ROPE_SCALING_FACTOR.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| max_token_count | int | Estimated maximum tokens per training example. | required |
| context_length | int | Base context window size. | DEFAULT_MAX_SEQ_LENGTH |

Returns:

| Type | Description |
| --- | --- |
| int | Integer scaling factor in the range [1, MAX_ROPE_SCALING_FACTOR]. |

Source code in src/nemo_safe_synthesizer/config/autoconfig.py

```python
def choose_rope_scaling_factor(max_token_count: int, context_length: int = DEFAULT_MAX_SEQ_LENGTH) -> int:
    """Compute the RoPE scaling factor from the estimated max token count.

    Divides ``max_token_count`` by ``context_length``, rounds up, and
    caps the result at ``MAX_ROPE_SCALING_FACTOR``.

    Args:
        max_token_count: Estimated maximum tokens per training example.
        context_length: Base context window size (default ``DEFAULT_MAX_SEQ_LENGTH``).

    Returns:
        Integer scaling factor in the range [1, ``MAX_ROPE_SCALING_FACTOR``].
    """
    rope_scaling_factor = math.ceil(max_token_count / context_length)
    rope_scaling_factor = min(rope_scaling_factor, MAX_ROPE_SCALING_FACTOR)

    return rope_scaling_factor
```
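A few worked values, assuming a 4096-token base window and a cap of 6 for illustration (the actual DEFAULT_MAX_SEQ_LENGTH and MAX_ROPE_SCALING_FACTOR values are package constants and may differ):

```python
import math

DEFAULT_MAX_SEQ_LENGTH = 4096   # assumed base context window
MAX_ROPE_SCALING_FACTOR = 6     # assumed cap

def choose_rope_scaling_factor(max_token_count: int, context_length: int = DEFAULT_MAX_SEQ_LENGTH) -> int:
    # Ceiling division, then clamp to the maximum supported factor
    return min(math.ceil(max_token_count / context_length), MAX_ROPE_SCALING_FACTOR)

choose_rope_scaling_factor(3_000)    # fits in one window -> 1
choose_rope_scaling_factor(10_000)   # ceil(10000 / 4096) -> 3
choose_rope_scaling_factor(100_000)  # would be 25, capped -> 6
```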