autoconfig
Resolve "auto" sentinel values in config parameters to concrete values.
Inspects dataset characteristics (token counts, record counts) to replace
"auto" placeholders in SafeSynthesizerParameters with computed values
for rope scaling factor, number of input records to sample, delta, and other
training/privacy parameters.
Classes:
| Name | Description |
|---|---|
| AutoConfigResolver | Resolve all "auto" sentinel values in SafeSynthesizerParameters. |
Functions:
| Name | Description |
|---|---|
| choose_num_input_records_to_sample | Scale training records linearly with the rope scaling factor. |
| get_max_token_count | Estimate the maximum tokens per training example. |
| choose_rope_scaling_factor | Compute the RoPE scaling factor from the estimated max token count. |
AutoConfigResolver(data, config)
Resolve all "auto" sentinel values in SafeSynthesizerParameters.
Inspects the training dataset to compute concrete values for parameters
left as "auto" (rope scaling, number of input records, unsloth,
delta, max sequences per example). Resolution order matters:
rope_scaling_factor is resolved first because
num_input_records_to_sample depends on it.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | Training dataframe used to derive auto parameters. | required |
| config | SafeSynthesizerParameters | Configuration containing "auto" sentinel values to resolve. | required |
Methods:
| Name | Description |
|---|---|
| resolve | Replace all "auto" parameters with concrete values. |
Source code in src/nemo_safe_synthesizer/config/autoconfig.py
resolve()
Replace all "auto" parameters with concrete values.
Resolution order matters: rope_scaling_factor is resolved before
num_input_records_to_sample because the latter depends on it.
Returns:
| Type | Description |
|---|---|
| SafeSynthesizerParameters | A new SafeSynthesizerParameters with all "auto" parameters replaced by concrete values. |
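
The resolution order described above can be illustrated with a minimal sketch. It is not the library's implementation: `Params` is a hypothetical stand-in for SafeSynthesizerParameters with only two fields, the default `context_length` of 2048 and `max_factor` of 6 are placeholder values, and leaving explicitly set values untouched is an assumption.

```python
from dataclasses import dataclass, replace
from typing import Literal, Union

AutoInt = Union[int, Literal["auto"]]


@dataclass(frozen=True)
class Params:
    """Hypothetical stand-in for SafeSynthesizerParameters (two fields only)."""

    rope_scaling_factor: AutoInt = "auto"
    num_input_records_to_sample: AutoInt = "auto"


def resolve(
    params: Params,
    max_token_count: int,
    context_length: int = 2048,  # placeholder, not the library's constant
    max_factor: int = 6,         # placeholder, not the library's constant
) -> Params:
    """Return a new Params with "auto" fields replaced by concrete values."""
    # Step 1: resolve the RoPE factor first (ceiling division, then clamp).
    rope = params.rope_scaling_factor
    if rope == "auto":
        rope = max(1, min(-(-max_token_count // context_length), max_factor))

    # Step 2: the record count depends on the resolved RoPE factor.
    num_records = params.num_input_records_to_sample
    if num_records == "auto":
        num_records = rope * 25_000

    # Build a new object rather than mutating the input.
    return replace(
        params,
        rope_scaling_factor=rope,
        num_input_records_to_sample=num_records,
    )


resolved = resolve(Params(), max_token_count=5000)
print(resolved)
```

Swapping the two steps would leave the record count with nothing concrete to multiply, which is why the ordering matters.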
choose_num_input_records_to_sample(rope_scaling_factor)
Scale training records linearly with the rope scaling factor.
num_records = rope_scaling_factor * 25000
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| rope_scaling_factor | int | The RoPE scaling multiplier (1 means no scaling). | required |
Returns:
| Type | Description |
|---|---|
| int | Number of records to sample for training. |
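
As a quick illustration of the formula above, the following sketch reimplements the calculation in standalone Python. The constant name `RECORDS_PER_ROPE_UNIT` is made up for this example; only the multiplier 25000 comes from the documentation.

```python
# Hypothetical constant name; the value 25000 is taken from the formula above.
RECORDS_PER_ROPE_UNIT = 25_000


def choose_num_input_records_to_sample(rope_scaling_factor: int) -> int:
    """Scale the number of sampled training records linearly with the RoPE factor."""
    return rope_scaling_factor * RECORDS_PER_ROPE_UNIT


# Print the record count for a few scaling factors.
for factor in (1, 2, 4):
    print(factor, choose_num_input_records_to_sample(factor))
```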
get_max_token_count(data, group_by)
Estimate the maximum tokens per training example.
Accounts for prompt overhead (~40 tokens), column names (repeated in JSON
formatting), and content character counts. Digits are counted as one token
each; other characters use a 4-chars-per-token heuristic (Llama-2 tokenizer).
Samples up to 5,000 records from data for analysis.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | Training dataframe to analyze. | required |
| group_by | list[str] \| str \| None | Column(s) used to group records into single training examples. When set, grouped records are concatenated before token estimation. | required |
Returns:
| Type | Description |
|---|---|
| int | Estimated maximum token count across all sampled training examples, or 1 if the dataframe is empty. |
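
The heuristic described above can be sketched roughly as follows. This is an illustration under several assumptions, not the library's code: it works on a plain list of dicts instead of a DataFrame, it serializes each record with `json.dumps` so that column names are repeated per record, and the names `estimate_text_tokens` and `estimate_max_token_count` are hypothetical.

```python
import json
import math
import random

PROMPT_OVERHEAD_TOKENS = 40  # approximate prompt overhead from the description
CHARS_PER_TOKEN = 4          # heuristic for non-digit characters
MAX_SAMPLED_RECORDS = 5_000  # sample cap from the description


def estimate_text_tokens(text: str) -> int:
    """Count each digit as one token and other characters at 4 chars per token."""
    digits = sum(ch.isdigit() for ch in text)
    other = len(text) - digits
    return digits + math.ceil(other / CHARS_PER_TOKEN)


def estimate_max_token_count(records, group_by=None, seed=0):
    """Estimate the largest training example, in tokens, over a sample of records."""
    if not records:
        return 1  # mirrors the documented result for an empty dataframe

    if len(records) > MAX_SAMPLED_RECORDS:
        records = random.Random(seed).sample(records, MAX_SAMPLED_RECORDS)

    if group_by is None:
        # One record per training example.
        examples = [json.dumps(r, default=str) for r in records]
    else:
        # Records sharing the same group key are concatenated into one example.
        keys = [group_by] if isinstance(group_by, str) else list(group_by)
        groups = {}
        for r in records:
            key = tuple(r[k] for k in keys)
            groups.setdefault(key, []).append(json.dumps(r, default=str))
        examples = ["\n".join(rows) for rows in groups.values()]

    return PROMPT_OVERHEAD_TOKENS + max(estimate_text_tokens(e) for e in examples)


rows = [{"id": 1, "v": "a"}, {"id": 1, "v": "b"}, {"id": 2, "v": "c"}]
print(estimate_max_token_count(rows))
print(estimate_max_token_count(rows, group_by="id"))
```

Grouping raises the estimate because the two records with the same `id` are counted as a single, longer training example.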
choose_rope_scaling_factor(max_token_count, context_length=DEFAULT_MAX_SEQ_LENGTH)
Compute the RoPE scaling factor from the estimated max token count.
Divides max_token_count by context_length, rounds up, and
caps the result at MAX_ROPE_SCALING_FACTOR.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| max_token_count | int | Estimated maximum tokens per training example. | required |
| context_length | int | Base context window size (default DEFAULT_MAX_SEQ_LENGTH). | DEFAULT_MAX_SEQ_LENGTH |
Returns:
| Type | Description |
|---|---|
| int | Integer scaling factor in the range [1, MAX_ROPE_SCALING_FACTOR]. |
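
The divide, round up, and cap steps can be sketched as below. The values assigned to `DEFAULT_MAX_SEQ_LENGTH` and `MAX_ROPE_SCALING_FACTOR` are placeholders chosen for this example, since the real constants are not stated here, and the explicit lower clamp to 1 is an assumption that matches the documented return range.

```python
import math

# Placeholder values for illustration; the library defines the real constants.
DEFAULT_MAX_SEQ_LENGTH = 2048
MAX_ROPE_SCALING_FACTOR = 6


def choose_rope_scaling_factor(
    max_token_count: int,
    context_length: int = DEFAULT_MAX_SEQ_LENGTH,
) -> int:
    """Divide by the context length, round up, and clamp to [1, MAX_ROPE_SCALING_FACTOR]."""
    factor = math.ceil(max_token_count / context_length)
    return max(1, min(factor, MAX_ROPE_SCALING_FACTOR))


# Show how the factor steps up as examples outgrow the base context window.
for tokens in (500, 2048, 2049, 9000, 1_000_000):
    print(tokens, choose_rope_scaling_factor(tokens))
```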