training

`training` ¶

Classes:

Name	Description
`QuantizationScheme`	Quantization schemes supported when `quantize_model=True`.
`TrainingHyperparams`	Hyperparameters that control the training process behavior.

`QuantizationScheme` ¶

Bases: StrEnum

Quantization schemes supported when quantize_model=True.

Members are string values so they serialize cleanly through pydantic and JSON configs. The enum also owns construction of the corresponding transformers quantization_config object; optional ML dependencies stay locally imported in that construction path.

Selection guide:

- bnb-4bit / bnb-8bit: bitsandbytes NF4 / int8. Widest hardware support (Ampere+), works with QLoRA and LoftQ. Default for training.
- fp8: transformers FineGrainedFP8Config. Float8 with block-wise scaling. Requires Hopper (sm_90+) or Blackwell. Inference-leaning.
- nvfp4: NVIDIA FP4 via torchao.prototype.mx_formats.NVFP4WeightOnlyConfig wrapped in TorchAoConfig. Requires Blackwell (sm_100+). Weight-only.
- mxfp4: OCP Microscaling FP4 via transformers Mxfp4Config. Hardware support varies by torch/torchao version.
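
The hardware requirements in the selection guide can be read as a simple threshold check. The sketch below is illustrative only: `candidate_schemes` is a hypothetical helper, not part of this module, and it assumes that "Ampere+" corresponds to compute capability sm_80 or higher. It returns scheme names as plain strings and leaves out mxfp4, whose support depends on the installed torch/torchao version.

```python
def candidate_schemes(sm: int) -> list[str]:
    """List schemes whose stated minimum architecture is met.

    `sm` is the compute capability as an integer, e.g. 80 for sm_80.
    Hypothetical helper for illustration; thresholds follow the guide above.
    """
    candidates = []
    if sm >= 80:  # Ampere or newer (assumed sm_80+): bitsandbytes NF4 / int8
        candidates += ["bnb-4bit", "bnb-8bit"]
    if sm >= 90:  # Hopper or newer: fine-grained FP8
        candidates.append("fp8")
    if sm >= 100:  # Blackwell: NVFP4 weight-only
        candidates.append("nvfp4")
    # mxfp4 is omitted: its support varies by torch/torchao version.
    return candidates
```

Meeting the architecture threshold is only a necessary condition here; the relevant libraries must also be installed for the chosen scheme to load.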

Methods:

Name	Description
`from_alias`	Normalize string and legacy bit-count aliases to a scheme.
`to_transformers_config`	Build the transformers quantization config for this scheme.

Attributes:

Name	Type	Description
`effective_bits`	`int`	Per-parameter bit width for memory estimation.
`is_bitsandbytes`	`bool`	Whether the scheme is implemented via bitsandbytes (QLoRA-compatible).

`effective_bits` `property` ¶

Per-parameter bit width for memory estimation.
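
As a rough illustration of how a per-parameter bit width feeds a memory estimate, the sketch below multiplies a parameter count by a bit width and converts to bytes. `estimate_weight_bytes` is a hypothetical function, and the estimate deliberately ignores quantization metadata (scales, zero points), activations, gradients, and optimizer state, so the real footprint will be larger.

```python
def estimate_weight_bytes(num_params: int, effective_bits: int) -> int:
    """Approximate bytes needed to store the weights alone.

    Hypothetical helper: bits per parameter times parameter count, over 8.
    """
    if num_params < 0 or effective_bits <= 0:
        raise ValueError("num_params must be >= 0 and effective_bits > 0")
    return num_params * effective_bits // 8


# Illustrative numbers only: a 3-billion-parameter model.
bytes_at_4_bits = estimate_weight_bytes(3_000_000_000, 4)
bytes_at_16_bits = estimate_weight_bytes(3_000_000_000, 16)
```

Under these assumptions the 4-bit estimate is one quarter of the 16-bit one, which is the kind of comparison a memory preflight check would care about.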

`is_bitsandbytes` `property` ¶

Whether the scheme is implemented via bitsandbytes (QLoRA-compatible).

`from_alias(scheme)` `classmethod` ¶

Normalize string and legacy bit-count aliases to a scheme.

Source code in src/nemo_safe_synthesizer/config/training.py

@classmethod
def from_alias(cls, scheme: QuantizationScheme | str | Literal[4, 8]) -> QuantizationScheme:
    """Normalize string and legacy bit-count aliases to a scheme."""
    if isinstance(scheme, int):
        legacy_aliases = {
            4: cls.BNB_4BIT,
            8: cls.BNB_8BIT,
        }
        try:
            return legacy_aliases[scheme]
        except KeyError as exc:
            raise ValueError(f"Unknown quantization bit-count alias: {scheme!r}. Expected 4 or 8.") from exc
    return cls(scheme)
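
The behavior of `from_alias` can be exercised without the ML dependencies by using a stand-in enum. The sketch below is not the real class: `SchemeSketch` is a hypothetical subset of `QuantizationScheme`, its member values are assumed from the names accepted by `quantization_scheme`, and it uses a `str`/`Enum` mixin rather than `StrEnum` so that it also runs on Python versions before 3.11. `try_alias` is likewise a hypothetical helper.

```python
from enum import Enum


class SchemeSketch(str, Enum):
    """Hypothetical stand-in for QuantizationScheme (member values assumed)."""

    BNB_4BIT = "bnb-4bit"
    BNB_8BIT = "bnb-8bit"
    FP8 = "fp8"

    @classmethod
    def from_alias(cls, scheme):
        # Legacy integer bit counts map onto the bitsandbytes schemes.
        if isinstance(scheme, int):
            legacy_aliases = {4: cls.BNB_4BIT, 8: cls.BNB_8BIT}
            try:
                return legacy_aliases[scheme]
            except KeyError as exc:
                raise ValueError(
                    f"Unknown quantization bit-count alias: {scheme!r}. Expected 4 or 8."
                ) from exc
        # Strings and existing members go through the normal enum value lookup.
        return cls(scheme)


def try_alias(value):
    """Return the resolved member, or the error message if resolution fails."""
    try:
        return SchemeSketch.from_alias(value)
    except ValueError as exc:
        return str(exc)
```

In this sketch, `from_alias(4)` and `from_alias("bnb-4bit")` resolve to the same member, while an unsupported integer such as 16 or an unknown string raises `ValueError`.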

`to_transformers_config()` ¶

Build the transformers quantization config for this scheme.

Source code in src/nemo_safe_synthesizer/config/training.py

def to_transformers_config(self) -> QuantizationConfigMixin:
    """Build the transformers quantization config for this scheme."""
    match self:
        case QuantizationScheme.BNB_4BIT:
            return self._bnb_4bit_config()
        case QuantizationScheme.BNB_8BIT:
            return self._bnb_8bit_config()
        case QuantizationScheme.FP8:
            return self._fp8_config()
        case QuantizationScheme.NVFP4:
            return self._nvfp4_config()
        case QuantizationScheme.MXFP4:
            return self._mxfp4_config()
    raise ValueError(f"Unknown quantization scheme: {self!r}")

`TrainingHyperparams` `pydantic-model` ¶

Bases: Parameters

Hyperparameters that control the training process behavior.

This class contains all the fine-tuning hyperparameters that control how the model learns, including learning rates, batch sizes, LoRA configuration, and optimization settings. These parameters directly affect training performance and quality.

Fields:

num_input_records_to_sample (AutoIntParam)
batch_size (int)
gradient_accumulation_steps (int)
weight_decay (float)
warmup_ratio (float)
lr_scheduler (str)
learning_rate (AutoFloatParam)
lora_r (int)
lora_alpha_over_r (float)
lora_target_modules (list[str])
rope_scaling_factor (OptionalAutoInt)
validation_ratio (float)
validation_steps (int)
pretrained_model (str)
quantize_model (bool)
quantization_bits (Literal[4, 8])
quantization_scheme (QuantizationScheme | None)
peft_implementation (str)
max_vram_fraction (float)
attn_implementation (str)

`num_input_records_to_sample` `pydantic-field` ¶

Number of records the model will see during training. This parameter is a proxy for training time. For example, if its value is the same size as the input dataset, this is like training for a single epoch. If its value is larger, this is like training for multiple (possibly fractional) epochs. If its value is smaller, this is like training for a fraction of an epoch. Supports 'auto' where a reasonable value is chosen based on other config params and data.
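
The relationship between this parameter and epochs is a simple ratio. The sketch below makes it concrete; `approx_epochs` is a hypothetical helper, and the dataset size is an illustrative number rather than anything taken from the library.

```python
def approx_epochs(num_input_records_to_sample: int, dataset_size: int) -> float:
    """Express the sampling budget as an equivalent (possibly fractional) epoch count."""
    if dataset_size <= 0:
        raise ValueError("dataset_size must be positive")
    return num_input_records_to_sample / dataset_size


# Illustrative dataset of 10,000 records.
one_epoch = approx_epochs(10_000, 10_000)       # same size as the dataset
multi_epoch = approx_epochs(25_000, 10_000)     # larger than the dataset
partial_epoch = approx_epochs(4_000, 10_000)    # smaller than the dataset
```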

`batch_size` `pydantic-field` ¶

The batch size per device for training. Must be >= 1.

`gradient_accumulation_steps` `pydantic-field` ¶

Number of steps over which gradients are accumulated before each optimizer update. Accumulation raises the effective batch size without increasing the per-step GPU memory footprint. Must be >= 1.

`weight_decay` `pydantic-field` ¶

The weight decay that the AdamW optimizer applies to all layers except bias and LayerNorm weights. Must be in (0, 1).

`warmup_ratio` `pydantic-field` ¶

Ratio of total training steps used for a linear warmup from 0 to the learning rate. Must be > 0.
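
A minimal sketch of how a warmup ratio translates into a step count and a ramped learning rate. It assumes the warmup length is the ceiling of `total_steps * warmup_ratio` and that the rate grows linearly from 0; the function names are hypothetical, and what happens after warmup depends on the chosen `lr_scheduler`, so the sketch simply returns the peak rate there.

```python
import math


def warmup_steps(total_steps: int, warmup_ratio: float) -> int:
    """Number of warmup steps, assuming ceiling rounding."""
    return math.ceil(total_steps * warmup_ratio)


def lr_during_warmup(step: int, total_steps: int, warmup_ratio: float, peak_lr: float) -> float:
    """Learning rate at `step` under a linear ramp from 0 to `peak_lr`."""
    n = warmup_steps(total_steps, warmup_ratio)
    if n == 0 or step >= n:
        # Past warmup: the scheduler takes over; peak_lr is a placeholder here.
        return peak_lr
    return peak_lr * step / n
```

For example, with 80 total steps and a ratio of 0.125, warmup lasts 10 steps and the rate at step 5 is half the peak.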

`lr_scheduler` `pydantic-field` ¶

The scheduler type to use. See the HuggingFace documentation of SchedulerType for all possible values.

`learning_rate` `pydantic-field` ¶

The initial learning rate for the AdamW optimizer. Must be in (0, 1). Setting to 'auto' uses a model-specific default if one exists.

`lora_r` `pydantic-field` ¶

The rank of the LoRA update matrices. Lower rank results in smaller update matrices with fewer trainable parameters. Must be > 0.

`lora_alpha_over_r` `pydantic-field` ¶

The ratio of the LoRA scaling factor (alpha) to the LoRA rank. Empirically, this parameter works well when set to 0.5, 1, or 2. Must be in [0.5, 3].
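
Because this field is a ratio, the absolute LoRA alpha follows from it and `lora_r`. The sketch below shows the conversion; `lora_alpha_from_ratio` is a hypothetical helper, and the range check mirrors the bounds stated above rather than the library's actual validator.

```python
def lora_alpha_from_ratio(lora_r: int, lora_alpha_over_r: float) -> float:
    """Recover the LoRA scaling factor alpha from the rank and the ratio."""
    if lora_r <= 0:
        raise ValueError("lora_r must be > 0")
    if not 0.5 <= lora_alpha_over_r <= 3:
        raise ValueError("lora_alpha_over_r must be in [0.5, 3]")
    return lora_alpha_over_r * lora_r


alpha_half = lora_alpha_from_ratio(32, 0.5)   # alpha is half the rank
alpha_double = lora_alpha_from_ratio(16, 2)   # alpha is twice the rank
```

Expressing the setting as a ratio keeps the update scaling (alpha divided by rank in standard LoRA) constant when the rank is changed.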

`lora_target_modules` `pydantic-field` ¶

The list of transformer modules to apply LoRA to. Possible modules: 'q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'.

`rope_scaling_factor` `pydantic-field` ¶

Scale the base LLM's context length by this factor using RoPE scaling. Must be >= 1 or 'auto'.

`validation_ratio` `pydantic-field` ¶

The fraction of the training data used for validation. Must be in [0, 1]. If set to 0, no validation will be performed. If set larger than 0, validation loss will be computed and reported throughout training.

`validation_steps` `pydantic-field` ¶

The number of training steps between validation checks, as passed to the HF Trainer arguments. Must be > 0.

`pretrained_model` `pydantic-field` ¶

Pretrained model to use for fine-tuning. Defaults to SmolLM3. May be a Hugging Face model ID (loaded from the Hugging Face Hub or cache) or a local path. See the security note in the docs before using untrusted sources.

`quantize_model` `pydantic-field` ¶

Whether to quantize the model during training. This can reduce memory usage and potentially speed up training, but may also impact model accuracy.

`quantization_bits` `pydantic-field` ¶

Deprecated: use quantization_scheme instead. Bit width for bitsandbytes quantization when quantization_scheme is not set (back-compat alias: 4 → bnb-4bit, 8 → bnb-8bit).

`quantization_scheme` `pydantic-field` ¶

Quantization scheme to use when quantize_model=True. Accepts bnb-4bit, bnb-8bit, fp8, nvfp4, or mxfp4. If unset, falls back to quantization_bits for backward compatibility. Non-bitsandbytes schemes are incompatible with peft_implementation='loftq'.
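
The interaction between `quantize_model`, `quantization_scheme`, and the deprecated `quantization_bits` can be sketched as a small resolution function. This is an assumption-laden illustration, not the library's validator: `resolve_scheme` is hypothetical, it works on plain strings instead of `QuantizationScheme` members, and returning `None` when quantization is disabled is a choice made for the sketch.

```python
from typing import Optional

# Back-compat mapping described for quantization_bits.
BITS_TO_SCHEME = {4: "bnb-4bit", 8: "bnb-8bit"}
BITSANDBYTES_SCHEMES = {"bnb-4bit", "bnb-8bit"}


def resolve_scheme(
    quantize_model: bool,
    quantization_bits: int,
    quantization_scheme: Optional[str] = None,
    peft_implementation: str = "lora",
) -> Optional[str]:
    """Pick the scheme to use, preferring the explicit setting (hypothetical helper)."""
    if not quantize_model:
        return None  # assumption: nothing to resolve when quantization is off
    if quantization_scheme is not None:
        scheme = quantization_scheme
    else:
        scheme = BITS_TO_SCHEME[quantization_bits]  # fallback for legacy configs
    if peft_implementation == "loftq" and scheme not in BITSANDBYTES_SCHEMES:
        raise ValueError(f"{scheme!r} is not compatible with peft_implementation='loftq'")
    return scheme
```

Under these assumptions a legacy config that sets only `quantization_bits=4` resolves to bnb-4bit, while an explicit `quantization_scheme` always takes precedence.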

`peft_implementation` `pydantic-field` ¶

The PEFT (Parameter-Efficient Fine-Tuning) implementation to use. Options: 'lora' for Low-Rank Adaptation, 'QLORA' for Quantized LoRA.

`max_vram_fraction` `pydantic-field` ¶

The fraction of the total VRAM to use for training. Modify this to allow longer sequences. Must be in [0, 1].

`attn_implementation` `pydantic-field` ¶

The attention implementation to use for model loading. Default uses 'sdpa' (PyTorch scaled dot product attention) for broad compatibility. Other common values: 'flash_attention_2' (requires flash-attn pip package), 'flash_attention_3' (requires flash-attn-3 support), 'eager' (standard PyTorch). Custom HuggingFace Kernels Hub paths (e.g. 'kernels-community/flash-attn2') are also supported.

`effective_batch_size` `property` ¶

Effective batch size = batch_size * gradient_accumulation_steps.

This is the number of examples that contribute to each optimizer update (the "global" batch seen by the loss curve). Canonical source for any caller that needs this product -- used by preflight checks and logged by the training callbacks.
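
A short worked example of the product above. The first function restates the documented formula; `optimizer_updates` is a hypothetical addition that assumes one optimizer update per effective batch and rounds a final partial batch up, which may not match how the trainer actually counts steps.

```python
import math


def effective_batch_size(batch_size: int, gradient_accumulation_steps: int) -> int:
    """Examples contributing to each optimizer update."""
    return batch_size * gradient_accumulation_steps


def optimizer_updates(num_records: int, batch_size: int, gradient_accumulation_steps: int) -> int:
    """Approximate number of optimizer updates needed to see `num_records` examples."""
    per_update = effective_batch_size(batch_size, gradient_accumulation_steps)
    return math.ceil(num_records / per_update)


# Illustrative values: batch_size=4, gradient_accumulation_steps=8.
per_update = effective_batch_size(4, 8)
updates = optimizer_updates(10_000, 4, 8)
```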
