
training


Classes:

Name                 Description
QuantizationScheme   Quantization schemes supported when quantize_model=True.
TrainingHyperparams  Hyperparameters that control the training process behavior.

QuantizationScheme

Bases: StrEnum

Quantization schemes supported when quantize_model=True.

Members are string values, so they serialize cleanly through pydantic and JSON configs. The enum also owns construction of the corresponding transformers quantization_config object; optional ML dependencies are imported locally inside that construction path, so they are only needed when a config is actually built.

Selection guide:

- bnb-4bit / bnb-8bit: bitsandbytes NF4 / int8. Widest hardware support (Ampere+), works with QLoRA and LoftQ. Default for training.
- fp8: transformers FineGrainedFP8Config. Float8 with block-wise scaling. Requires Hopper (sm_90+) or Blackwell. Inference-leaning.
- nvfp4: NVIDIA FP4 via torchao.prototype.mx_formats.NVFP4WeightOnlyConfig wrapped in TorchAoConfig. Requires Blackwell (sm_100+). Weight-only.
- mxfp4: OCP Microscaling FP4 via transformers Mxfp4Config. Hardware support varies by torch/torchao version.
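The hardware floors in the selection guide can be written down as a small lookup. The sketch below is illustrative only: MIN_COMPUTE_CAPABILITY and is_supported are hypothetical names that are not part of this module, and the (8, 0) floor for bitsandbytes is an assumed reading of "Ampere+".

```python
from typing import Optional, Tuple

# Minimum CUDA compute capability per scheme, transcribed from the selection guide.
# None means the guide gives no fixed floor (mxfp4 depends on torch/torchao version).
# Hypothetical table: not part of nemo_safe_synthesizer.
MIN_COMPUTE_CAPABILITY = {
    "bnb-4bit": (8, 0),   # Ampere+ (assumed to mean sm_80 and newer)
    "bnb-8bit": (8, 0),   # Ampere+ (assumed to mean sm_80 and newer)
    "fp8": (9, 0),        # Hopper (sm_90+) or Blackwell
    "nvfp4": (10, 0),     # Blackwell (sm_100+)
    "mxfp4": None,        # varies by torch/torchao version
}


def is_supported(scheme: str, capability: Tuple[int, int]) -> Optional[bool]:
    """Return True/False when the guide gives a floor, None when it does not."""
    floor = MIN_COMPUTE_CAPABILITY[scheme]
    if floor is None:
        return None
    # Tuples compare element-wise, so (9, 0) >= (8, 0) is True.
    return capability >= floor


print(is_supported("fp8", (8, 6)))     # an Ampere consumer GPU
print(is_supported("nvfp4", (10, 0)))  # a Blackwell GPU
```

With PyTorch installed, the capability tuple can be obtained from torch.cuda.get_device_capability().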

Methods:

Name                    Description
from_alias              Normalize string and legacy bit-count aliases to a scheme.
to_transformers_config  Build the transformers quantization config for this scheme.

Attributes:

Name             Type  Description
effective_bits   int   Per-parameter bit width for memory estimation.
is_bitsandbytes  bool  Whether the scheme is implemented via bitsandbytes (QLoRA-compatible).

effective_bits property

Per-parameter bit width for memory estimation.
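As a rough illustration of how a per-parameter bit width feeds a memory estimate, the sketch below multiplies a parameter count by a bit width. estimate_weight_bytes is a hypothetical helper, not part of this module, and the estimate covers quantized weights only: it ignores scaling factors, activations, LoRA adapters, and optimizer state.

```python
def estimate_weight_bytes(num_params: int, effective_bits: int) -> int:
    """Approximate bytes needed for the model weights alone.

    Hypothetical helper for illustration; the real preflight estimate may differ.
    """
    if num_params < 0 or effective_bits <= 0:
        raise ValueError("num_params must be >= 0 and effective_bits must be > 0")
    # bits -> bytes, rounding down
    return num_params * effective_bits // 8


# A 3-billion-parameter model at 4 bits per parameter versus 16 bits per parameter.
four_bit = estimate_weight_bytes(3_000_000_000, 4)
sixteen_bit = estimate_weight_bytes(3_000_000_000, 16)
print(four_bit / 1e9, "GB vs", sixteen_bit / 1e9, "GB")
```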

is_bitsandbytes property

Whether the scheme is implemented via bitsandbytes (QLoRA-compatible).

from_alias(scheme) classmethod

Normalize string and legacy bit-count aliases to a scheme.

Source code in src/nemo_safe_synthesizer/config/training.py
@classmethod
def from_alias(cls, scheme: QuantizationScheme | str | Literal[4, 8]) -> QuantizationScheme:
    """Normalize string and legacy bit-count aliases to a scheme."""
    if isinstance(scheme, int):
        legacy_aliases = {
            4: cls.BNB_4BIT,
            8: cls.BNB_8BIT,
        }
        try:
            return legacy_aliases[scheme]
        except KeyError as exc:
            raise ValueError(f"Unknown quantization bit-count alias: {scheme!r}. Expected 4 or 8.") from exc
    return cls(scheme)
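The behavior of from_alias can be tried without installing the package by using a stand-in enum. The sketch below reproduces the method on a minimal copy of the class; the member values are taken from the scheme names listed on this page, and the (str, Enum) base is an assumption used here in place of StrEnum so the example also runs on Python versions older than 3.11.

```python
from enum import Enum


class QuantizationScheme(str, Enum):
    """Stand-in for the real enum, for illustration only."""

    BNB_4BIT = "bnb-4bit"
    BNB_8BIT = "bnb-8bit"
    FP8 = "fp8"
    NVFP4 = "nvfp4"
    MXFP4 = "mxfp4"

    @classmethod
    def from_alias(cls, scheme):
        # Legacy integer bit counts map onto the bitsandbytes schemes.
        if isinstance(scheme, int):
            legacy_aliases = {4: cls.BNB_4BIT, 8: cls.BNB_8BIT}
            try:
                return legacy_aliases[scheme]
            except KeyError as exc:
                raise ValueError(
                    f"Unknown quantization bit-count alias: {scheme!r}. Expected 4 or 8."
                ) from exc
        # Strings and existing members go through normal enum value lookup.
        return cls(scheme)


print(QuantizationScheme.from_alias(4).value)      # bnb-4bit
print(QuantizationScheme.from_alias("fp8").value)  # fp8

try:
    QuantizationScheme.from_alias(16)
except ValueError as err:
    print(err)
```

An unknown string such as "int3" also raises ValueError, but that error comes from the enum's own value lookup rather than from the alias table.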

to_transformers_config()

Build the transformers quantization config for this scheme.

Source code in src/nemo_safe_synthesizer/config/training.py
def to_transformers_config(self) -> QuantizationConfigMixin:
    """Build the transformers quantization config for this scheme."""
    match self:
        case QuantizationScheme.BNB_4BIT:
            return self._bnb_4bit_config()
        case QuantizationScheme.BNB_8BIT:
            return self._bnb_8bit_config()
        case QuantizationScheme.FP8:
            return self._fp8_config()
        case QuantizationScheme.NVFP4:
            return self._nvfp4_config()
        case QuantizationScheme.MXFP4:
            return self._mxfp4_config()
    raise ValueError(f"Unknown quantization scheme: {self!r}")

TrainingHyperparams pydantic-model

Bases: Parameters

Hyperparameters that control the training process behavior.

This class contains all the fine-tuning hyperparameters that control how the model learns, including learning rates, batch sizes, LoRA configuration, and optimization settings. These parameters directly affect training performance and quality.

Fields:

num_input_records_to_sample pydantic-field

Number of records the model will see during training. This parameter is a proxy for training time. For example, if its value is the same size as the input dataset, this is like training for a single epoch. If its value is larger, this is like training for multiple (possibly fractional) epochs. If its value is smaller, this is like training for a fraction of an epoch. Supports 'auto' where a reasonable value is chosen based on other config params and data.
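The relationship between this field and epochs is a simple ratio. The helper below is a hypothetical illustration (epoch_equivalent is not part of this module) and does not model how 'auto' picks a value.

```python
def epoch_equivalent(num_input_records_to_sample: int, dataset_size: int) -> float:
    """Express the record budget as a (possibly fractional) number of epochs."""
    if dataset_size <= 0:
        raise ValueError("dataset_size must be positive")
    return num_input_records_to_sample / dataset_size


print(epoch_equivalent(10_000, 10_000))  # 1.0  (one epoch)
print(epoch_equivalent(25_000, 10_000))  # 2.5  (multiple, fractional epochs)
print(epoch_equivalent(5_000, 10_000))   # 0.5  (half an epoch)
```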

batch_size pydantic-field

The batch size per device for training. Must be >= 1.

gradient_accumulation_steps pydantic-field

Number of steps over which gradients are accumulated before an optimizer update is performed. This technique increases the effective batch size without increasing the per-step memory needed on the GPU. Must be >= 1.

weight_decay pydantic-field

The weight decay applied by the AdamW optimizer to all layers except bias and LayerNorm weights. Must be in (0, 1).

warmup_ratio pydantic-field

Ratio of total training steps used for a linear warmup from 0 to the learning rate. Must be > 0.
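A linear warmup from 0 to the learning rate can be sketched as follows. Both functions are hypothetical helpers written for illustration; in particular, rounding the warmup length up to a whole step is an assumption, and the scheduler actually used in training may round differently.

```python
import math


def warmup_steps(total_steps: int, warmup_ratio: float) -> int:
    """Number of warmup steps for a given ratio (rounding up is an assumption)."""
    return math.ceil(total_steps * warmup_ratio)


def warmup_lr(step: int, n_warmup: int, learning_rate: float) -> float:
    """Learning rate at `step` during a linear ramp from 0 to `learning_rate`."""
    if step >= n_warmup:
        return learning_rate  # warmup finished; the scheduler takes over from here
    return learning_rate * step / max(1, n_warmup)


n = warmup_steps(total_steps=800, warmup_ratio=0.125)
for step in (0, n // 2, n):
    print(step, warmup_lr(step, n, learning_rate=1.0))
```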

lr_scheduler pydantic-field

The scheduler type to use. See the HuggingFace documentation of SchedulerType for all possible values.

learning_rate pydantic-field

The initial learning rate for AdamW optimizer. Must be in (0, 1). Setting to 'auto' uses a model-specific default if one exists.

lora_r pydantic-field

The rank of the LoRA update matrices. Lower rank results in smaller update matrices with fewer trainable parameters. Must be > 0.

lora_alpha_over_r pydantic-field

The ratio of the LoRA scaling factor (alpha) to the LoRA rank. Empirically, this parameter works well when set to 0.5, 1, or 2. Must be in [0.5, 3].
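Because the field stores a ratio rather than alpha itself, the alpha value passed to a LoRA implementation has to be derived from it. The sketch below shows that arithmetic under the constraints stated on this page; lora_alpha is a hypothetical helper, not a function of this module.

```python
def lora_alpha(lora_r: int, lora_alpha_over_r: float) -> float:
    """Derive the LoRA scaling factor alpha from the rank and the alpha/r ratio."""
    if lora_r <= 0:
        raise ValueError("lora_r must be > 0")
    if not 0.5 <= lora_alpha_over_r <= 3:
        raise ValueError("lora_alpha_over_r must be in [0.5, 3]")
    return lora_alpha_over_r * lora_r


print(lora_alpha(32, 0.5))  # 16.0
print(lora_alpha(16, 2))    # 32
```

Expressing the setting as a ratio keeps the relative scale of the LoRA update unchanged when only lora_r is tuned.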

lora_target_modules pydantic-field

The list of transformer modules to apply LoRA to. Possible modules: 'q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'.

rope_scaling_factor pydantic-field

Scale the base LLM's context length by this factor using RoPE scaling. Must be >= 1 or 'auto'.

validation_ratio pydantic-field

The fraction of the training data used for validation. Must be in [0, 1]. If set to 0, no validation will be performed. If set larger than 0, validation loss will be computed and reported throughout training.

validation_steps pydantic-field

The number of steps between validation checks for the HF Trainer arguments. Must be > 0.

pretrained_model pydantic-field

Pretrained model to use for fine-tuning. Defaults to SmolLM3. May be a Hugging Face model ID (loaded from the Hugging Face Hub or cache) or a local path. See security note in docs before using untrusted sources.

quantize_model pydantic-field

Whether to quantize the model during training. This can reduce memory usage and potentially speed up training, but may also impact model accuracy.

quantization_bits pydantic-field

Deprecated: use quantization_scheme instead. Bit width for bitsandbytes quantization when quantization_scheme is not set (back-compat alias: 4 → bnb-4bit, 8 → bnb-8bit).

quantization_scheme pydantic-field

Quantization scheme to use when quantize_model=True. Accepts bnb-4bit, bnb-8bit, fp8, nvfp4, or mxfp4. If unset, falls back to quantization_bits for backward compatibility. Non-bitsandbytes schemes are incompatible with peft_implementation='loftq'.
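The fallback and compatibility rules described for quantization_scheme can be summarized in a few lines. This is a sketch of the documented behavior, not the module's validator: resolve_scheme and its constants are hypothetical names, and the real checks may live elsewhere and raise different errors.

```python
from typing import Optional

# Hypothetical constants for illustration only.
BITSANDBYTES_SCHEMES = {"bnb-4bit", "bnb-8bit"}
LEGACY_BITS = {4: "bnb-4bit", 8: "bnb-8bit"}


def resolve_scheme(
    quantize_model: bool,
    quantization_scheme: Optional[str],
    quantization_bits: Optional[int],
    peft_implementation: str,
) -> Optional[str]:
    """Pick the scheme that would apply, or None when quantization is off."""
    if not quantize_model:
        return None
    scheme = quantization_scheme
    if scheme is None:
        # Back-compat path: fall back to the deprecated bit-count field.
        if quantization_bits not in LEGACY_BITS:
            raise ValueError(f"Expected quantization_bits of 4 or 8, got {quantization_bits!r}")
        scheme = LEGACY_BITS[quantization_bits]
    if peft_implementation == "loftq" and scheme not in BITSANDBYTES_SCHEMES:
        raise ValueError(f"{scheme!r} is incompatible with peft_implementation='loftq'")
    return scheme


print(resolve_scheme(True, None, 8, "lora"))   # bnb-8bit
print(resolve_scheme(True, "fp8", 4, "lora"))  # fp8 (explicit scheme wins)
```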

peft_implementation pydantic-field

The PEFT (Parameter-Efficient Fine-Tuning) implementation to use. Options: 'lora' for Low-Rank Adaptation, 'QLORA' for Quantized LoRA.

max_vram_fraction pydantic-field

The fraction of the total VRAM to use for training. Modify this to allow longer sequences. Must be in [0, 1].

attn_implementation pydantic-field

The attention implementation to use for model loading. Default uses 'sdpa' (PyTorch scaled dot product attention) for broad compatibility. Other common values: 'flash_attention_2' (requires flash-attn pip package), 'flash_attention_3' (requires flash-attn-3 support), 'eager' (standard PyTorch). Custom HuggingFace Kernels Hub paths (e.g. 'kernels-community/flash-attn2') are also supported.

effective_batch_size property

Effective batch size = batch_size * gradient_accumulation_steps.

This is the number of examples that contribute to each optimizer update (the "global" batch seen by the loss curve). The property is the canonical source for any caller that needs this product; it is used by preflight checks and logged by the training callbacks.
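The product, and one thing a caller might do with it, can be sketched as below. The function effective_batch_size mirrors the formula stated for the property; optimizer_steps is a hypothetical helper added for illustration, and rounding a partial final batch up to a full step is an assumption.

```python
import math


def effective_batch_size(batch_size: int, gradient_accumulation_steps: int) -> int:
    """Examples contributing to each optimizer update, per the documented formula."""
    return batch_size * gradient_accumulation_steps


def optimizer_steps(num_records: int, effective_batch: int) -> int:
    """Optimizer updates needed to see `num_records` examples (hypothetical helper)."""
    return math.ceil(num_records / effective_batch)


ebs = effective_batch_size(batch_size=4, gradient_accumulation_steps=8)
print(ebs)                          # 32
print(optimizer_steps(1_000, ebs))  # 32
```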