training
Classes:
| Name | Description |
|---|---|
| QuantizationScheme | Quantization schemes supported when quantize_model=True. |
| TrainingHyperparams | Hyperparameters that control the training process behavior. |
QuantizationScheme
Bases: StrEnum
Quantization schemes supported when quantize_model=True.
Members are string values so they serialize cleanly through pydantic
and JSON configs. The enum also owns construction of the corresponding
transformers quantization_config object; optional ML dependencies stay
locally imported in that construction path.
Selection guide:
- bnb-4bit / bnb-8bit: bitsandbytes NF4 / int8. Widest hardware
support (Ampere+), works with QLoRA and LoftQ. Default for training.
- fp8: transformers FineGrainedFP8Config. Float8 with block-wise
scaling. Requires Hopper (sm_90+) or Blackwell. Inference-leaning.
- nvfp4: NVIDIA FP4 via torchao.prototype.mx_formats.NVFP4WeightOnlyConfig
wrapped in TorchAoConfig. Requires Blackwell (sm_100+). Weight-only.
- mxfp4: OCP Microscaling FP4 via transformers Mxfp4Config.
Hardware support varies by torch/torchao version.
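The hardware floors in the selection guide can be turned into a quick lookup. The sketch below is illustrative only: `candidate_schemes` is a hypothetical helper, not part of this module, and it assumes Ampere corresponds to compute capability sm_80. mxfp4 is left out because its support depends on the installed torch/torchao version rather than a fixed floor.

```python
def candidate_schemes(major: int, minor: int = 0) -> list[str]:
    """Return scheme names whose stated minimum compute capability is met.

    Hypothetical helper: the floors mirror the selection guide above
    (Ampere assumed to be sm_80, Hopper sm_90, Blackwell sm_100).
    """
    sm = major * 10 + minor
    minimum_sm = {
        "bnb-4bit": 80,
        "bnb-8bit": 80,
        "fp8": 90,
        "nvfp4": 100,
    }
    return [name for name, floor in minimum_sm.items() if sm >= floor]
```

On a machine with PyTorch and a CUDA device, the `(major, minor)` pair could come from `torch.cuda.get_device_capability()`.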
Methods:
| Name | Description |
|---|---|
| from_alias | Normalize string and legacy bit-count aliases to a scheme. |
| to_transformers_config | Build the transformers quantization config for this scheme. |
Attributes:
| Name | Type | Description |
|---|---|---|
| effective_bits | int | Per-parameter bit width for memory estimation. |
| is_bitsandbytes | bool | Whether the scheme is implemented via bitsandbytes (QLoRA-compatible). |
effective_bits
property
Per-parameter bit width for memory estimation.
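As a rough illustration of how a per-parameter bit width feeds a memory estimate, the sketch below multiplies parameter count by bit width and converts to bytes. `estimate_weight_bytes` is a hypothetical helper, and it deliberately ignores scaling factors, activations, optimizer state, and any other overhead that a real estimator would have to account for.

```python
def estimate_weight_bytes(num_params: int, effective_bits: int) -> int:
    """Approximate bytes needed to store the weights alone.

    Hypothetical helper: rounds up to whole bytes and ignores
    quantization metadata such as block-wise scales.
    """
    if num_params < 0 or effective_bits <= 0:
        raise ValueError("num_params must be >= 0 and effective_bits > 0")
    return (num_params * effective_bits + 7) // 8
```

Under these assumptions, a model with 3 billion parameters needs about 1.5 GB for weights at 4 bits per parameter, against about 6 GB at 16 bits.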
is_bitsandbytes
property
Whether the scheme is implemented via bitsandbytes (QLoRA-compatible).
from_alias(scheme)
classmethod
Normalize string and legacy bit-count aliases to a scheme.
Source code in src/nemo_safe_synthesizer/config/training.py
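The normalization that from_alias describes could look roughly like the sketch below. This is not the library's implementation: `SchemeSketch` and `normalize_scheme` are hypothetical stand-ins, the member names are invented for illustration, and the case-insensitive string handling is an assumption. Only the legacy mapping (4 to bnb-4bit, 8 to bnb-8bit) and the five scheme values come from this page.

```python
from enum import Enum


class SchemeSketch(str, Enum):
    """Hypothetical stand-in for QuantizationScheme (values from this page)."""

    BNB_4BIT = "bnb-4bit"
    BNB_8BIT = "bnb-8bit"
    FP8 = "fp8"
    NVFP4 = "nvfp4"
    MXFP4 = "mxfp4"


# Legacy bit-count aliases, as documented for quantization_bits.
_LEGACY_BITS = {4: SchemeSketch.BNB_4BIT, 8: SchemeSketch.BNB_8BIT}


def normalize_scheme(value):
    """Map an enum member, scheme string, or legacy bit count to a member."""
    if isinstance(value, SchemeSketch):
        return value
    if isinstance(value, int) and not isinstance(value, bool):
        if value not in _LEGACY_BITS:
            raise ValueError(f"unsupported bit count: {value}")
        return _LEGACY_BITS[value]
    if isinstance(value, str):
        # Assumption: tolerate surrounding whitespace and mixed case.
        return SchemeSketch(value.strip().lower())
    raise ValueError(f"cannot interpret {value!r} as a quantization scheme")
```

Because the members are strings, `normalize_scheme(4).value` gives `"bnb-4bit"`, which is the form that serializes into JSON configs.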
to_transformers_config()
Build the transformers quantization config for this scheme.
Source code in src/nemo_safe_synthesizer/config/training.py
TrainingHyperparams
pydantic-model
Bases: Parameters
Hyperparameters that control the training process behavior.
This class contains all the fine-tuning hyperparameters that control how the model learns, including learning rates, batch sizes, LoRA configuration, and optimization settings. These parameters directly affect training performance and quality.
Fields:
- num_input_records_to_sample (AutoIntParam)
- batch_size (int)
- gradient_accumulation_steps (int)
- weight_decay (float)
- warmup_ratio (float)
- lr_scheduler (str)
- learning_rate (AutoFloatParam)
- lora_r (int)
- lora_alpha_over_r (float)
- lora_target_modules (list[str])
- rope_scaling_factor (OptionalAutoInt)
- validation_ratio (float)
- validation_steps (int)
- pretrained_model (str)
- quantize_model (bool)
- quantization_bits (Literal[4, 8])
- quantization_scheme (QuantizationScheme | None)
- peft_implementation (str)
- max_vram_fraction (float)
- attn_implementation (str)
num_input_records_to_sample
pydantic-field
Number of records the model will see during training. This parameter is a proxy for training time. For example, if its value is the same size as the input dataset, this is like training for a single epoch. If its value is larger, this is like training for multiple (possibly fractional) epochs. If its value is smaller, this is like training for a fraction of an epoch. Supports 'auto' where a reasonable value is chosen based on other config params and data.
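The epoch analogy above is a simple ratio. The sketch below makes it concrete; `approx_epochs` is a hypothetical helper for illustration and is not how 'auto' is resolved.

```python
def approx_epochs(num_input_records_to_sample: int, dataset_size: int) -> float:
    """Express the sampled record count as an equivalent number of epochs."""
    if dataset_size <= 0:
        raise ValueError("dataset_size must be positive")
    return num_input_records_to_sample / dataset_size
```

For a 10,000-record dataset, sampling 25,000 records corresponds to roughly 2.5 epochs, and sampling 5,000 records to half an epoch.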
batch_size
pydantic-field
The batch size per device for training. Must be >= 1.
gradient_accumulation_steps
pydantic-field
Number of update steps to accumulate the gradients for, before performing a backward/update pass. This technique increases the effective batch size that will fit into GPU memory. Must be >= 1.
weight_decay
pydantic-field
The weight decay to apply to all layers except all bias and LayerNorm weights in the AdamW optimizer. Must be in (0, 1).
warmup_ratio
pydantic-field
Ratio of total training steps used for a linear warmup from 0 to the learning rate. Must be > 0.
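A minimal sketch of what a linear warmup implies, assuming the warmup length is the ratio times the total step count, rounded up. Both functions are hypothetical helpers; the actual schedule is produced by the configured scheduler, not by this code.

```python
import math


def warmup_steps(total_steps: int, warmup_ratio: float) -> int:
    """Number of steps spent warming up (assumption: rounded up)."""
    return math.ceil(total_steps * warmup_ratio)


def lr_during_warmup(step: int, base_lr: float, n_warmup: int) -> float:
    """Learning rate at `step`, rising linearly from 0 to base_lr."""
    if n_warmup <= 0:
        return base_lr
    return base_lr * min(1.0, step / n_warmup)
```

Halfway through the warmup window the learning rate is half of its configured value; once `step` reaches `n_warmup`, the scheduler selected by lr_scheduler takes over.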
lr_scheduler
pydantic-field
The scheduler type to use. See the HuggingFace documentation of SchedulerType for all possible values.
learning_rate
pydantic-field
The initial learning rate for AdamW optimizer. Must be in (0, 1). Setting to 'auto' uses a model-specific default if one exists.
lora_r
pydantic-field
The rank of the LoRA update matrices. Lower rank results in smaller update matrices with fewer trainable parameters. Must be > 0.
lora_alpha_over_r
pydantic-field
The ratio of the LoRA scaling factor (alpha) to the LoRA rank. Empirically, this parameter works well when set to 0.5, 1, or 2. Must be in [0.5, 3].
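Since this field stores a ratio rather than alpha itself, the alpha value follows from multiplying by the rank. A minimal sketch, with `lora_alpha` as a hypothetical helper that also applies the bounds stated for lora_r and lora_alpha_over_r:

```python
def lora_alpha(lora_r: int, lora_alpha_over_r: float) -> float:
    """Recover the LoRA scaling factor (alpha) from rank and ratio."""
    if lora_r <= 0:
        raise ValueError("lora_r must be > 0")
    if not 0.5 <= lora_alpha_over_r <= 3:
        raise ValueError("lora_alpha_over_r must be in [0.5, 3]")
    return lora_r * lora_alpha_over_r
```

With lora_r=32, ratios of 0.5, 1, and 2 give alpha values of 16, 32, and 64. Expressing alpha as a ratio keeps the scaling stable when the rank changes.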
lora_target_modules
pydantic-field
The list of transformer modules to apply LoRA to. Possible modules: 'q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'.
rope_scaling_factor
pydantic-field
Scale the base LLM's context length by this factor using RoPE scaling. Must be >= 1 or 'auto'.
validation_ratio
pydantic-field
The fraction of the training data used for validation. Must be in [0, 1]. If set to 0, no validation will be performed. If set larger than 0, validation loss will be computed and reported throughout training.
validation_steps
pydantic-field
The number of steps between validation checks for the HF Trainer arguments. Must be > 0.
pretrained_model
pydantic-field
Pretrained model to use for fine-tuning. Defaults to SmolLM3. May be a Hugging Face model ID (loaded from the Hugging Face Hub or cache) or a local path. See security note in docs before using untrusted sources.
quantize_model
pydantic-field
Whether to quantize the model during training. This can reduce memory usage and potentially speed up training, but may also impact model accuracy.
quantization_bits
pydantic-field
Deprecated: use quantization_scheme instead. Bit width for bitsandbytes quantization when quantization_scheme is not set (back-compat alias: 4 → bnb-4bit, 8 → bnb-8bit).
quantization_scheme
pydantic-field
Quantization scheme to use when quantize_model=True. Accepts bnb-4bit, bnb-8bit, fp8, nvfp4, or mxfp4. If unset, falls back to quantization_bits for backward compatibility. Non-bitsandbytes schemes are incompatible with peft_implementation='loftq'.
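The fallback and the LoftQ restriction described above can be sketched as a small resolution function. `resolve_scheme` is hypothetical and works on plain strings for brevity; the real model performs its own validation, and the "bnb-" prefix check is an assumption standing in for is_bitsandbytes.

```python
from typing import Optional


def resolve_scheme(
    quantize_model: bool,
    quantization_scheme: Optional[str],
    quantization_bits: int,
    peft_implementation: str,
) -> Optional[str]:
    """Pick the scheme name that would apply, or None if not quantizing."""
    if not quantize_model:
        return None
    scheme = quantization_scheme
    if scheme is None:
        # Back-compat alias: 4 -> bnb-4bit, 8 -> bnb-8bit.
        if quantization_bits not in (4, 8):
            raise ValueError("quantization_bits must be 4 or 8")
        scheme = f"bnb-{quantization_bits}bit"
    # Assumption: bitsandbytes schemes are exactly those prefixed "bnb-".
    if peft_implementation.lower() == "loftq" and not scheme.startswith("bnb-"):
        raise ValueError(f"{scheme} is incompatible with peft_implementation='loftq'")
    return scheme
```

Under these assumptions an explicit quantization_scheme always wins over quantization_bits, and combining fp8, nvfp4, or mxfp4 with 'loftq' is rejected up front.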
peft_implementation
pydantic-field
The PEFT (Parameter-Efficient Fine-Tuning) implementation to use. Options: 'lora' for Low-Rank Adaptation, 'QLORA' for Quantized LoRA.
max_vram_fraction
pydantic-field
The fraction of the total VRAM to use for training. Modify this to allow longer sequences. Must be in [0, 1].
attn_implementation
pydantic-field
The attention implementation to use for model loading. Default uses 'sdpa' (PyTorch scaled dot product attention) for broad compatibility. Other common values: 'flash_attention_2' (requires flash-attn pip package), 'flash_attention_3' (requires flash-attn-3 support), 'eager' (standard PyTorch). Custom HuggingFace Kernels Hub paths (e.g. 'kernels-community/flash-attn2') are also supported.
effective_batch_size
property
Effective batch size = batch_size * gradient_accumulation_steps.
This is the number of examples that contribute to each optimizer update (the "global" batch seen by the loss curve). Canonical source for any caller that needs this product -- used by preflight checks and logged by the training callbacks.
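A short worked example of the product, plus the optimizer-step count it implies. Both functions are hypothetical helpers; the step estimate assumes a single device and that training stops once num_input_records_to_sample records have been seen, which may differ from how the trainer actually schedules steps.

```python
import math


def effective_batch_size(batch_size: int, gradient_accumulation_steps: int) -> int:
    """Examples contributing to each optimizer update."""
    return batch_size * gradient_accumulation_steps


def approx_optimizer_steps(
    num_records: int, batch_size: int, gradient_accumulation_steps: int
) -> int:
    """Rough count of optimizer updates (assumption: single device)."""
    per_update = effective_batch_size(batch_size, gradient_accumulation_steps)
    return math.ceil(num_records / per_update)
```

With batch_size=4 and gradient_accumulation_steps=8, each update sees 32 examples, so sampling 10,000 records takes about 313 updates.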