training
¶
Classes:
| Name | Description |
|---|---|
| TrainingHyperparams | Hyperparameters that control the training process behavior. |
TrainingHyperparams
pydantic-model
¶
Bases: Parameters
Hyperparameters that control the training process behavior.
This class contains all the fine-tuning hyperparameters that control how the model learns, including learning rates, batch sizes, LoRA configuration, and optimization settings. These parameters directly affect training performance and quality.
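Since the class is a pydantic model, its constraints are enforced at construction time. The stdlib-only sketch below mimics that behavior for a handful of the documented fields; the defaults and the class name `TrainingHyperparamsSketch` are hypothetical, chosen only to illustrate how invalid values are rejected:

```python
from dataclasses import dataclass


@dataclass
class TrainingHyperparamsSketch:
    # Hypothetical defaults, for illustration only
    batch_size: int = 8
    gradient_accumulation_steps: int = 4
    lora_r: int = 16
    lora_alpha_over_r: float = 2.0
    validation_ratio: float = 0.05

    def __post_init__(self):
        # Mirror the constraints stated in the field docs
        if self.batch_size < 1:
            raise ValueError("batch_size must be >= 1")
        if self.gradient_accumulation_steps < 1:
            raise ValueError("gradient_accumulation_steps must be >= 1")
        if not 0.0 <= self.validation_ratio <= 1.0:
            raise ValueError("validation_ratio must be in [0, 1]")


hp = TrainingHyperparamsSketch(batch_size=4)
print(hp.batch_size)  # 4
```

The real model validates every field listed below; this sketch only shows the pattern.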
Fields:

- num_input_records_to_sample (AutoIntParam)
- batch_size (int)
- gradient_accumulation_steps (int)
- weight_decay (float)
- warmup_ratio (float)
- lr_scheduler (str)
- learning_rate (AutoFloatParam)
- lora_r (int)
- lora_alpha_over_r (float)
- lora_target_modules (list[str])
- use_unsloth (AutoBoolParam)
- rope_scaling_factor (OptionalAutoInt)
- validation_ratio (float)
- validation_steps (int)
- pretrained_model (str)
- quantize_model (bool)
- quantization_bits (Literal[4, 8])
- peft_implementation (str)
- max_vram_fraction (float)
- attn_implementation (str)
num_input_records_to_sample
pydantic-field
¶
Number of records the model will see during training. This parameter is a proxy for training time. For example, if its value is the same size as the input dataset, this is like training for a single epoch. If its value is larger, this is like training for multiple (possibly fractional) epochs. If its value is smaller, this is like training for a fraction of an epoch. Supports 'auto' where a reasonable value is chosen based on other config params and data.
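The epoch analogy can be made concrete. The helper below (illustrative, not part of the library) converts the sampling budget into an equivalent number of passes over the data:

```python
def effective_epochs(num_input_records_to_sample: int, dataset_size: int) -> float:
    """How many (possibly fractional) passes over the data the budget implies."""
    return num_input_records_to_sample / dataset_size


# With a 10,000-record input dataset:
print(effective_epochs(10_000, 10_000))  # 1.0 -> a single epoch
print(effective_epochs(25_000, 10_000))  # 2.5 -> multiple (fractional) epochs
print(effective_epochs(5_000, 10_000))   # 0.5 -> a fraction of an epoch
```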
batch_size
pydantic-field
¶
The batch size per device for training. Must be >= 1.
gradient_accumulation_steps
pydantic-field
¶
Number of update steps to accumulate the gradients for, before performing a backward/update pass. This technique increases the effective batch size that will fit into GPU memory. Must be >= 1.
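The resulting effective batch size is the product of the per-device batch size, the accumulation steps, and the device count. A quick illustrative calculation:

```python
def effective_batch_size(batch_size: int,
                         gradient_accumulation_steps: int,
                         num_devices: int = 1) -> int:
    """Records contributing to each optimizer update."""
    return batch_size * gradient_accumulation_steps * num_devices


# A per-device batch of 4 with 8 accumulation steps:
print(effective_batch_size(4, 8))  # 32 records per optimizer update
```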
weight_decay
pydantic-field
¶
The weight decay applied by the AdamW optimizer to all layers except bias and LayerNorm weights. Must be in (0, 1).
warmup_ratio
pydantic-field
¶
Ratio of total training steps used for a linear warmup from 0 to the learning rate. Must be > 0.
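For example, the number of warmup steps follows directly from the ratio and the total step count (a sketch; the trainer's exact rounding may differ):

```python
import math


def warmup_steps(warmup_ratio: float, total_training_steps: int) -> int:
    """Steps spent linearly ramping the learning rate from 0 to its target."""
    return math.ceil(warmup_ratio * total_training_steps)


# A 3% warmup over 1,000 total steps:
print(warmup_steps(0.03, 1_000))  # 30
```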
lr_scheduler
pydantic-field
¶
The scheduler type to use. See the HuggingFace documentation of SchedulerType for all possible values.
learning_rate
pydantic-field
¶
The initial learning rate for AdamW optimizer. Must be in (0, 1). Setting to 'auto' uses a model-specific default if one exists.
lora_r
pydantic-field
¶
The rank of the LoRA update matrices. Lower rank results in smaller update matrices with fewer trainable parameters. Must be > 0.
lora_alpha_over_r
pydantic-field
¶
The ratio of the LoRA scaling factor (alpha) to the LoRA rank. Empirically, this parameter works well when set to 0.5, 1, or 2. Must be in [0.5, 3].
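Because the LoRA update is conventionally scaled by alpha / r, this ratio is the scaling factor itself; the conventional alpha can be recovered from it. A small illustration:

```python
def lora_alpha(lora_r: int, lora_alpha_over_r: float) -> float:
    """Recover the conventional LoRA alpha from the ratio stored in the config."""
    return lora_alpha_over_r * lora_r


r = 16
alpha = lora_alpha(r, 2.0)
print(alpha)      # 32.0
print(alpha / r)  # 2.0 -- the update scaling, equal to lora_alpha_over_r
```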
lora_target_modules
pydantic-field
¶
The list of transformer modules to apply LoRA to. Possible modules: 'q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'.
use_unsloth
pydantic-field
¶
Whether to use Unsloth for optimized training.
rope_scaling_factor
pydantic-field
¶
Scale the base LLM's context length by this factor using RoPE scaling. Must be >= 1 or 'auto'.
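The effect on the usable context window is a straight multiplication (illustrative numbers; actual limits depend on the base model):

```python
def scaled_context_length(base_context_length: int, rope_scaling_factor: int) -> int:
    """Context window after applying RoPE scaling."""
    return base_context_length * rope_scaling_factor


# A base model with a 4,096-token window, scaled by a factor of 4:
print(scaled_context_length(4_096, 4))  # 16384
```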
validation_ratio
pydantic-field
¶
The fraction of the training data used for validation. Must be in [0, 1]. If set to 0, no validation will be performed. If set larger than 0, validation loss will be computed and reported throughout training.
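The implied split sizes can be sketched as follows (truncating rounding assumed for illustration; the library's exact split logic may differ):

```python
def split_sizes(num_records: int, validation_ratio: float) -> tuple[int, int]:
    """(train, validation) record counts implied by validation_ratio."""
    n_val = int(num_records * validation_ratio)
    return num_records - n_val, n_val


print(split_sizes(10_000, 0.05))  # (9500, 500)
print(split_sizes(10_000, 0.0))   # (10000, 0) -> validation disabled
```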
validation_steps
pydantic-field
¶
The number of steps between validation checks for the HF Trainer arguments. Must be > 0.
pretrained_model
pydantic-field
¶
Pretrained model to use for fine-tuning. Defaults to SmolLM3. May be a Hugging Face model ID (loaded from the Hugging Face Hub or cache) or a local path. See security note in docs before using untrusted sources.
quantize_model
pydantic-field
¶
Whether to quantize the model during training. This can reduce memory usage and potentially speed up training, but may also impact model accuracy.
quantization_bits
pydantic-field
¶
The number of bits to use for quantization if quantize_model is True. Accepts 8 or 4.
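To see why fewer bits reduce memory pressure, a back-of-the-envelope estimate for the weights alone (this ignores activations, gradients, optimizer state, and quantization overhead, so treat it as a lower bound):

```python
def approx_weight_memory_gib(num_params_billions: float, bits: int) -> float:
    """Rough GiB needed to hold the model weights at the given precision."""
    bytes_per_param = bits / 8
    return num_params_billions * 1e9 * bytes_per_param / 1024**3


# A 7B-parameter model:
print(round(approx_weight_memory_gib(7, 16), 1))  # 13.0 GiB at fp16/bf16
print(round(approx_weight_memory_gib(7, 4), 1))   # 3.3 GiB at 4-bit
```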
peft_implementation
pydantic-field
¶
The PEFT (Parameter-Efficient Fine-Tuning) implementation to use. Options: 'lora' for Low-Rank Adaptation, 'QLORA' for Quantized LoRA.
max_vram_fraction
pydantic-field
¶
The fraction of the total VRAM to use for training. Modify this to allow longer sequences. Must be in [0, 1].
attn_implementation
pydantic-field
¶
The attention implementation to use for model loading. Default uses Flash Attention 3 via the HuggingFace Kernels Hub (requires the 'kernels' pip package; falls back to 'sdpa' if the 'kernels' package is not installed). Other common values: 'flash_attention_2' (requires flash-attn pip package), 'sdpa' (PyTorch scaled dot product attention), 'eager' (standard PyTorch). Custom HuggingFace Kernels Hub paths (e.g. 'kernels-community/flash-attn2') are also supported.