
training

Classes:

TrainingHyperparams: Hyperparameters that control the training process behavior.

TrainingHyperparams pydantic-model

Bases: Parameters

Hyperparameters that control the training process behavior.

This class contains all the fine-tuning hyperparameters that control how the model learns, including learning rates, batch sizes, LoRA configuration, and optimization settings. These parameters directly affect training performance and quality.

Fields:

num_input_records_to_sample pydantic-field

Number of records the model will see during training. This parameter is a proxy for training time. For example, if its value is the same size as the input dataset, this is like training for a single epoch. If its value is larger, this is like training for multiple (possibly fractional) epochs. If its value is smaller, this is like training for a fraction of an epoch. Supports 'auto' where a reasonable value is chosen based on other config params and data.
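The epoch arithmetic described above can be sketched directly. This is a hypothetical helper for illustration, not part of the library API:

```python
def effective_epochs(num_input_records_to_sample: int, dataset_size: int) -> float:
    """Fractional number of epochs implied by the number of sampled records.

    A value equal to the dataset size corresponds to one epoch; larger or
    smaller values correspond to more or less than one pass over the data.
    """
    return num_input_records_to_sample / dataset_size

# Sampling 10,000 records from a 4,000-record dataset is like 2.5 epochs.
print(effective_epochs(10_000, 4_000))
```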

batch_size pydantic-field

The batch size per device for training. Must be >= 1.

gradient_accumulation_steps pydantic-field

Number of update steps to accumulate the gradients for, before performing a backward/update pass. This technique increases the effective batch size that will fit into GPU memory. Must be >= 1.
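The interaction between batch_size and gradient_accumulation_steps can be summarized with a small sketch (illustrative names, assuming the standard definition of effective batch size):

```python
def effective_batch_size(batch_size: int,
                         gradient_accumulation_steps: int,
                         num_devices: int = 1) -> int:
    """Number of records contributing to each optimizer update.

    Gradients from several smaller forward/backward passes are summed
    before a single update, so memory usage stays at the per-device
    batch size while the optimizer sees a larger effective batch.
    """
    return batch_size * gradient_accumulation_steps * num_devices
```

For example, batch_size=4 with gradient_accumulation_steps=8 gives an effective batch of 32 records per update on a single device.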

weight_decay pydantic-field

The weight decay applied by the AdamW optimizer to all layers except bias and LayerNorm weights. Must be in (0, 1).

warmup_ratio pydantic-field

Ratio of total training steps used for a linear warmup from 0 to the learning rate. Must be > 0.
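A minimal sketch of how a warmup ratio translates into a step count (a hypothetical helper; the trainer's exact rounding behavior may differ):

```python
def warmup_step_count(total_training_steps: int, warmup_ratio: float) -> int:
    """Steps spent linearly ramping the learning rate from 0 to its configured value."""
    return int(total_training_steps * warmup_ratio)
```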

lr_scheduler pydantic-field

The scheduler type to use. See the HuggingFace documentation of SchedulerType for all possible values.

learning_rate pydantic-field

The initial learning rate for AdamW optimizer. Must be in (0, 1). Setting to 'auto' uses a model-specific default if one exists.

lora_r pydantic-field

The rank of the LoRA update matrices. Lower rank results in smaller update matrices with fewer trainable parameters. Must be > 0.

lora_alpha_over_r pydantic-field

The ratio of the LoRA scaling factor (alpha) to the LoRA rank. Empirically, this parameter works well when set to 0.5, 1, or 2. Must be in [0.5, 3].
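Because the LoRA update is conventionally scaled by alpha / r, this ratio is the effective scaling factor, and the implied alpha follows directly (hypothetical helper for illustration):

```python
def lora_alpha(lora_r: int, lora_alpha_over_r: float) -> float:
    """LoRA alpha implied by the rank and the configured alpha-over-r ratio.

    Since the low-rank update is scaled by alpha / r, the ratio itself is
    the effective multiplier applied to the LoRA update.
    """
    return lora_alpha_over_r * lora_r
```

For example, lora_r=16 with lora_alpha_over_r=2 implies alpha=32.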

lora_target_modules pydantic-field

The list of transformer modules to apply LoRA to. Possible modules: 'q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'.

use_unsloth pydantic-field

Whether to use Unsloth for optimized training.

rope_scaling_factor pydantic-field

Scale the base LLM's context length by this factor using RoPE scaling. Must be >= 1 or 'auto'.

validation_ratio pydantic-field

The fraction of the training data used for validation. Must be in [0, 1]. If set to 0, no validation will be performed. If set larger than 0, validation loss will be computed and reported throughout training.
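The split implied by validation_ratio can be sketched as follows (an illustrative helper; the library's actual splitting and rounding may differ):

```python
def train_validation_sizes(num_records: int,
                           validation_ratio: float) -> tuple[int, int]:
    """Split a record count into (train, validation) sizes.

    A ratio of 0 disables validation entirely, matching the documented
    behavior of validation_ratio.
    """
    num_validation = int(num_records * validation_ratio)
    return num_records - num_validation, num_validation
```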

validation_steps pydantic-field

The number of training steps between validation runs, passed through to the Hugging Face Trainer arguments. Must be > 0.

pretrained_model pydantic-field

Pretrained model to use for fine-tuning. Defaults to SmolLM3. May be a Hugging Face model ID (loaded from the Hugging Face Hub or cache) or a local path. See security note in docs before using untrusted sources.

quantize_model pydantic-field

Whether to quantize the model during training. This can reduce memory usage and potentially speed up training, but may also impact model accuracy.

quantization_bits pydantic-field

The number of bits to use for quantization if quantize_model is True. Accepts 8 or 4.

peft_implementation pydantic-field

The PEFT (Parameter-Efficient Fine-Tuning) implementation to use. Options: 'lora' for Low-Rank Adaptation, 'QLORA' for Quantized LoRA.

max_vram_fraction pydantic-field

The fraction of the total VRAM to use for training. Modify this to allow longer sequences. Must be in [0, 1].

attn_implementation pydantic-field

The attention implementation to use for model loading. Default uses Flash Attention 3 via the HuggingFace Kernels Hub (requires the 'kernels' pip package; falls back to 'sdpa' if the 'kernels' package is not installed). Other common values: 'flash_attention_2' (requires flash-attn pip package), 'sdpa' (PyTorch scaled dot product attention), 'eager' (standard PyTorch). Custom HuggingFace Kernels Hub paths (e.g. 'kernels-community/flash-attn2') are also supported.
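The documented fallback behavior can be sketched as below. This is a hypothetical helper, not the library's actual resolution logic; the default hub path shown is the example path from the description above:

```python
import importlib.util

def resolve_attn_implementation(requested: str = "kernels-community/flash-attn2") -> str:
    """Resolve an attention implementation name.

    Falls back to 'sdpa' when a HuggingFace Kernels Hub path is requested
    but the 'kernels' package is not installed; plain implementation names
    like 'sdpa', 'eager', or 'flash_attention_2' pass through unchanged.
    """
    is_hub_path = "/" in requested  # Kernels Hub entries look like 'org/name'
    if is_hub_path and importlib.util.find_spec("kernels") is None:
        return "sdpa"
    return requested
```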