training
¶
Classes:
| Name | Description |
|---|---|
| TrainingHyperparams | Hyperparameters that control the training process behavior. |
TrainingHyperparams
pydantic-model
¶
Bases: Parameters
Hyperparameters that control the training process behavior.
This class contains all the fine-tuning hyperparameters that control how the model learns, including learning rates, batch sizes, LoRA configuration, and optimization settings. These parameters directly affect training performance and quality.
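Since the class is a pydantic model, its constraints are enforced at construction time. The stdlib-only sketch below mimics that behavior for a handful of the documented fields; the defaults and the class name `TrainingHyperparamsSketch` are hypothetical, chosen only to illustrate how invalid values are rejected:

```python
from dataclasses import dataclass


@dataclass
class TrainingHyperparamsSketch:
    # Hypothetical defaults, for illustration only
    batch_size: int = 8
    gradient_accumulation_steps: int = 4
    lora_r: int = 16
    lora_alpha_over_r: float = 2.0
    validation_ratio: float = 0.05

    def __post_init__(self):
        # Mirror the constraints stated in the field docs
        if self.batch_size < 1:
            raise ValueError("batch_size must be >= 1")
        if self.gradient_accumulation_steps < 1:
            raise ValueError("gradient_accumulation_steps must be >= 1")
        if not 0.0 <= self.validation_ratio <= 1.0:
            raise ValueError("validation_ratio must be in [0, 1]")


hp = TrainingHyperparamsSketch(batch_size=4)
print(hp.batch_size)  # 4
```

The real model validates every field listed below; this sketch only shows the pattern.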
Fields:

- num_input_records_to_sample (AutoIntParam)
- batch_size (int)
- gradient_accumulation_steps (int)
- weight_decay (float)
- warmup_ratio (float)
- lr_scheduler (str)
- learning_rate (AutoFloatParam)
- lora_r (int)
- lora_alpha_over_r (float)
- lora_target_modules (list[str])
- use_unsloth (AutoBoolParam)
- rope_scaling_factor (OptionalAutoInt)
- validation_ratio (float)
- validation_steps (int)
- pretrained_model (str)
- quantize_model (bool)
- quantization_bits (Literal[4, 8])
- peft_implementation (str)
- max_vram_fraction (float)
- attn_implementation (str)
num_input_records_to_sample
pydantic-field
¶
Number of records the model will see during training. This parameter is a proxy for training time. For example, if its value is the same size as the input dataset, this is like training for a single epoch. If its value is larger, this is like training for multiple (possibly fractional) epochs. If its value is smaller, this is like training for a fraction of an epoch. Supports 'auto' where a reasonable value is chosen based on other config params and data.
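The epoch analogy can be made concrete. The helper below (illustrative, not part of the library) converts the sampling budget into an equivalent number of passes over the data:

```python
def effective_epochs(num_input_records_to_sample: int, dataset_size: int) -> float:
    """How many (possibly fractional) passes over the data the budget implies."""
    return num_input_records_to_sample / dataset_size


# With a 10,000-record input dataset:
print(effective_epochs(10_000, 10_000))  # 1.0 -> a single epoch
print(effective_epochs(25_000, 10_000))  # 2.5 -> multiple (fractional) epochs
print(effective_epochs(5_000, 10_000))   # 0.5 -> a fraction of an epoch
```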
batch_size
pydantic-field
¶
The batch size per device for training. Must be >= 1.
gradient_accumulation_steps
pydantic-field
¶
Number of update steps to accumulate the gradients for, before performing a backward/update pass. This technique increases the effective batch size that will fit into GPU memory. Must be >= 1.
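The resulting effective batch size is the product of the per-device batch size, the accumulation steps, and the device count. A quick illustrative calculation:

```python
def effective_batch_size(batch_size: int,
                         gradient_accumulation_steps: int,
                         num_devices: int = 1) -> int:
    """Records contributing to each optimizer update."""
    return batch_size * gradient_accumulation_steps * num_devices


# A per-device batch of 4 with 8 accumulation steps:
print(effective_batch_size(4, 8))  # 32 records per optimizer update
```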
weight_decay
pydantic-field
¶
The weight decay applied by the AdamW optimizer to all layers except bias and LayerNorm weights. Must be in (0, 1).
warmup_ratio
pydantic-field
¶
Ratio of total training steps used for a linear warmup from 0 to the learning rate. Must be > 0.
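For example, the number of warmup steps follows directly from the ratio and the total step count (a sketch; the trainer's exact rounding may differ):

```python
import math


def warmup_steps(warmup_ratio: float, total_training_steps: int) -> int:
    """Steps spent linearly ramping the learning rate from 0 to its target."""
    return math.ceil(warmup_ratio * total_training_steps)


# A 3% warmup over 1,000 total steps:
print(warmup_steps(0.03, 1_000))  # 30
```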
lr_scheduler
pydantic-field
¶
The scheduler type to use. See the HuggingFace documentation of SchedulerType for all possible values.
learning_rate
pydantic-field
¶
The initial learning rate for AdamW optimizer. Must be in (0, 1). Setting to 'auto' uses a model-specific default if one exists.
lora_r
pydantic-field
¶
The rank of the LoRA update matrices. Lower rank results in smaller update matrices with fewer trainable parameters. Must be > 0.
lora_alpha_over_r
pydantic-field
¶
The ratio of the LoRA scaling factor (alpha) to the LoRA rank. Empirically, this parameter works well when set to 0.5, 1, or 2. Must be in [0.5, 3].
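Because the LoRA update is conventionally scaled by alpha / r, this ratio is the scaling factor itself; the conventional alpha can be recovered from it. A small illustration:

```python
def lora_alpha(lora_r: int, lora_alpha_over_r: float) -> float:
    """Recover the conventional LoRA alpha from the ratio stored in the config."""
    return lora_alpha_over_r * lora_r


r = 16
alpha = lora_alpha(r, 2.0)
print(alpha)      # 32.0
print(alpha / r)  # 2.0 -- the update scaling, equal to lora_alpha_over_r
```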
lora_target_modules
pydantic-field
¶
The list of transformer modules to apply LoRA to. Possible modules: 'q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'.
use_unsloth
pydantic-field
¶
Whether to use Unsloth for optimized training.
rope_scaling_factor
pydantic-field
¶
Scale the base LLM's context length by this factor using RoPE scaling. Must be >= 1 or 'auto'.
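The effect on the usable context window is a straight multiplication (illustrative numbers; actual limits depend on the base model):

```python
def scaled_context_length(base_context_length: int, rope_scaling_factor: int) -> int:
    """Context window after applying RoPE scaling."""
    return base_context_length * rope_scaling_factor


# A base model with a 4,096-token window, scaled by a factor of 4:
print(scaled_context_length(4_096, 4))  # 16384
```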
validation_ratio
pydantic-field
¶
The fraction of the training data used for validation. Must be in [0, 1]. If set to 0, no validation will be performed. If set larger than 0, validation loss will be computed and reported throughout training.
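The implied split sizes can be sketched as follows (truncating rounding assumed for illustration; the library's exact split logic may differ):

```python
def split_sizes(num_records: int, validation_ratio: float) -> tuple[int, int]:
    """(train, validation) record counts implied by validation_ratio."""
    n_val = int(num_records * validation_ratio)
    return num_records - n_val, n_val


print(split_sizes(10_000, 0.05))  # (9500, 500)
print(split_sizes(10_000, 0.0))   # (10000, 0) -> validation disabled
```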
validation_steps
pydantic-field
¶
The number of steps between validation checks for the HF Trainer arguments. Must be > 0.
pretrained_model
pydantic-field
¶
Pretrained model to use for fine-tuning. Defaults to SmolLM3. May be a Hugging Face model ID (loaded from the Hugging Face Hub or cache) or a local path. See security note in docs before using untrusted sources.
quantize_model
pydantic-field
¶
Whether to quantize the model during training. This can reduce memory usage and potentially speed up training, but may also impact model accuracy.
quantization_bits
pydantic-field
¶
The number of bits to use for quantization if quantize_model is True. Accepts 8 or 4.
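To see why fewer bits reduce memory pressure, a back-of-the-envelope estimate for the weights alone (this ignores activations, gradients, optimizer state, and quantization overhead, so treat it as a lower bound):

```python
def approx_weight_memory_gib(num_params_billions: float, bits: int) -> float:
    """Rough GiB needed to hold the model weights at the given precision."""
    bytes_per_param = bits / 8
    return num_params_billions * 1e9 * bytes_per_param / 1024**3


# A 7B-parameter model:
print(round(approx_weight_memory_gib(7, 16), 1))  # 13.0 GiB at fp16/bf16
print(round(approx_weight_memory_gib(7, 4), 1))   # 3.3 GiB at 4-bit
```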
peft_implementation
pydantic-field
¶
The PEFT (Parameter-Efficient Fine-Tuning) implementation to use. Options: 'lora' for Low-Rank Adaptation, 'QLORA' for Quantized LoRA.
max_vram_fraction
pydantic-field
¶
The fraction of the total VRAM to use for training. Modify this to allow longer sequences. Must be in [0, 1].
attn_implementation
pydantic-field
¶
The attention implementation to use for model loading. Default uses Flash Attention 3 via the HuggingFace Kernels Hub (requires the 'kernels' pip package; falls back to 'sdpa' if the 'kernels' package is not installed). Other common values: 'flash_attention_2' (requires flash-attn pip package), 'sdpa' (PyTorch scaled dot product attention), 'eager' (standard PyTorch). Custom HuggingFace Kernels Hub paths (e.g. 'kernels-community/flash-attn2') are also supported.