Model training

This guide assumes that /workspace is defined in your cluster config, that data and models are downloaded to that folder, and that you have already followed dataset.md to prepare all SFT data.

Prepare base model

Download the base model.

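Example commands for Qwen3-30B-A3B (run them from the folder mounted as /workspace so that the model lands at /workspace/Qwen3-30B-A3B, the path the training script uses):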

pip install -U "huggingface_hub[cli]"
hf download Qwen/Qwen3-30B-A3B --local-dir Qwen3-30B-A3B

Run training

Run the training (this assumes a Slurm configuration with the same folder structure). If your cluster has a strict timeout policy, you can chain multiple dependent jobs with dependent_jobs=N.
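A rough estimate can help when choosing how many jobs to chain. The sketch below is an illustration only: estimate_jobs_needed is a hypothetical helper, not part of nemo_skills, and the step time, job time limit, and per-job overhead are assumed numbers that you should replace with measurements from your own cluster. The result is the total number of jobs; how that maps onto dependent_jobs (whether the first job is counted) is not specified here, so check the pipeline documentation.

```python
import math


def estimate_jobs_needed(total_steps, seconds_per_step, job_time_limit_s, overhead_s=0.0):
    """Estimate how many back-to-back jobs are needed to finish `total_steps`.

    Hypothetical helper, not part of nemo_skills. `overhead_s` is the time per
    job spent outside of training steps (startup, checkpoint load and save).
    """
    usable_s = job_time_limit_s - overhead_s
    steps_per_job = int(usable_s // seconds_per_step)
    if steps_per_job < 1:
        raise ValueError("job time limit is too short to complete a single step")
    return math.ceil(total_steps / steps_per_job)


# Assumed numbers for illustration: 7200 steps at 20 s/step, 4-hour jobs,
# 25 minutes of overhead per job.
jobs = estimate_jobs_needed(
    total_steps=7200,
    seconds_per_step=20.0,
    job_time_limit_s=4 * 3600,
    overhead_s=25 * 60,
)
print(jobs)
```

Rounding up and leaving some headroom is the safer choice, since a job that hits the time limit before finishing its share of steps pushes the remainder onto the next one.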

The following example shows the training script for Qwen3-30B-A3B. You can adapt it for Qwen3-8B.

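from nemo_skills.pipeline.cli import sft_nemo_rl, wrap_arguments

cluster = 'slurm'

# Parallelism settings
tp = 8   # tensor model parallel size
cp = 8   # context parallel size
pp = 1   # pipeline model parallel size
etp = 1  # expert tensor parallel size
emp = 8  # expert model parallel size

# Training schedule and optimizer settings
save_period = 600
max_steps = 7200
batch_size = 2048
num_training_jobs = 10
warmup = 0
partition = 'interactive'
backend = 'megatron'
lr = 2e-4
min_lr = 2e-4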
sft_nemo_rl(
    ctx=wrap_arguments(
        '++sft.max_num_epochs=2000 '
        f'++sft.max_num_steps={max_steps} '
        '++data.force_reprocess=false '
        '++data.num_workers=10 '
        f'++policy.megatron_cfg.tensor_model_parallel_size={tp} '
        f'++policy.megatron_cfg.context_parallel_size={cp} '
        f'++policy.megatron_cfg.expert_model_parallel_size={emp} '
        f'++policy.megatron_cfg.expert_tensor_parallel_size={etp} '
        f'++policy.megatron_cfg.pipeline_model_parallel_size={pp} '
        f'++policy.sequence_parallel=True '
        f'++policy.megatron_cfg.bias_activation_fusion=True '
        f'++policy.megatron_cfg.apply_rope_fusion=True '
        f'++checkpointing.save_period={save_period} '
        f'++policy.train_global_batch_size={batch_size} '
        f'++policy.max_total_sequence_length=131072 '
        f'++policy.megatron_cfg.optimizer.lr={lr} '
        '++policy.megatron_cfg.optimizer.bf16=True '
        f'++policy.megatron_cfg.optimizer.min_lr={min_lr} '
        f'++policy.megatron_cfg.scheduler.lr_warmup_iters={warmup} '
        f'++policy.megatron_cfg.scheduler.lr_decay_iters={max_steps} '
        '++policy.megatron_cfg.scheduler.lr_warmup_init=1e-7 '
        '++policy.megatron_cfg.scheduler.lr_decay_style=cosine '
        '++logger.swanlab_enabled=false '
        '++checkpointing.checkpoint_must_save_by=00:03:35:00 '
    ),
    cluster=cluster,
    wandb_project='sft-Qwen3-30B-A3B',
    expname='nemo-rl-sft-Qwen3-30B-A3B',
    backend='megatron',
    output_dir='/workspace/final_sft_model',
    hf_model='/workspace/Qwen3-30B-A3B',
    training_data='/workspace/sft.jsonl',
    num_gpus=8,
    num_nodes=32,
    dependent_jobs=num_training_jobs,
)
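One consequence of the values above is worth spelling out: with lr equal to min_lr and no warmup, a cosine schedule has nothing to decay, so the learning rate stays at 2e-4 for every step. A minimal sketch of a generic linear-warmup plus cosine-decay schedule illustrates this. lr_at_step is a hypothetical helper written for this page, not the scheduler that NeMo-RL or Megatron actually runs, and the second call uses an assumed min_lr of 2e-5 purely for contrast.

```python
import math


def lr_at_step(step, max_lr, min_lr, warmup_iters, decay_iters, warmup_init=0.0):
    """Linear warmup to max_lr, then cosine decay down to min_lr.

    Hypothetical helper for illustration; not taken from NeMo-RL or Megatron.
    """
    if warmup_iters > 0 and step < warmup_iters:
        return warmup_init + (max_lr - warmup_init) * step / warmup_iters
    if step >= decay_iters:
        return min_lr
    progress = (step - warmup_iters) / (decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + coeff * (max_lr - min_lr)


# Values from the script above: lr == min_lr and warmup == 0.
script_lrs = [lr_at_step(s, 2e-4, 2e-4, 0, 7200, 1e-7) for s in (0, 3600, 7200)]
print(script_lrs)  # every entry equals 2e-4

# Assumed variant with a lower floor, to show what a real decay looks like.
decayed = [lr_at_step(s, 2e-4, 2e-5, 0, 7200) for s in (0, 3600, 7200)]
print(decayed)
```

If a decaying schedule is what you want, min_lr has to be set below lr; otherwise lr_decay_style and lr_decay_iters have no visible effect.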

Training configuration by model and bucket length

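| Model | Context length | TP | CP | PP | ETP | EMP |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3-30B-A3B | 16k | 4 | 2 | 1 | 1 | 4 |
| Qwen3-30B-A3B | 32k | 4 | 4 | 1 | 1 | 8 |
| Qwen3-30B-A3B | 64k | 4 | 8 | 1 | 1 | 8 |
| Qwen3-30B-A3B | 128k | 4 | 8 | 1 | 1 | 8 |
| Qwen3-8B | 16k | 2 | 2 | 1 | - | - |
| Qwen3-8B | 32k | 2 | 4 | 1 | - | - |
| Qwen3-8B | 64k | 4 | 4 | 1 | - | - |
| Qwen3-8B | 128k | 8 | 8 | 1 | - | - |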
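
When picking a row from this table, it can help to check that the settings fit your GPU count. The sketch below encodes the table as a Python dict and computes the data parallel size, assuming the usual Megatron-style decomposition in which the total GPU count equals TP x CP x PP x DP. PARALLELISM and data_parallel_size are hypothetical names introduced for this illustration, and the 8 GPUs on 32 nodes are taken from the training script above.

```python
# Parallelism settings copied from the table; None marks the columns shown
# as "-" for Qwen3-8B.
PARALLELISM = {
    ("Qwen3-30B-A3B", "16k"): dict(tp=4, cp=2, pp=1, etp=1, emp=4),
    ("Qwen3-30B-A3B", "32k"): dict(tp=4, cp=4, pp=1, etp=1, emp=8),
    ("Qwen3-30B-A3B", "64k"): dict(tp=4, cp=8, pp=1, etp=1, emp=8),
    ("Qwen3-30B-A3B", "128k"): dict(tp=4, cp=8, pp=1, etp=1, emp=8),
    ("Qwen3-8B", "16k"): dict(tp=2, cp=2, pp=1, etp=None, emp=None),
    ("Qwen3-8B", "32k"): dict(tp=2, cp=4, pp=1, etp=None, emp=None),
    ("Qwen3-8B", "64k"): dict(tp=4, cp=4, pp=1, etp=None, emp=None),
    ("Qwen3-8B", "128k"): dict(tp=8, cp=8, pp=1, etp=None, emp=None),
}


def data_parallel_size(num_gpus, num_nodes, tp, cp, pp):
    """Return DP, assuming world_size == tp * cp * pp * dp (hypothetical helper)."""
    world_size = num_gpus * num_nodes
    model_parallel = tp * cp * pp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by tp*cp*pp = {model_parallel}"
        )
    return world_size // model_parallel


cfg = PARALLELISM[("Qwen3-30B-A3B", "128k")]
dp = data_parallel_size(num_gpus=8, num_nodes=32, tp=cfg["tp"], cp=cfg["cp"], pp=cfg["pp"])
print(dp)

# The global batch size should split evenly across data parallel ranks.
assert 2048 % dp == 0
```

A ValueError here means the chosen row cannot be laid out on the requested number of GPUs, so either the node count or the parallelism settings need to change.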