# Model training
We assume you have `/workspace` defined in your cluster config, that data and models will be downloaded to that folder, and that you have already followed dataset.md to prepare all the SFT data.
## Prepare base model
Download the base model. Here is an example for Qwen3-30B-A3B:
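A minimal sketch of one way to do this, assuming you pull the checkpoint from Hugging Face with `huggingface_hub`; the repo id and local path below are assumptions chosen to match the training script further down, so adjust them to your setup.

```python
from huggingface_hub import snapshot_download

# Assumption: download the Hugging Face checkpoint Qwen/Qwen3-30B-A3B into
# /workspace so that hf_model in the training script can point at it.
snapshot_download(
    repo_id="Qwen/Qwen3-30B-A3B",
    local_dir="/workspace/Qwen3-30B-A3B",
)
```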
## Run training
Run the training (assuming a Slurm configuration here with the same folder structure). If your cluster has a strict timeout policy, you can run multiple dependent jobs with `dependent_jobs=N`.

The following example shows the training script for Qwen3-30B-A3B. You can modify it for Qwen3-8B using the parallelism settings from the table below (a short sketch of those overrides follows the table).
```python
from nemo_skills.pipeline.cli import sft_nemo_rl, wrap_arguments

cluster = 'slurm'

# parallelism settings (see the table below for per-model / context-length values)
tp = 8
cp = 8
pp = 1
etp = 1
emp = 8

# schedule and batching
save_period = 600
max_steps = 7200
batch_size = 2048
num_training_jobs = 10
warmup = 0

partition = 'interactive'
backend = 'megatron'

# learning rate schedule
lr = 2e-4
min_lr = 2e-4

sft_nemo_rl(
    ctx=wrap_arguments(
        '++sft.max_num_epochs=2000 '
        f'++sft.max_num_steps={max_steps} '
        '++data.force_reprocess=false '
        '++data.num_workers=10 '
        f'++policy.megatron_cfg.tensor_model_parallel_size={tp} '
        f'++policy.megatron_cfg.context_parallel_size={cp} '
        f'++policy.megatron_cfg.expert_model_parallel_size={emp} '
        f'++policy.megatron_cfg.expert_tensor_parallel_size={etp} '
        f'++policy.megatron_cfg.pipeline_model_parallel_size={pp} '
        '++policy.sequence_parallel=True '
        '++policy.megatron_cfg.bias_activation_fusion=True '
        '++policy.megatron_cfg.apply_rope_fusion=True '
        f'++checkpointing.save_period={save_period} '
        f'++policy.train_global_batch_size={batch_size} '
        '++policy.max_total_sequence_length=131072 '
        f'++policy.megatron_cfg.optimizer.lr={lr} '
        '++policy.megatron_cfg.optimizer.bf16=True '
        f'++policy.megatron_cfg.optimizer.min_lr={min_lr} '
        f'++policy.megatron_cfg.scheduler.lr_warmup_iters={warmup} '
        f'++policy.megatron_cfg.scheduler.lr_decay_iters={max_steps} '
        '++policy.megatron_cfg.scheduler.lr_warmup_init=1e-7 '
        '++policy.megatron_cfg.scheduler.lr_decay_style=cosine '
        '++logger.swanlab_enabled=false '
        '++checkpointing.checkpoint_must_save_by=00:03:35:00 '
    ),
    cluster=cluster,
    wandb_project='sft-Qwen3-30B-A3B',
    expname='nemo-rl-sft-Qwen3-30B-A3B',
    backend=backend,
    output_dir='/workspace/final_sft_model',
    hf_model='/workspace/Qwen3-30B-A3B',
    training_data='/workspace/sft.jsonl',
    num_gpus=8,
    num_nodes=32,
    dependent_jobs=num_training_jobs,
)
```
## Training configuration by model and bucket length

TP, CP, PP, ETP, and EMP denote the tensor, context, pipeline, expert tensor, and expert model parallel sizes passed to `policy.megatron_cfg` in the script above.
| Model | Context length | TP | CP | PP | ETP | EMP |
|---|---|---|---|---|---|---|
| Qwen3-30B-A3B | 16k | 4 | 2 | 1 | 1 | 4 |
| Qwen3-30B-A3B | 32k | 4 | 4 | 1 | 1 | 8 |
| Qwen3-30B-A3B | 64k | 4 | 8 | 1 | 1 | 8 |
| Qwen3-30B-A3B | 128k | 4 | 8 | 1 | 1 | 8 |
| Qwen3-8B | 16k | 2 | 2 | 1 | - | - |
| Qwen3-8B | 32k | 2 | 4 | 1 | - | - |
| Qwen3-8B | 64k | 4 | 4 | 1 | - | - |
| Qwen3-8B | 128k | 8 | 8 | 1 | - | - |
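As a rough illustration, here is a minimal sketch of how the parallelism overrides from the table could be applied when adapting the script above to Qwen3-8B at 128k context. The values come from the last table row; Qwen3-8B is a dense model, so the expert-parallel (ETP/EMP) overrides are simply dropped. Names and paths here are placeholders, not part of the original recipe.

```python
from nemo_skills.pipeline.cli import wrap_arguments

# Parallelism values for Qwen3-8B at 128k context, taken from the table above.
# Dense model: no expert_model_parallel_size / expert_tensor_parallel_size overrides.
tp, cp, pp = 8, 8, 1

qwen3_8b_ctx = wrap_arguments(
    f'++policy.megatron_cfg.tensor_model_parallel_size={tp} '
    f'++policy.megatron_cfg.context_parallel_size={cp} '
    f'++policy.megatron_cfg.pipeline_model_parallel_size={pp} '
    '++policy.max_total_sequence_length=131072 '
    # keep the remaining overrides from the Qwen3-30B-A3B example unchanged
)
```

The rest of the `sft_nemo_rl` call stays the same, except that `hf_model` should point to the downloaded Qwen3-8B checkpoint and the experiment/project names should be updated.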