Stage 1: Supervised Fine-Tuning (SFT)#
This stage fine-tunes the pretrained model for instruction following using Megatron-Bridge.
Open-Source Data Only: This recipe uses exclusively open-sourced SFT data from the Nemotron Post-training Datasets collection, which is a subset of the full data used to train the released model. The recipe includes datasets from Nemotron-Science-v1, Nemotron-Instruction-Following-Chat-v1, Nemotron-Math-Proofs-v1, Nemotron-SWE-v1, Nemotron-Agentic-v1, and Nemotron-Competitive-Programming-v1. Results will differ from the benchmarks in the tech report. Use this recipe as a reference implementation to apply the methodology with your own data.
Training Methodology#
Training Framework: SFT is implemented using Megatron-Bridge’s `finetune()` entry point, which loads a pretrained checkpoint and handles the training loop with role-based loss masking. See Training Entry Points for implementation details. For complete methodology, see Tech Report Section 3.1.
Data Preparation Pipeline#
Before training, chat conversations are transformed into training-ready sequences through several stages:
%%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333'}}}%%
flowchart LR
subgraph prep["Data Preparation"]
direction LR
chat["OpenAI Chat<br/>Format"] --> template["Chat<br/>Template"]
template --> chunks["Role-Labeled<br/>Chunks"]
chunks --> tok["Tokenization"]
tok --> mask["Loss Mask<br/>(role-based)"]
mask --> pack["Packing"]
pack --> roll["Mask Rolling"]
end
roll --> npy[".npy Output"]
style chat fill:#e3f2fd,stroke:#2196f3
style template fill:#e3f2fd,stroke:#2196f3
style chunks fill:#e3f2fd,stroke:#2196f3
style tok fill:#f3e5f5,stroke:#9c27b0
style mask fill:#f3e5f5,stroke:#9c27b0
style pack fill:#fff3e0,stroke:#ff9800
style roll fill:#fff3e0,stroke:#ff9800
style npy fill:#e8f5e9,stroke:#4caf50
| Stage | What Happens |
|---|---|
| OpenAI Chat Format | Input messages with `role` and `content` fields |
| Chat Template | Renders messages using the Nano3 Jinja template with special tokens |
| Role-Labeled Chunks | Splits rendered text back into chunks, each tagged with its source role |
| Tokenization | Converts text chunks to token IDs |
| Loss Mask | Builds the mask: 1 for assistant tokens, 0 for all others |
| Packing | Multiple sequences packed into fixed-length bins (4096 tokens) |
| Mask Rolling | Shifts mask by 1 position for next-token prediction alignment |
Multi-turn splitting: For conversations with reasoning content (reasoning_content field), the pipeline creates separate training sequences at each user turn. Reasoning from previous turns is dropped when a new user message appears—this matches inference behavior where users don’t see intermediate reasoning.
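The multi-turn splitting rule can be sketched as follows. This is illustrative only: `split_multi_turn` is a hypothetical helper, and the real logic lives in `data_prep.py`.

```python
from copy import deepcopy

def split_multi_turn(messages):
    """Emit one training sequence per assistant turn. reasoning_content from
    earlier turns is dropped once a new user message appears, matching
    inference, where users never see intermediate reasoning. (Sketch only.)"""
    sequences = []
    for i, msg in enumerate(messages):
        if msg["role"] != "assistant":
            continue
        seq = deepcopy(messages[: i + 1])
        for prev in seq[:-1]:  # every message before the final assistant turn
            prev.pop("reasoning_content", None)
        sequences.append(seq)
    return sequences

conv = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4", "reasoning_content": "2+2=4"},
    {"role": "user", "content": "And 3+3?"},
    {"role": "assistant", "content": "6", "reasoning_content": "3+3=6"},
]
seqs = split_multi_turn(conv)
# Two sequences; only the final assistant turn of each keeps its reasoning.
```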
For data preparation implementation, see Recipe Source:
src/nemotron/recipes/nano3/stage1_sft/data_prep.py
Loss Masking#
Loss masking determines which tokens contribute to the training loss. In SFT, we only want the model to learn to generate responses—not to predict prompts or system instructions.
Why mask non-assistant tokens?
The model should learn to respond, not to prompt. If we computed loss on user messages, the model would be optimized to predict “What is 2+2?” given prior context—which isn’t useful for an assistant. By masking user and system tokens (setting their loss weight to 0), gradients only flow from assistant responses, teaching the model what to generate without wasting capacity on predicting inputs.
| Role | Loss Mask | Training Signal |
|---|---|---|
| `system` | 0 | Ignored (instructions) |
| `user` | 0 | Ignored (prompts) |
| `assistant` | 1 | Learned (responses) |
Why roll the mask by 1?
In next-token prediction, the model predicts token[i+1] from tokens[0:i+1] (all tokens up to and including position i). The loss compares the prediction against the label, which is the input sequence shifted by one position:
Position: 0 1 2 3 4
Input: [A] [B] [C] [D] [E]
Label: [B] [C] [D] [E] [_] <- shifted by 1
If assistant content starts at position 2 ([C]), we want loss on predicting [C], [D], and [E]. But the label for position 2 is [D]—so we need to shift the mask to align with labels:
Original mask: [0] [0] [1] [1] [1] <- "assistant starts at C"
Rolled mask:   [0] [1] [1] [1] [0] <- aligns with labels C, D, E
The pipeline rolls the loss mask by 1 position so it correctly masks the predictions (labels) rather than the inputs.
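The roll can be sketched with NumPy. The final position is zeroed because the last token has no label to predict; this is an illustration, not the pipeline's exact code.

```python
import numpy as np

mask = np.array([0, 0, 1, 1, 1])  # token-aligned: assistant content starts at [C]
rolled = np.roll(mask, -1)        # align with labels = tokens shifted by one
rolled[-1] = 0                    # the last position has no next token to predict
print(rolled)                     # [0 1 1 1 0] -> loss on labels C, D, E
```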
Truncation behavior (`max_doc_tokens`):
- Default (null): No truncation; full sequences are preserved
- When set: Sequences exceeding the limit are truncated from the end, with the loss mask adjusted accordingly
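A sketch of the truncation rule, assuming token and mask lists. `truncate_doc` is a hypothetical helper; see `chat_sft_processor.py` for the real behavior.

```python
def truncate_doc(tokens, loss_mask, max_doc_tokens=None):
    """max_doc_tokens=None keeps the full sequence; otherwise truncate
    from the end and trim the loss mask to match."""
    if max_doc_tokens is None:
        return tokens, loss_mask
    return tokens[:max_doc_tokens], loss_mask[:max_doc_tokens]

tokens, mask = truncate_doc(list(range(10)), [0] * 4 + [1] * 6, max_doc_tokens=6)
# tokens -> [0, 1, 2, 3, 4, 5]; mask -> [0, 0, 0, 0, 1, 1]
```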
For implementation details, see
src/nemotron/data_prep/chat_sft_processor.py
Packed Sequences#
Why pack sequences?
Individual chat conversations vary in length—some are 50 tokens, others 3000. Without packing, each training sample would require padding to the maximum sequence length, wasting compute on empty tokens. Packing concatenates multiple conversations into a single fixed-length sequence (default 4096 tokens), maximizing GPU utilization.
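Packing can be sketched as a first-fit pass over sequences. This is illustrative only; the real builder in `packing/builder.py` may use a different bin-selection strategy.

```python
def pack_sequences(seqs, pack_size):
    """Concatenate token sequences into fixed-size bins, recording where
    each original sequence starts within its pack."""
    bins, starts = [], []
    cur_tokens, cur_starts = [], []
    for seq in seqs:
        # Flush the current bin when the next sequence would overflow it.
        if cur_tokens and len(cur_tokens) + len(seq) > pack_size:
            bins.append(cur_tokens)
            starts.append(cur_starts)
            cur_tokens, cur_starts = [], []
        cur_starts.append(len(cur_tokens))
        cur_tokens.extend(seq)
    if cur_tokens:
        bins.append(cur_tokens)
        starts.append(cur_starts)
    return bins, starts

bins, starts = pack_sequences([[1] * 3, [2] * 2, [3] * 4], pack_size=5)
# bins   -> [[1, 1, 1, 2, 2], [3, 3, 3, 3]]
# starts -> [[0, 3], [0]]
```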
The packed sequence format stores everything Megatron-Bridge needs for training:
| Field | Description |
|---|---|
| `tokens` | Concatenated token IDs from multiple conversations |
| `loss_mask` | Rolled mask indicating which positions contribute to loss (see Loss Masking) |
| `seq_start_id` | Boundary indices marking where each original conversation starts within the pack |
How seq_start_id works:
When multiple conversations are packed together, the model needs to know where one ends and another begins—otherwise attention could “leak” between unrelated conversations. The seq_start_id array marks these boundaries:
Pack: [Conv A tokens] [Conv B tokens] [Conv C tokens]
^ ^ ^
seq_start_id: [0, 128, 384]
Megatron-Bridge uses these boundaries for:
- Variable-length attention: Attention is masked so tokens from Conv A can’t attend to Conv B
- FlashAttention optimization: Boundaries map to the `cu_seqlens` parameter for efficient packed attention
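The boundary-to-`cu_seqlens` mapping can be sketched as cumulative sequence starts plus the total packed length as the final entry. The exact conversion inside Megatron-Bridge may differ; `pack_len` here is an assumed total pack length.

```python
import numpy as np

pack_len = 512                 # assumed total packed length for this example
seq_start_id = [0, 128, 384]   # conversation boundaries within the pack
cu_seqlens = np.array(seq_start_id + [pack_len], dtype=np.int32)
lengths = np.diff(cu_seqlens)  # per-conversation lengths: [128, 256, 128]
```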
For packing implementation, see
src/nemotron/data_prep/packing/builder.py
Chat Template#
Nemotron 3 Nano supports both reasoning and non-reasoning modes. The chat template handles:
- Multi-Step: Existing reasoning tokens are preserved for reuse in subsequent steps
- Multi-Turn: Reasoning from previous turns is dropped when a new user message is introduced
- Tool Calling: Uses XML-style special tags to reduce character escaping
SFT Data Domains#
| Domain | Description |
|---|---|
| Competition Math | Tool-integrated reasoning with GPT-OSS teachers |
| Competition Code | OpenCodeReasoning solutions with obfuscation/complication |
| InfinityByte | Cross-domain code synthesis at model capability boundaries |
| STEM Reasoning (RQA) | Reasoning Q&A from undergraduate/graduate STEM content |
| Conversational Tool Use | Multi-turn trajectories with simulated tool execution |
| Long Context | 128k mean token length, 256k hard limit |
| Formal Proofs | Lean theorem proving with 300k examples |
| Multilingual | French, Spanish, Italian, German, Japanese |
| Terminal Use | Terminal operations from Terminal Bench |
| General Chat | Multi-turn responses from LMSYS and WildChat |
| Instruction Following | Tulu 3 methodology with verifier filtering |
| Safety | Refusal behaviors from safety datasets |
| Software Engineering | GitHub issue resolution trajectories |
| Science | Physics, chemistry, biology via NeMo Data Designer |
For detailed data generation pipelines, see Tech Report Section 3.1.
Data Filtering#
The pipeline applies:
- Structural checks: Discard malformed examples
- Pathological repetition filtering: Remove repeated n-grams
- Consistency filtering: Judge-based action consistency verification
- Narrative filtering: Remove political/nationalistic narratives
Troubleshooting#
Common data preparation errors and solutions:
| Error | Cause | Solution |
|---|---|---|
| “# Tools missing” validation failure | Messages contain tool calls without a tools definition | Add a tools definition to the conversation |
| Empty sequences after processing | All tokens masked (no assistant content in conversation) | Verify input data contains assistant responses with actual content |
| Template rendering mismatch | Tokenizer BPE splits differ from template expectations | Ensure tokenizer model matches the one used during template creation |
| Sequences truncated excessively | Many conversations exceed `max_doc_tokens` | Consider increasing `max_doc_tokens` |
Debugging tips:
- Use `--sample 100` to test data preparation on a small subset
- Check `metadata.json` output for statistics on filtered/truncated sequences
- Review W&B artifacts for lineage tracking and validation metrics
Hyperparameters#
| Parameter | Value |
|---|---|
| Learning Rate | 1e-5 |
| Sequence Length | 4096 tokens (pack_size) |
| Loss Masking | Role-based (assistant tokens only) |
| Loss Normalization | Per-token (`calculate_per_token_loss=True`) |
| Optimizer | AdamW |
| Total Samples | 18M+ |
`calculate_per_token_loss` explained:
- True (default): Loss is normalized by the number of tokens with `loss_mask=1` across the batch. Each token contributes equally regardless of which sequence it belongs to.
- False: Loss is normalized by the number of sequences. Longer sequences (more assistant tokens) contribute more to the gradient.

Per-token normalization is preferred for SFT because it ensures a consistent learning signal regardless of conversation length.
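The two normalizations can be contrasted with a small NumPy sketch; the numbers are toy values, and this is not Megatron-Bridge's implementation.

```python
import numpy as np

# Per-position cross-entropy for a batch of two sequences, with the rolled
# role-based loss mask (padding positions already zeroed).
losses = np.array([[0.5, 1.0, 1.5, 0.0],
                   [2.0, 0.0, 0.0, 0.0]])
loss_mask = np.array([[1.0, 1.0, 1.0, 0.0],
                      [1.0, 0.0, 0.0, 0.0]])

# True (default): normalize by the count of unmasked tokens across the batch,
# so every assistant token carries the same weight.
per_token = (losses * loss_mask).sum() / loss_mask.sum()     # 5.0 / 4 = 1.25

# False: normalize by the number of sequences; the longer first sequence
# contributes its full summed loss (3.0) versus 2.0 for the shorter one.
per_sequence = (losses * loss_mask).sum() / losses.shape[0]  # 5.0 / 2 = 2.5
```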
Recipe Execution#
Quick Start#
# 1. Prepare data (apply chat templates, tokenize to .npy)
uv run nemotron nano3 data prep sft --run YOUR-CLUSTER

# 2. Run SFT
uv run nemotron nano3 sft --run YOUR-CLUSTER
Note: The `--run YOUR-CLUSTER` flag submits jobs via NeMo-Run. See Execution through NeMo-Run for setup.
Direct Script Execution (Megatron-Bridge)#
For direct execution outside this CLI, use the scripts in the Megatron-Bridge repository:
# Clone the repository and checkout the nano-v3 branch
git clone https://github.com/NVIDIA-NeMo/Megatron-Bridge.git
cd Megatron-Bridge
git checkout nano-v3
# Run fine-tuning (inside container on compute node)
python examples/recipes/nemotron_3/finetune_nemotron_3_nano.py \
--per-split-data-args-path /path/to/data_args.json \
--tokenizer-model /path/to/tokenizer.model
# With config file overrides
python examples/recipes/nemotron_3/finetune_nemotron_3_nano.py \
--config-file /path/to/overrides.yaml \
--per-split-data-args-path /path/to/data_args.json \
--tokenizer-model /path/to/tokenizer.model
See the Megatron-Bridge Nemotron 3 documentation for detailed configuration options.
Configuration#
| File | Purpose |
|---|---|
| | Production configuration |
| | Data preparation settings |
| | Dataset blend definition |
Data Preparation#
The data_prep.py script processes OpenAI-format chat data into packed sequences with role-based loss masking. See Data Preparation Module for detailed documentation.
CLI Command#
uv run nemotron nano3 data prep sft [options]
| Option | Description |
|---|---|
| `--run YOUR-CLUSTER` | Execute on Slurm via NeMo-Run |
| `--sample` | Limit rows per dataset (for testing) |
| | Force re-run, ignoring cache |
Output#
output/stage1_sft/
├── training.npy
├── validation.npy
├── test.npy
└── metadata.json
The output is registered as a W&B Artifact (DataBlendsArtifact-sft) for lineage tracking.
Training#
CLI Command#
uv run nemotron nano3 sft [options] [overrides...]
| Option | Description |
|---|---|
| | Attached: submits and waits, streaming logs (NeMo-Run) |
| | Detached: submits and exits immediately (NeMo-Run) |
| | Preview execution plan |
| | Override config values (CLI Framework) |
Override Examples#
# More training iterations
uv run nemotron nano3 sft train.train_iters=5000
# Different learning rate
uv run nemotron nano3 sft optimizer.lr=1e-5
# Load specific pretrained checkpoint
uv run nemotron nano3 sft checkpoint.load=/path/to/pretrain/checkpoint
Running with NeMo-Run#
Configure execution profiles in env.toml:
[wandb]
project = "nemotron"
entity = "YOUR-TEAM"
[YOUR-CLUSTER]
executor = "slurm"
account = "YOUR-ACCOUNT"
partition = "batch"
nodes = 2
ntasks_per_node = 8
gpus_per_node = 8
mounts = ["/lustre:/lustre"]
See Execution through NeMo-Run for complete configuration options.
Artifact Lineage#
%%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333'}}}%%
flowchart TB
prev["ModelArtifact-pretrain<br/>(from Stage 0)"] --> train
inst["Instruction Datasets<br/>(OpenAI chat format)"] --> dp["data_prep.py"]
dp --> data["DataBlendsArtifact-sft<br/>(packed .npy files)"]
data --> train["train.py"]
train --> model["ModelArtifact-sft<br/>(fine-tuned checkpoint)"]
model --> next["Stage 2: RL"]
style prev fill:#e1f5fe,stroke:#2196f3
style inst fill:#f3e5f5,stroke:#9c27b0
style dp fill:#f3e5f5,stroke:#9c27b0
style data fill:#f3e5f5,stroke:#9c27b0
style train fill:#f3e5f5,stroke:#9c27b0
style model fill:#f3e5f5,stroke:#9c27b0
style next fill:#e8f5e9,stroke:#4caf50
Infrastructure#
This stage uses the following components from the NVIDIA AI Stack:
| Component | Role |
|---|---|
| Megatron-Core | Distributed training primitives (TP, PP, DP, EP) |
| Megatron-Bridge | Fine-tuning loop, checkpoint loading, loss masking |
Key Features Used#
| Feature | Purpose |
|---|---|
| `finetune()` | SFT training with pre-loaded checkpoint |
| Role-based loss masking | Only compute loss on assistant tokens |
| Mixed precision (BF16) | Memory-efficient training |
| Gradient checkpointing | Reduce memory footprint |
Container#
nvcr.io/nvidia/nemo:25.11.nemotron_3_nano
Next Steps#
After SFT completes, proceed to Stage 2: RL for alignment training.
Reference#
Tech Report Section 3.1 — SFT methodology
NVIDIA AI Stack — Megatron-Core, Megatron-Bridge documentation
Artifact Lineage — W&B artifact system
Stage 0: Pretraining — Pretrain the base model
Recipe Source: `src/nemotron/recipes/nano3/stage1_sft/` — Implementation details