Stage 1: Supervised Fine-Tuning (SFT)#

This stage fine-tunes the pretrained model for instruction following using Megatron-Bridge.

Open-Source Data Only: This recipe uses only open-source SFT data from the Nemotron Post-training Datasets collection, a subset of the full data used to train the released model. The recipe includes datasets from Nemotron-Science-v1, Nemotron-Instruction-Following-Chat-v1, Nemotron-Math-Proofs-v1, Nemotron-SWE-v1, Nemotron-Agentic-v1, and Nemotron-Competitive-Programming-v1. Results will therefore differ from the benchmarks in the tech report. Use this recipe as a reference implementation for applying the methodology to your own data.


Training Methodology#

Training Framework: SFT is implemented using Megatron-Bridge’s finetune() entry point, which loads a pretrained checkpoint and handles the training loop with role-based loss masking. See Training Entry Points for implementation details.

For complete methodology, see Tech Report Section 3.1.

Data Preparation Pipeline#

Before training, chat conversations are transformed into training-ready sequences through several stages:

        %%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333'}}}%%
flowchart LR
    subgraph prep["Data Preparation"]
        direction LR
        chat["OpenAI Chat<br/>Format"] --> template["Chat<br/>Template"]
        template --> chunks["Role-Labeled<br/>Chunks"]
        chunks --> tok["Tokenization"]
        tok --> mask["Loss Mask<br/>(role-based)"]
        mask --> pack["Packing"]
        pack --> roll["Mask Rolling"]
    end
    roll --> npy[".npy Output"]

    style chat fill:#e3f2fd,stroke:#2196f3
    style template fill:#e3f2fd,stroke:#2196f3
    style chunks fill:#e3f2fd,stroke:#2196f3
    style tok fill:#f3e5f5,stroke:#9c27b0
    style mask fill:#f3e5f5,stroke:#9c27b0
    style pack fill:#fff3e0,stroke:#ff9800
    style roll fill:#fff3e0,stroke:#ff9800
    style npy fill:#e8f5e9,stroke:#4caf50
    

| Stage | What Happens |
|---|---|
| OpenAI Chat Format | Input messages with role (system/user/assistant) and content fields |
| Chat Template | Renders messages using Nano3 Jinja template with special tokens (`<\|im_start\|>`, `<\|im_end\|>`) |
| Role-Labeled Chunks | Splits rendered text back into chunks, each tagged with its source role |
| Tokenization | Converts text chunks to token IDs |
| Loss Mask | Builds mask: 1 for assistant tokens, 0 for system/user tokens |
| Packing | Multiple sequences packed into fixed-length bins (4096 tokens) |
| Mask Rolling | Shifts mask by 1 position for next-token prediction alignment |
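
For reference, an input record in the OpenAI chat format might look like the following (content is illustrative; an optional reasoning_content field can appear on assistant turns, as described below):

```python
# Illustrative conversation in the OpenAI chat format consumed by the
# pipeline (fields per the table above; the content is made up).
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "2 + 2 = 4."},
]
```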

Multi-turn splitting: For conversations with reasoning content (reasoning_content field), the pipeline creates separate training sequences at each user turn. Reasoning from previous turns is dropped when a new user message appears—this matches inference behavior where users don’t see intermediate reasoning.
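
A rough sketch of this splitting behavior (illustrative code, not the pipeline's implementation; it emits one training sequence per assistant reply, i.e., per user turn):

```python
# Emit one training sequence per assistant turn; reasoning_content is
# kept only on the final assistant turn of each sequence, mirroring
# inference, where users never see earlier intermediate reasoning.
def split_multi_turn(messages):
    sequences = []
    for i, msg in enumerate(messages):
        if msg["role"] != "assistant":
            continue
        seq = []
        for j, prev in enumerate(messages[: i + 1]):
            prev = dict(prev)
            if j < i:
                prev.pop("reasoning_content", None)  # drop old reasoning
            seq.append(prev)
        sequences.append(seq)
    return sequences
```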

For data preparation implementation, see Recipe Source: src/nemotron/recipes/nano3/stage1_sft/data_prep.py

Loss Masking#

Loss masking determines which tokens contribute to the training loss. In SFT, we only want the model to learn to generate responses—not to predict prompts or system instructions.

Why mask non-assistant tokens?

The model should learn to respond, not to prompt. If we computed loss on user messages, the model would be optimized to predict “What is 2+2?” given prior context—which isn’t useful for an assistant. By masking user and system tokens (setting their loss weight to 0), gradients only flow from assistant responses, teaching the model what to generate without wasting capacity on predicting inputs.

| Role | Loss Mask | Training Signal |
|---|---|---|
| system | 0 | Ignored (instructions) |
| user | 0 | Ignored (prompts) |
| assistant | 1 | Learned (responses) |
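
A minimal sketch of this mask construction, assuming chunks is the list of (role, token_ids) pairs produced by the template and tokenization stages (names here are illustrative, not the pipeline's API):

```python
def build_loss_mask(chunks):
    """chunks: list of (role, token_ids) pairs."""
    input_ids, loss_mask = [], []
    for role, token_ids in chunks:
        weight = 1 if role == "assistant" else 0  # learn responses only
        input_ids.extend(token_ids)
        loss_mask.extend([weight] * len(token_ids))
    return input_ids, loss_mask
```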

Why roll the mask by 1?

In next-token prediction, the model predicts token[i+1] given tokens[0..i]. The loss compares each prediction against its label, which is the input sequence shifted left by one position:

Position:     0    1    2    3    4
Input:       [A]  [B]  [C]  [D]  [E]
Label:       [B]  [C]  [D]  [E]  [_]   <- shifted by 1

If assistant content starts at position 2 ([C]), we want loss on predicting [C], [D], and [E]. But the original mask marks input positions: applied to the labels, it would cover [D] and [E], miss [C], and spill onto the final position, which has no label. We shift the mask left by one so it lines up with the labels:

Original mask:  [0]  [0]  [1]  [1]  [1]   <- "assistant starts at C"
Rolled mask:    [0]  [1]  [1]  [1]  [0]   <- aligns with labels C, D, E

The pipeline rolls the loss mask by 1 position so it correctly masks the predictions (labels) rather than the inputs.
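
The shift itself is small; here is a minimal numpy sketch of the behavior described above (illustrative; the recipe's actual implementation lives in chat_sft_processor.py):

```python
import numpy as np

# Input-aligned mask from the diagram: assistant tokens C, D, E at
# positions 2-4.
input_mask = np.array([0, 0, 1, 1, 1])

# Shift left by one so mask[i] covers label[i] = input[i+1]; the last
# position gets 0 because the final input token has no label.
label_mask = np.roll(input_mask, -1)
label_mask[-1] = 0

print(label_mask)  # [0 1 1 1 0] -> loss on predicting C, D, E
```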

Truncation behavior (max_doc_tokens):

  • Default (null): No truncation—full sequences are preserved

  • When set: Sequences exceeding the limit are truncated from the end, with the loss mask adjusted accordingly (see the sketch below)
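
In sketch form (illustrative names, assuming truncation operates on the tokenized document):

```python
def truncate_doc(input_ids, loss_mask, max_doc_tokens=None):
    # Default (None): keep the full sequence.
    if max_doc_tokens is None:
        return input_ids, loss_mask
    # Truncate from the end; slicing the mask keeps it aligned.
    return input_ids[:max_doc_tokens], loss_mask[:max_doc_tokens]
```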

For implementation details, see src/nemotron/data_prep/chat_sft_processor.py

Packed Sequences#

Why pack sequences?

Individual chat conversations vary in length—some are 50 tokens, others 3000. Without packing, each training sample would require padding to the maximum sequence length, wasting compute on empty tokens. Packing concatenates multiple conversations into a single fixed-length sequence (default 4096 tokens), maximizing GPU utilization.
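
As a rough illustration of the idea (not the recipe's builder; this sketch assumes every sequence already fits in a bin and omits padding of the final gap):

```python
def pack_greedy(sequences, pack_size=4096):
    """sequences: list of token-id lists. Returns packed bins plus
    per-bin conversation offsets (see seq_start_id below)."""
    bins, boundaries = [], []
    current, starts = [], []
    for seq in sequences:
        if current and len(current) + len(seq) > pack_size:
            bins.append(current)        # close the full bin
            boundaries.append(starts)
            current, starts = [], []
        starts.append(len(current))     # this conversation's offset
        current.extend(seq)
    if current:
        bins.append(current)
        boundaries.append(starts)
    return bins, boundaries
```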

The packed sequence format stores everything Megatron-Bridge needs for training:

| Field | Description |
|---|---|
| input_ids | Concatenated token IDs from multiple conversations |
| loss_mask | Rolled mask indicating which positions contribute to loss (see Loss Masking) |
| seq_start_id | Boundary indices marking where each original conversation starts within the pack |

How seq_start_id works:

When multiple conversations are packed together, the model needs to know where one ends and another begins—otherwise attention could “leak” between unrelated conversations. The seq_start_id array marks these boundaries:

Pack: [Conv A tokens] [Conv B tokens] [Conv C tokens]
       ^              ^              ^
seq_start_id: [0,    128,           384]

Megatron-Bridge uses these boundaries for:

  • Variable-length attention: Attention is masked so tokens from Conv A can’t attend to Conv B

  • FlashAttention optimization: Boundaries map to the cu_seqlens parameter for efficient packed attention (see the sketch after this list)
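
For illustration, mapping boundaries to cu_seqlens amounts to appending the total length (a sketch that assumes the seq_start_id format shown above and glosses over padding):

```python
import torch

def to_cu_seqlens(seq_start_id, pack_size):
    # e.g. [0, 128, 384] with pack_size 4096 -> [0, 128, 384, 4096]:
    # cumulative offsets delimiting each packed sub-sequence.
    return torch.tensor(seq_start_id + [pack_size], dtype=torch.int32)
```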

For packing implementation, see src/nemotron/data_prep/packing/builder.py

Chat Template#

Nemotron 3 Nano supports both reasoning and non-reasoning modes. The chat template also defines the following behaviors (a schematic rendering follows the list):

  • Multi-Step: Existing reasoning tokens preserved for reuse in subsequent steps

  • Multi-Turn: Reasoning from previous turns dropped when user message introduced

  • Tool Calling: Uses XML-style special tags to reduce character escaping
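
For orientation only, a ChatML-style rendering with the special tokens mentioned earlier might look like the following; the actual Nano3 Jinja template adds structure for reasoning content and tool calls that is not shown here:

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is 2+2?<|im_end|>
<|im_start|>assistant
2 + 2 = 4.<|im_end|>
```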

SFT Data Domains#

| Domain | Description |
|---|---|
| Competition Math | Tool-integrated reasoning with GPT-OSS teachers |
| Competition Code | OpenCodeReasoning solutions with obfuscation/complication |
| InfinityByte | Cross-domain code synthesis at model capability boundaries |
| STEM Reasoning (RQA) | Reasoning Q&A from undergraduate/graduate STEM content |
| Conversational Tool Use | Multi-turn trajectories with simulated tool execution |
| Long Context | 128k mean token length, 256k hard limit |
| Formal Proofs | Lean theorem proving with 300k examples |
| Multilingual | French, Spanish, Italian, German, Japanese |
| Terminal Use | Terminal operations from Terminal Bench |
| General Chat | Multi-turn responses from LMSYS and WildChat |
| Instruction Following | Tulu 3 methodology with verifier filtering |
| Safety | Refusal behaviors from safety datasets |
| Software Engineering | GitHub issue resolution trajectories |
| Science | Physics, chemistry, biology via NeMo Data Designer |

For detailed data generation pipelines, see Tech Report Section 3.1.

Data Filtering#

The pipeline applies:

  • Structural checks: Discard malformed examples

  • Pathological repetition filtering: Remove examples dominated by repeated n-grams (a minimal detector is sketched after this list)

  • Consistency filtering: Judge-based action consistency verification

  • Narrative filtering: Remove political/nationalistic narratives
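
A minimal detector in the spirit of the repetition filter (the values of n and the threshold are illustrative assumptions, not the recipe's settings):

```python
def has_pathological_repetition(tokens, n=8, max_repeats=4):
    """Flag a document whose most frequent n-gram repeats too often."""
    counts = {}
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i : i + n])
        counts[gram] = counts.get(gram, 0) + 1
        if counts[gram] > max_repeats:
            return True
    return False
```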

Troubleshooting#

Common data preparation errors and solutions:

| Error | Cause | Solution |
|---|---|---|
| “# Tools missing” validation failure | Messages contain `<tool_call>` but system prompt lacks # Tools header | Add a # Tools section in the system prompt before tool definitions |
| Empty sequences after processing | All tokens masked (no assistant content in conversation) | Verify input data contains assistant responses with actual content |
| Template rendering mismatch | Tokenizer BPE splits differ from template expectations | Ensure tokenizer model matches the one used during template creation |
| Sequences truncated excessively | Many conversations exceed max_doc_tokens | Consider increasing max_doc_tokens or pack_size, or chunking long conversations |

Debugging tips:

  • Use --sample 100 to test data preparation on a small subset

  • Check metadata.json output for statistics on filtered/truncated sequences

  • Review W&B artifacts for lineage tracking and validation metrics

Hyperparameters#

| Parameter | Value |
|---|---|
| Learning Rate | 1e-5 |
| Sequence Length | 4096 tokens (pack_size) |
| Loss Masking | Role-based (assistant tokens only) |
| Loss Normalization | Per-token (calculate_per_token_loss: true) |
| Optimizer | AdamW |
| Total Samples | 18M+ |

calculate_per_token_loss explained:

  • True (default): Loss is normalized by the number of tokens with loss_mask=1 across the batch. Each token contributes equally regardless of which sequence it belongs to.

  • False: Loss is normalized by the number of sequences. Longer sequences (more assistant tokens) contribute more to the gradient.

Per-token normalization is preferred for SFT because it ensures consistent learning signal regardless of conversation length.
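
In sketch form, the two modes differ only in the denominator (a toy PyTorch illustration, not Megatron-Bridge's implementation):

```python
import torch

def normalized_loss(token_loss, loss_mask, per_token=True):
    # token_loss, loss_mask: [batch, seq_len]; mask is the rolled mask.
    masked = token_loss * loss_mask
    if per_token:
        # Every unmasked token counts equally across the whole batch.
        return masked.sum() / loss_mask.sum().clamp(min=1)
    # Per-sequence: divide by the number of sequences, so sequences with
    # more assistant tokens contribute more to the gradient.
    return masked.sum(dim=1).mean()
```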


Recipe Execution#

Quick Start#

# 1. Prepare data (apply chat templates, tokenize to .npy)
$ uv run nemotron nano3 data prep sft --run YOUR-CLUSTER

# 2. Run SFT
$ uv run nemotron nano3 sft --run YOUR-CLUSTER

Note: The --run YOUR-CLUSTER flag submits jobs via NeMo-Run. See Execution through NeMo-Run for setup.

Direct Script Execution (Megatron-Bridge)#

For direct execution outside this CLI, use the scripts in the Megatron-Bridge repository:

# Clone the repository and checkout the nano-v3 branch
git clone https://github.com/NVIDIA-NeMo/Megatron-Bridge.git
cd Megatron-Bridge
git checkout nano-v3

# Run fine-tuning (inside container on compute node)
python examples/recipes/nemotron_3/finetune_nemotron_3_nano.py \
    --per-split-data-args-path /path/to/data_args.json \
    --tokenizer-model /path/to/tokenizer.model

# With config file overrides
python examples/recipes/nemotron_3/finetune_nemotron_3_nano.py \
    --config-file /path/to/overrides.yaml \
    --per-split-data-args-path /path/to/data_args.json \
    --tokenizer-model /path/to/tokenizer.model

See the Megatron-Bridge Nemotron 3 documentation for detailed configuration options.

Configuration#

| File | Purpose |
|---|---|
| config/default.yaml | Production configuration |
| config/data_prep.yaml | Data preparation settings |
| config/data_blend_raw.json | Dataset blend definition |

Data Preparation#

The data_prep.py script processes OpenAI-format chat data into packed sequences with role-based loss masking. See Data Preparation Module for detailed documentation.

CLI Command#

uv run nemotron nano3 data prep sft [options]

| Option | Description |
|---|---|
| `--run <profile>` | Execute on Slurm via NeMo-Run |
| `--sample N` | Limit rows per dataset (for testing) |
| `--force` | Force re-run, ignoring cache |

Output#

output/stage1_sft/
├── training.npy
├── validation.npy
├── test.npy
└── metadata.json

The output is registered as a W&B Artifact (DataBlendsArtifact-sft) for lineage tracking.

Training#

CLI Command#

uv run nemotron nano3 sft [options] [overrides...]

| Option | Description |
|---|---|
| `--run <profile>` | Attached: submits and waits, streaming logs (NeMo-Run) |
| `--batch <profile>` | Detached: submits and exits immediately (NeMo-Run) |
| `--dry-run` | Preview execution plan |
| `key=value` | Override config values (CLI Framework) |

Override Examples#

# More training iterations
uv run nemotron nano3 sft train.train_iters=5000

# Different learning rate
uv run nemotron nano3 sft optimizer.lr=1e-5

# Load specific pretrained checkpoint
uv run nemotron nano3 sft checkpoint.load=/path/to/pretrain/checkpoint

Running with NeMo-Run#

Configure execution profiles in env.toml:

[wandb]
project = "nemotron"
entity = "YOUR-TEAM"

[YOUR-CLUSTER]
executor = "slurm"
account = "YOUR-ACCOUNT"
partition = "batch"
nodes = 2
ntasks_per_node = 8
gpus_per_node = 8
mounts = ["/lustre:/lustre"]

See Execution through NeMo-Run for complete configuration options.

Artifact Lineage#

        %%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333'}}}%%
flowchart TB
    prev["ModelArtifact-pretrain<br/>(from Stage 0)"] --> train
    inst["Instruction Datasets<br/>(OpenAI chat format)"] --> dp["data_prep.py"]
    dp --> data["DataBlendsArtifact-sft<br/>(packed .npy files)"]
    data --> train["train.py"]
    train --> model["ModelArtifact-sft<br/>(fine-tuned checkpoint)"]
    model --> next["Stage 2: RL"]

    style prev fill:#e1f5fe,stroke:#2196f3
    style inst fill:#f3e5f5,stroke:#9c27b0
    style dp fill:#f3e5f5,stroke:#9c27b0
    style data fill:#f3e5f5,stroke:#9c27b0
    style train fill:#f3e5f5,stroke:#9c27b0
    style model fill:#f3e5f5,stroke:#9c27b0
    style next fill:#e8f5e9,stroke:#4caf50
    

Infrastructure#

This stage uses the following components from the NVIDIA AI Stack:

| Component | Role | Documentation |
|---|---|---|
| Megatron-Core | Distributed training primitives (TP, PP, DP, EP) | GitHub |
| Megatron-Bridge | Fine-tuning loop, checkpoint loading, loss masking | Docs |

Key Features Used#

| Feature | Purpose |
|---|---|
| finetune() entry point | SFT training with pre-loaded checkpoint |
| Role-based loss masking | Only compute loss on assistant tokens |
| Mixed precision (BF16) | Memory-efficient training |
| Gradient checkpointing | Reduce memory footprint |

Container#

nvcr.io/nvidia/nemo:25.11.nemotron_3_nano

Next Steps#

After SFT completes, proceed to Stage 2: RL for alignment training.

Reference#