Stage 1: Supervised Fine-Tuning (SFT)#

This stage fine-tunes the pretrained model for instruction following using Megatron-Bridge.

Open-Source Data Only: This recipe uses exclusively open-sourced SFT data from the Nemotron Post-training Datasets collection, which is a subset of the full data used to train the released model. The recipe includes datasets from Nemotron-Science-v1, Nemotron-Instruction-Following-Chat-v1, Nemotron-Math-Proofs-v1, Nemotron-SWE-v1, Nemotron-Agentic-v1, and Nemotron-Competitive-Programming-v1. Results will differ from the benchmarks in the tech report. Use this recipe as a reference implementation to apply the methodology with your own data.

Quick Start#

# 1. Prepare data (apply chat templates, tokenize to .npy)
uv run nemotron nano3 data prep sft --run YOUR-CLUSTER

# 2. Run SFT
uv run nemotron nano3 sft --run YOUR-CLUSTER

Note: The --run YOUR-CLUSTER flag submits jobs via NeMo-Run. See Execution through NeMo-Run for setup.

Direct Script Execution#

Inside a container on a compute node:

# Data preparation
uv run python data_prep.py --config config/data_prep.yaml

# Training (single node)
uv run python train.py --config config/default.yaml

# Training (distributed)
uv run torchrun --nproc_per_node=8 train.py --config config/default.yaml

Configuration#

| File | Purpose |
|---|---|
| `config/default.yaml` | Production configuration |
| `config/data_prep.yaml` | Data preparation settings |
| `config/data_blend_raw.json` | Dataset blend definition |

Data Preparation#

The data_prep.py script processes OpenAI-format chat data into packed sequences with role-based loss masking. See Data Preparation Module for detailed documentation.
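
Conceptually, each chat example is tokenized and paired with a role-based loss mask so that only assistant tokens contribute to the loss. The sketch below illustrates the idea; the field names follow the OpenAI chat format, but the function names and the whitespace "tokenizer" stand-in are placeholders, not the actual data_prep.py implementation.

# Illustrative sketch of role-based loss masking, not the exact data_prep.py code.
import numpy as np

def build_example(messages, encode):
    """Tokenize an OpenAI-format chat and mark only assistant tokens as loss targets."""
    input_ids, loss_mask = [], []
    for msg in messages:
        ids = encode(msg["content"])
        input_ids.extend(ids)
        # 1 = loss computed on this token (assistant), 0 = masked (system/user/tool)
        loss_mask.extend([1 if msg["role"] == "assistant" else 0] * len(ids))
    return np.asarray(input_ids, dtype=np.int32), np.asarray(loss_mask, dtype=np.int8)

# Toy usage with a whitespace "tokenizer" stand-in.
ids, mask = build_example(
    [
        {"role": "user", "content": "What is 2 + 2 ?"},
        {"role": "assistant", "content": "2 + 2 = 4"},
    ],
    encode=lambda text: [abs(hash(tok)) % 50000 for tok in text.split()],
)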

CLI Command#

uv run nemotron nano3 data prep sft [options]

| Option | Description |
|---|---|
| `--run <profile>` | Execute on Slurm via NeMo-Run |
| `--sample N` | Limit rows per dataset (for testing) |
| `--force` | Force re-run, ignoring cache |

Output#

output/stage1_sft/
├── training.npy
├── validation.npy
├── test.npy
└── metadata.json

The output is registered as a W&B Artifact (DataBlendsArtifact-sft) for lineage tracking.
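
Registration uses the standard W&B artifact API. A minimal sketch follows, assuming the artifact name and output directory shown above; the exact call in data_prep.py may differ.

# Sketch of W&B artifact registration for lineage tracking (public wandb API).
import wandb

run = wandb.init(project="nemotron", job_type="data_prep")
artifact = wandb.Artifact("DataBlendsArtifact-sft", type="dataset")
artifact.add_dir("output/stage1_sft")  # training.npy, validation.npy, test.npy, metadata.json
run.log_artifact(artifact)
run.finish()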

Training#

CLI Command#

uv run nemotron nano3 sft [options] [overrides...]

| Option | Description |
|---|---|
| `--run <profile>` | Attached mode: submits and waits, streaming logs (NeMo-Run) |
| `--batch <profile>` | Detached mode: submits and exits immediately (NeMo-Run) |
| `--dry-run` | Preview the execution plan |
| `key=value` | Override config values (CLI Framework) |

Override Examples#

# More training iterations
uv run nemotron nano3 sft train.train_iters=5000

# Different learning rate
uv run nemotron nano3 sft optimizer.lr=1e-5

# Load specific pretrained checkpoint
uv run nemotron nano3 sft checkpoint.load=/path/to/pretrain/checkpoint

Running with NeMo-Run#

Configure execution profiles in env.toml:

[wandb]
project = "nemotron"
entity = "YOUR-TEAM"

[YOUR-CLUSTER]
executor = "slurm"
account = "YOUR-ACCOUNT"
partition = "batch"
nodes = 2
ntasks_per_node = 8
gpus_per_node = 8
mounts = ["/lustre:/lustre"]

See Execution through NeMo-Run for complete configuration options.

Artifact Lineage#

%%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333'}}}%%
flowchart TB
    prev["ModelArtifact-pretrain<br/>(from Stage 0)"] --> train
    inst["Instruction Datasets<br/>(OpenAI chat format)"] --> dp["data_prep.py"]
    dp --> data["DataBlendsArtifact-sft<br/>(packed .npy files)"]
    data --> train["train.py"]
    train --> model["ModelArtifact-sft<br/>(fine-tuned checkpoint)"]
    model --> next["Stage 2: RL"]

    style prev fill:#e1f5fe,stroke:#2196f3
    style inst fill:#f3e5f5,stroke:#9c27b0
    style dp fill:#f3e5f5,stroke:#9c27b0
    style data fill:#f3e5f5,stroke:#9c27b0
    style train fill:#f3e5f5,stroke:#9c27b0
    style model fill:#f3e5f5,stroke:#9c27b0
    style next fill:#e8f5e9,stroke:#4caf50
    

Methodology#

For complete methodology, see Tech Report Section 3.1.

Chat Template#

Nemotron 3 Nano supports both reasoning and non-reasoning modes:

  • Multi-Step: Existing reasoning tokens are preserved for reuse in subsequent steps

  • Multi-Turn: Reasoning from previous turns is dropped when a new user message is introduced (illustrated below)

  • Tool Calling: Uses XML-style special tags to reduce character escaping
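
The multi-turn rule can be pictured with a small helper that strips reasoning from assistant turns that precede the latest user message. This is only an illustration: the real behavior is defined by the model's chat template, and the reasoning_content field name here is an assumption.

# Illustrative multi-turn pruning; the actual chat template and its special
# tags are defined by the model's tokenizer. "reasoning_content" is an assumed
# field name for the sake of the example.
def drop_stale_reasoning(messages):
    """Keep reasoning only on assistant turns after the latest user message."""
    last_user = max(
        (i for i, m in enumerate(messages) if m["role"] == "user"), default=-1
    )
    pruned = []
    for i, msg in enumerate(messages):
        msg = dict(msg)
        if msg["role"] == "assistant" and i < last_user:
            msg.pop("reasoning_content", None)  # reasoning from earlier turns is dropped
        pruned.append(msg)
    return pruned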

SFT Data Domains#

| Domain | Description |
|---|---|
| Competition Math | Tool-integrated reasoning with GPT-OSS teachers |
| Competition Code | OpenCodeReasoning solutions with obfuscation/complication |
| InfinityByte | Cross-domain code synthesis at model capability boundaries |
| STEM Reasoning (RQA) | Reasoning Q&A from undergraduate/graduate STEM content |
| Conversational Tool Use | Multi-turn trajectories with simulated tool execution |
| Long Context | 128k mean token length, 256k hard limit |
| Formal Proofs | Lean theorem proving with 300k examples |
| Multilingual | French, Spanish, Italian, German, Japanese |
| Terminal Use | Terminal operations from Terminal Bench |
| General Chat | Multi-turn responses from LMSYS and WildChat |
| Instruction Following | Tülu 3 methodology with verifier filtering |
| Safety | Refusal behaviors from safety datasets |
| Software Engineering | GitHub issue resolution trajectories |
| Science | Physics, chemistry, biology via NeMo Data Designer |

For detailed data generation pipelines, see Tech Report Section 3.1.

Data Filtering#

The pipeline applies:

  • Structural checks: Discard malformed examples

  • Pathological repetition filtering: Remove examples with pathologically repeated n-grams (sketched after this list)

  • Consistency filtering: Judge-based action consistency verification

  • Narrative filtering: Remove political/nationalistic narratives
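
As an illustration of the repetition filter, the toy check below flags text whose most frequent n-gram recurs too often. The window size and threshold are placeholders, not the pipeline's actual values.

# Toy repeated n-gram check; n and max_repeats are illustrative placeholders.
from collections import Counter

def has_pathological_repetition(text, n=8, max_repeats=4):
    """Flag text whose most frequent n-gram repeats more than max_repeats times."""
    tokens = text.split()
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return bool(grams) and max(grams.values()) > max_repeats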

Hyperparameters#

| Parameter | Value |
|---|---|
| Learning Rate | 1e-5 |
| Sequence Length | 4096 tokens (`pack_size`; see packing sketch below) |
| Loss Masking | Role-based (assistant tokens only) |
| Optimizer | AdamW |
| Total Samples | 18M+ |
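
The `pack_size` acts as a packing budget: multiple tokenized examples are grouped into fixed 4096-token sequences. A minimal greedy-packing sketch follows; the real packer's handling of padding, truncation, and attention boundaries may differ.

# Greedy first-fit packing into fixed-length sequences (illustrative only).
PACK_SIZE = 4096  # matches pack_size above

def pack_examples(tokenized_examples):
    """Group token-ID lists into packs of at most PACK_SIZE tokens."""
    packs, current, used = [], [], 0
    for ids in tokenized_examples:
        if current and used + len(ids) > PACK_SIZE:
            packs.append(current)
            current, used = [], 0
        current.append(ids)  # over-long examples would need truncation in practice
        used += len(ids)
    if current:
        packs.append(current)
    return packs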

Open-Source Data#

Note: This recipe trains exclusively on the open-sourced subset of SFT data. Results will differ from the tech report benchmarks, which used additional proprietary data.

NVIDIA AI Stack#

This stage uses the following components from the NVIDIA AI Stack:

| Component | Role | Documentation |
|---|---|---|
| Megatron-Core | Distributed training primitives (TP, PP, DP, EP) | GitHub |
| Megatron-Bridge | Fine-tuning loop, checkpoint loading, loss masking | Docs |

Key Features Used#

| Feature | Purpose |
|---|---|
| `finetune()` entry point | SFT training with a pre-loaded checkpoint |
| Role-based loss masking | Compute loss on assistant tokens only |
| Mixed precision (BF16) | Memory-efficient training |
| Gradient checkpointing | Reduce memory footprint |

Container#

nvcr.io/nvidia/nemo:25.11.nemotron_3_nano

Next Steps#

After SFT completes, proceed to Stage 2: RL for alignment training.

Reference#