Stage 0: Pretraining#

This stage trains the base Nemotron 3 Nano model from scratch on 25 trillion tokens using Megatron-Bridge.

Nemotron 3 Nano is a hybrid Mamba-Transformer-MoE model with 52 layers, combining state-space models for efficiency, attention for global context, and mixture-of-experts for capacity. Key innovations include aux-loss-free MoE balancing and a two-phase data curriculum.

Open-Source Data Only: This recipe uses only open-source training data from the Nemotron Pre-training Datasets collection, a subset of the full data used to train the released model. The recipe includes datasets from Nemotron-CC-Math-v1, Nemotron-CC-v2, Nemotron-CC-v2.1, and Nemotron-Pretraining-Specialized-v1. Results will therefore differ from the benchmarks in the tech report; use this recipe as a reference implementation for applying the methodology to your own data.


Training Methodology#

Training Framework: Pretraining is implemented using Megatron-Bridge, which provides the training loop, distributed training primitives, and checkpoint management. See Training Entry Points for details on how pretrain() works.

For complete methodology, see Tech Report Section 2.

Model Architecture#

Nemotron 3 Nano uses a hybrid Mamba-Transformer-MoE architecture with 52 layers:

| Layer Type | Count | Role |
|---|---|---|
| Mamba-2 | 23 | Efficient sequence modeling via state space |
| Attention | 6 | Global context at key positions |
| MoE | 23 | Sparse computation with 8 experts per layer |

The hybrid pattern interleaves these layer types to balance efficiency and capability:

        %%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333'}}}%%
flowchart LR
    subgraph layers["52 Layers"]
        direction LR
        m1["Mamba-2"] --> m2["Mamba-2"] --> a1["Attention"]
        a1 --> moe1["MoE"] --> m3["Mamba-2"] --> m4["..."]
    end

    style m1 fill:#e8f5e9,stroke:#4caf50
    style m2 fill:#e8f5e9,stroke:#4caf50
    style m3 fill:#e8f5e9,stroke:#4caf50
    style a1 fill:#e3f2fd,stroke:#2196f3
    style moe1 fill:#fff3e0,stroke:#ff9800
    

Key design choices:

  • Mamba-2 layers provide linear-time sequence processing, enabling efficient inference on long contexts

  • Attention layers are placed at strategic intervals (every ~8 layers) for global information mixing

  • MoE layers use 128 routed experts plus 1 shared expert, with 6 experts activated per token, keeping active parameters at ~3.5B while total parameters reach ~31.6B (see the sketch after this list)
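
The sparsity of the MoE layers is what keeps the active parameter count low: each token passes through only 6 of the 128 routed experts plus the always-on shared expert. A back-of-the-envelope sketch of that per-token arithmetic (illustrative only; it does not reproduce the ~3.5B/~31.6B figures, which also include the dense Mamba-2 and attention layers):

```python
# Illustrative arithmetic only; no real parameter breakdown is used here.
num_routed_experts = 128   # routed experts per MoE layer
num_shared_experts = 1     # always-active shared expert
top_k = 6                  # routed experts selected per token

# Fraction of an MoE layer's expert weights that a single token actually touches.
active_fraction = (top_k + num_shared_experts) / (num_routed_experts + num_shared_experts)
print(f"~{active_fraction:.1%} of expert weights active per token")  # ~5.4%
```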

For architecture rationale, see Tech Report Section 2.1.

For implementation details, see Megatron-Bridge Nemotron 3.

Pretraining Data#

The pretraining corpus comprises four main dataset families:

| Dataset Family | Description |
|---|---|
| Nemotron-CC-Code-v1 | High-quality code from Common Crawl |
| Nemotron-Pretraining-Code-v2 | GitHub code with student-teacher generation |
| Nemotron-CC-v2.1 | General English web crawl with synthetic rephrasing |
| Nemotron-Pretrain-Specialized-v1 | Synthetic STEM, math textbooks, scientific coding |

Data spans 15 categories including web crawl (various quality tiers), code, math, academic, and multilingual content.

For dataset details, see Tech Report Section 2.2.

Data Mixture#

Training follows a two-phase curriculum that transitions from broad coverage to focused quality:

| Phase | Tokens | Focus | Strategy |
|---|---|---|---|
| Phase 1 | 23.5T | Diversity | Broad coverage across all data sources |
| Phase 2 | 1.5T | Quality | Increased weight on high-quality and STEM data |

Phase 1: Foundation Building

  • Uses all dataset families with balanced weights

  • Emphasizes diversity: web (multiple quality tiers), code, math, multilingual

  • Builds broad knowledge base and language understanding

Phase 2: Quality Refinement

  • Increases sampling from high-quality sources:

    • High-Quality and High-Quality-Synthetic subsets

    • Nemotron-Pretraining-Specialized-v1 (STEM, math textbooks, scientific coding)

  • Reduces low-quality web content

  • Sharpens model capabilities on curated data

For mixture strategy details, see Tech Report Section 2.3.

Hyperparameters#

| Parameter | Value |
|---|---|
| Total Tokens | 25 trillion |
| Batch Size | 3,072 sequences |
| Sequence Length | 8,192 tokens |
| Peak Learning Rate | 1e-3 |
| Minimum Learning Rate | 1e-5 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Weight Decay | 0.1 |
| MoE Load Balancing | DeepSeek aux-loss-free strategy |

Learning Rate Schedule:

| Phase | Tokens | LR |
|---|---|---|
| Warmup | 8.4B | 0 → 1e-3 |
| Stable | 20T (80%) | 1e-3 |
| Decay | 5T (20%) | 1e-3 → 1e-5 |

The warmup is token-based (8.4B tokens), not percentage-based. The stable phase maintains peak LR for 80% of training before cosine decay.
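
A minimal sketch of this warmup-stable-decay shape, keyed on tokens seen. The endpoints and phase boundaries follow the tables above; the actual Megatron-Bridge scheduler is configured in optimizer steps and may differ in detail:

```python
import math

def lr_at(tokens_seen: float,
          total_tokens: float = 25e12,
          warmup_tokens: float = 8.4e9,
          decay_tokens: float = 5e12,
          peak_lr: float = 1e-3,
          min_lr: float = 1e-5) -> float:
    """Linear warmup -> constant peak -> cosine decay to the minimum LR."""
    decay_start = total_tokens - decay_tokens
    if tokens_seen < warmup_tokens:                       # warmup: 0 -> 1e-3 over 8.4B tokens
        return peak_lr * tokens_seen / warmup_tokens
    if tokens_seen < decay_start:                         # stable phase at peak LR
        return peak_lr
    frac = min((tokens_seen - decay_start) / decay_tokens, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * frac))  # cosine decay

for t in (1e9, 1e12, 21e12, 25e12):
    print(f"{t:.2e} tokens -> lr {lr_at(t):.2e}")
```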

For hyperparameter rationale, see Tech Report Section 2.4.

MoE Load Balancing#

Nemotron 3 Nano uses the aux-loss-free load balancing strategy from DeepSeek, avoiding the auxiliary losses traditionally used to balance expert utilization.

Why aux-loss-free?

Traditional MoE training adds an auxiliary loss term to encourage balanced routing. However, this:

  • Adds a hyperparameter (aux loss weight) that’s hard to tune

  • Can conflict with the main training objective

  • May hurt model quality at scale

How it works:

Instead of auxiliary losses, the router uses bias terms that are adjusted dynamically:

  • Track expert utilization over a sliding window

  • Increase bias for underutilized experts (more tokens routed to them)

  • Decrease bias for overloaded experts

  • No gradient flows through the bias adjustment

This achieves balanced expert utilization without interfering with the main loss function.
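
A minimal sketch of the idea described above: the bias influences which experts are selected, the gate weights still come from the unbiased scores, and the bias update is gradient-free. Function names, shapes, and the step size `gamma` are assumptions for illustration; this is not Megatron-Bridge's router implementation:

```python
import torch

def route_with_bias(logits: torch.Tensor, bias: torch.Tensor, top_k: int = 6):
    """Pick experts using bias-adjusted scores; compute gates from unbiased scores."""
    scores = logits.softmax(dim=-1)                       # [tokens, experts]
    _, expert_idx = (scores + bias).topk(top_k, dim=-1)   # bias affects selection only
    gates = scores.gather(-1, expert_idx)                 # gradients flow through scores, not bias
    return expert_idx, gates

@torch.no_grad()
def update_bias(bias: torch.Tensor, expert_idx: torch.Tensor, gamma: float = 1e-3):
    """Nudge underused experts up and overloaded experts down; no gradient involved."""
    num_experts = bias.numel()
    load = torch.bincount(expert_idx.flatten(), minlength=num_experts).float()
    target = load.mean()
    bias += gamma * torch.sign(target - load)             # underloaded -> +gamma, overloaded -> -gamma
    return bias
```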

Long-Context Extension#

The long-context (LC) phase extends the context window to 1M tokens after main pretraining:

| Parameter | Value |
|---|---|
| Duration | 121 billion tokens |
| Learning Rate | 1e-5 (constant) |
| Global Batch Size | 48 |
| Parallelism | 8-way context/tensor/expert, 4-way pipeline |

For long-context methodology, see Tech Report Section 2.5.


Recipe Execution#

Quick Start#

# 1. Prepare data (tokenize to bin/idx format)
uv run nemotron nano3 data prep pretrain --run YOUR-CLUSTER

# 2. Run pretraining
uv run nemotron nano3 pretrain --run YOUR-CLUSTER

Note: The --run YOUR-CLUSTER flag submits jobs via NeMo-Run. See Execution through NeMo-Run for setup.

Direct Script Execution (Megatron-Bridge)#

For direct execution outside this CLI, use the scripts in the Megatron-Bridge repository:

# Clone the repository and checkout the nano-v3 branch
git clone https://github.com/NVIDIA-NeMo/Megatron-Bridge.git
cd Megatron-Bridge
git checkout nano-v3

# Run pretraining (inside container on compute node)
python examples/recipes/nemotron_3/pretrain_nemotron_3_nano.py \
    --per-split-data-args-path /path/to/data_args.json \
    --tokenizer-model /path/to/tokenizer.model

# With config file overrides
python examples/recipes/nemotron_3/pretrain_nemotron_3_nano.py \
    --config-file /path/to/overrides.yaml \
    --per-split-data-args-path /path/to/data_args.json \
    --tokenizer-model /path/to/tokenizer.model

See the Megatron-Bridge Nemotron 3 documentation for detailed configuration options.

Configuration#

| File | Purpose |
|---|---|
| `config/default.yaml` | Production configuration |
| `config/data_prep.yaml` | Data preparation settings |
| `config/data_blend_raw.json` | Dataset blend definition |

Blend Configuration

Data blends are defined in config/data_prep/data_blend_raw.json. Each entry specifies:

{
  "name": "dataset-name",
  "path": "hf://nvidia/...",
  "subset": "subset-name",
  "weight": 1.0
}

Weights control sampling probability during data preparation. Phase transitions are implemented by using different blend configurations.
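
To make the weight semantics concrete, here is a small sketch with hypothetical dataset names and weights (the normalization below is the usual interpretation of blend weights; the exact behavior of data_prep.py may differ):

```python
# Hypothetical entries in the shape shown above ("..." left as a placeholder).
blend = [
    {"name": "web-high-quality", "path": "hf://nvidia/...", "weight": 3.0},
    {"name": "code",             "path": "hf://nvidia/...", "weight": 2.0},
    {"name": "math",             "path": "hf://nvidia/...", "weight": 1.0},
]

total_weight = sum(entry["weight"] for entry in blend)
for entry in blend:
    # Normalized weight = probability that a training sample is drawn from this dataset.
    print(f"{entry['name']}: {entry['weight'] / total_weight:.0%}")
```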

Data Preparation#

The data_prep.py script tokenizes raw text datasets into Megatron’s binary format. See Data Preparation Module for detailed documentation.
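
Conceptually, the binary format stores the concatenated token IDs of all documents plus an index of document boundaries, so training can seek directly to any sample without re-tokenizing. A toy sketch of that idea (this is not Megatron's actual .bin/.idx layout, and `write_toy_bin_idx` is a hypothetical helper):

```python
import numpy as np

def write_toy_bin_idx(docs_token_ids: list[list[int]], prefix: str) -> None:
    """Toy illustration: concatenated token IDs in .bin, document offsets in .idx."""
    tokens = np.concatenate([np.asarray(doc, dtype=np.int32) for doc in docs_token_ids])
    offsets = np.cumsum([0] + [len(doc) for doc in docs_token_ids]).astype(np.int64)
    tokens.tofile(f"{prefix}.bin")   # flat token stream
    offsets.tofile(f"{prefix}.idx")  # document boundaries for random access

write_toy_bin_idx([[101, 2023, 102], [101, 7592, 2088, 102]], "data_00000")
```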

CLI Command#

uv run nemotron nano3 data prep pretrain [options]

| Option | Description |
|---|---|
| `--run <profile>` | Execute on Slurm via NeMo-Run |
| `--sample N` | Limit rows per dataset (for testing) |
| `--force` | Force re-run, ignoring cache |

Output#

output/nano3/stage0_pretrain/
├── train/
│   ├── data_00000.bin
│   ├── data_00000.idx
│   └── ...
├── valid/
├── test/
└── blend.json

The output is registered as a W&B Artifact (DataBlendsArtifact-pretrain) for lineage tracking.
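
For reference, a minimal sketch of how such an output directory could be registered as a W&B artifact. The project name, job type, and artifact type string here are assumptions; the pipeline's actual registration code may differ:

```python
import wandb

# Hypothetical registration sketch for lineage tracking.
run = wandb.init(project="nemotron", job_type="data-prep")

artifact = wandb.Artifact("DataBlendsArtifact-pretrain", type="dataset")
artifact.add_dir("output/nano3/stage0_pretrain")  # bin/idx shards + blend.json
run.log_artifact(artifact)
run.finish()
```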

Training#

CLI Command#

uv run nemotron nano3 pretrain [options] [overrides...]

| Option | Description |
|---|---|
| `--run <profile>` | Attached: submits and waits, streaming logs (NeMo-Run) |
| `--batch <profile>` | Detached: submits and exits immediately (NeMo-Run) |
| `--dry-run` | Preview execution plan |
| `key=value` | Override config values (CLI Framework) |

Override Examples#

# More training iterations
uv run nemotron nano3 pretrain train.train_iters=5000

# Larger batch size
uv run nemotron nano3 pretrain train.global_batch_size=64

# Different checkpoint location
uv run nemotron nano3 pretrain checkpoint.save=/path/to/checkpoints

Running with NeMo-Run#

Configure execution profiles in env.toml:

[wandb]
project = "nemotron"
entity = "YOUR-TEAM"

[YOUR-CLUSTER]
executor = "slurm"
account = "YOUR-ACCOUNT"
partition = "batch"
nodes = 2
ntasks_per_node = 8
gpus_per_node = 8
mounts = ["/lustre:/lustre"]

See Execution through NeMo-Run for complete configuration options.

Checkpoint & Resume#

Training automatically saves checkpoints at regular intervals. To resume from a checkpoint:

# Resume from a specific checkpoint
uv run nemotron nano3 pretrain checkpoint.load=/path/to/checkpoint

# Resume from latest checkpoint in a directory
uv run nemotron nano3 pretrain checkpoint.load=/path/to/checkpoints/

Checkpoint Configuration:

| Option | Description |
|---|---|
| `checkpoint.save` | Directory for saving checkpoints |
| `checkpoint.load` | Path to checkpoint for resuming |
| `checkpoint.save_interval` | Steps between saves (default: 1000) |

Checkpoints use Megatron’s distributed format, which handles model parallelism automatically. Each checkpoint contains model weights, optimizer state, and training progress.

For checkpoint format and advanced options, see Megatron-Bridge Checkpointing.

Artifact Lineage#

        %%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333'}}}%%
flowchart TB
    raw["Raw Text Data"] --> dp["data_prep.py"]
    dp --> data["DataBlendsArtifact-pretrain<br/>(bin/idx files + blend.json)"]
    data --> train["train.py"]
    train --> model["ModelArtifact-pretrain<br/>(checkpoint)"]
    model --> next["Stage 1: SFT"]

    style raw fill:#e1f5fe,stroke:#2196f3
    style dp fill:#e1f5fe,stroke:#2196f3
    style data fill:#e1f5fe,stroke:#2196f3
    style train fill:#e1f5fe,stroke:#2196f3
    style model fill:#e1f5fe,stroke:#2196f3
    style next fill:#f3e5f5,stroke:#9c27b0
    

Infrastructure#

This stage uses the following components from the NVIDIA AI Stack:

| Component | Role | Documentation |
|---|---|---|
| Megatron-Core | Distributed training primitives (TP, PP, DP, EP, CP, SP) | GitHub |
| Megatron-Bridge | Model definitions, training loop, checkpoint management | Docs |

Parallelism Configuration#

Pretraining uses multiple parallelism strategies for efficient scaling. The specific values differ between main pretraining and long-context extension:

| Parallelism | Main Pretraining | Long-Context (LC) | Config Key |
|---|---|---|---|
| Tensor (TP) | 8 | 8 | `model.tensor_model_parallel_size` |
| Pipeline (PP) | 1 | 4 | `model.pipeline_model_parallel_size` |
| Expert (EP) | 8 | 8 | `model.expert_model_parallel_size` |
| Context (CP) | 1 | 8 | `model.context_parallel_size` |
| Sequence (SP) | Yes | Yes | `model.sequence_parallel` |
| Data (DP) | Auto | Auto | Computed from world size |

Why the difference?

  • Main pretraining uses 8K sequences (see Hyperparameters), so context parallelism isn’t needed (CP=1)

  • Long-context extension handles up to 1M tokens, requiring CP=8 to distribute sequences across GPUs

  • Pipeline parallelism increases in the LC phase (PP=4) to handle larger activation memory (the sketch below shows how the data-parallel degree falls out of these settings)
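
The "Auto" data-parallel degree in the table above is derived from the remaining settings. A rough sketch following the usual Megatron-Core convention of DP = world size / (TP × PP × CP), with expert parallelism sharing ranks inside the data-parallel group; treat the formula and the GPU counts below as assumptions and consult the Megatron-Core docs for the authoritative layout:

```python
def data_parallel_size(world_size: int, tp: int, pp: int, cp: int) -> int:
    """DP = world_size / (TP * PP * CP); expert parallelism reuses ranks within the DP group."""
    denom = tp * pp * cp
    assert world_size % denom == 0, "world size must be divisible by TP * PP * CP"
    return world_size // denom

# Hypothetical 1,024-GPU job:
print(data_parallel_size(1024, tp=8, pp=1, cp=1))  # main pretraining -> DP = 128
print(data_parallel_size(1024, tp=8, pp=4, cp=8))  # long-context phase -> DP = 4
```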

For parallelism concepts, see NVIDIA AI Stack: Parallelism.

Container#

nvcr.io/nvidia/nemo:25.11.nemotron_3_nano

Next Steps#

After pretraining completes, proceed to Stage 1: SFT for instruction tuning.

Reference#