Stage 0: Pretraining#

This stage trains the base Nemotron 3 Nano model from scratch on 25 trillion tokens using Megatron-Bridge.

Open-Source Data Only: This recipe uses exclusively open-sourced training data from the Nemotron Pre-training Datasets collection, which is a subset of the full data used to train the released model. The recipe includes datasets from Nemotron-CC-Math-v1, Nemotron-CC-v2, Nemotron-CC-v2.1, and Nemotron-Pretraining-Specialized-v1. Results will differ from the benchmarks in the tech report. Use this recipe as a reference implementation to apply the methodology with your own data.

Quick Start#

# 1. Prepare data (tokenize to bin/idx format)
uv run nemotron nano3 data prep pretrain --run YOUR-CLUSTER

# 2. Run pretraining
uv run nemotron nano3 pretrain --run YOUR-CLUSTER

Note: The --run YOUR-CLUSTER flag submits jobs via NeMo-Run. See Execution through NeMo-Run for setup.

Direct Script Execution#

Inside a container on a compute node:

# Data preparation
uv run python data_prep.py --config config/data_prep.yaml

# Training (single node)
uv run python train.py --config config/default.yaml

# Training (distributed)
uv run torchrun --nproc_per_node=8 train.py --config config/default.yaml

Configuration#

| File | Purpose |
| --- | --- |
| config/default.yaml | Production configuration |
| config/data_prep.yaml | Data preparation settings |
| config/data_blend_raw.json | Dataset blend definition |
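
config/data_blend_raw.json lists the source datasets and their sampling weights. Its exact schema is defined by the recipe and is not reproduced here; the snippet below is only a hypothetical illustration of a weighted blend, with placeholder field names, dataset names, and weights:

{
  "datasets": [
    { "name": "nemotron-cc-v2.1", "weight": 0.6 },
    { "name": "nemotron-cc-math-v1", "weight": 0.1 },
    { "name": "nemotron-pretraining-specialized-v1", "weight": 0.3 }
  ]
}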

Data Preparation#

The data_prep.py script tokenizes raw text datasets into Megatron’s binary format. See Data Preparation Module for detailed documentation.

CLI Command#

uv run nemotron nano3 data prep pretrain [options]

| Option | Description |
| --- | --- |
| --run <profile> | Execute on Slurm via NeMo-Run |
| --sample N | Limit rows per dataset (for testing) |
| --force | Force re-run, ignoring cache |
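
For example, a quick smoke test of the tokenization pipeline can limit each dataset to a small sample and bypass the cache:

# Tokenize only 1000 rows per dataset, ignoring any cached output
uv run nemotron nano3 data prep pretrain --run YOUR-CLUSTER --sample 1000 --force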

Output#

output/nano3/stage0_pretrain/
├── train/
│   ├── data_00000.bin
│   ├── data_00000.idx
│   └── ...
├── valid/
├── test/
└── blend.json

The output is registered as a W&B Artifact (DataBlendsArtifact-pretrain) for lineage tracking.
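
The recipe performs this registration automatically. As a rough sketch of the underlying W&B pattern (not the recipe's actual code), an artifact wrapping the prepared data could be logged like this, using the project and entity from env.toml:

import wandb

# Illustrative only; the data prep job registers this artifact for you.
run = wandb.init(project="nemotron", entity="YOUR-TEAM", job_type="data_prep")
artifact = wandb.Artifact("DataBlendsArtifact-pretrain", type="dataset")
artifact.add_dir("output/nano3/stage0_pretrain")  # bin/idx shards plus blend.json
run.log_artifact(artifact)
run.finish()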

Training#

CLI Command#

uv run nemotron nano3 pretrain [options] [overrides...]

| Option | Description |
| --- | --- |
| --run <profile> | Attached: submits and waits, streaming logs (NeMo-Run) |
| --batch <profile> | Detached: submits and exits immediately (NeMo-Run) |
| --dry-run | Preview execution plan |
| key=value | Override config values (CLI Framework) |
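
For example, to inspect the execution plan for an override before submitting it as a detached job:

# Preview the plan, then submit without waiting for completion
uv run nemotron nano3 pretrain --dry-run train.train_iters=5000
uv run nemotron nano3 pretrain --batch YOUR-CLUSTER train.train_iters=5000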

Override Examples#

# More training iterations
uv run nemotron nano3 pretrain train.train_iters=5000

# Larger batch size
uv run nemotron nano3 pretrain train.global_batch_size=64

# Different checkpoint location
uv run nemotron nano3 pretrain checkpoint.save=/path/to/checkpoints

Running with NeMo-Run#

Configure execution profiles in env.toml:

[wandb]
project = "nemotron"
entity = "YOUR-TEAM"

[YOUR-CLUSTER]
executor = "slurm"
account = "YOUR-ACCOUNT"
partition = "batch"
nodes = 2
ntasks_per_node = 8
gpus_per_node = 8
mounts = ["/lustre:/lustre"]

See Execution through NeMo-Run for complete configuration options.

Artifact Lineage#

%%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333'}}}%%
flowchart TB
    raw["Raw Text Data"] --> dp["data_prep.py"]
    dp --> data["DataBlendsArtifact-pretrain<br/>(bin/idx files + blend.json)"]
    data --> train["train.py"]
    train --> model["ModelArtifact-pretrain<br/>(checkpoint)"]
    model --> next["Stage 1: SFT"]

    style raw fill:#e1f5fe,stroke:#2196f3
    style dp fill:#e1f5fe,stroke:#2196f3
    style data fill:#e1f5fe,stroke:#2196f3
    style train fill:#e1f5fe,stroke:#2196f3
    style model fill:#e1f5fe,stroke:#2196f3
    style next fill:#f3e5f5,stroke:#9c27b0
    

Methodology#

For complete methodology, see Tech Report Section 2.

Pretraining Data#

The pretraining corpus comprises four main dataset families:

| Dataset Family | Description |
| --- | --- |
| Nemotron-CC-Code-v1 | High-quality code from Common Crawl |
| Nemotron-Pretraining-Code-v2 | GitHub code with student-teacher generation |
| Nemotron-CC-v2.1 | General English web crawl with synthetic rephrasing |
| Nemotron-Pretraining-Specialized-v1 | Synthetic STEM, math textbooks, scientific coding |

Data spans 15 categories including web crawl (various quality tiers), code, math, academic, and multilingual content.

For dataset details, see Tech Report Section 2.2.

Data Mixture#

Two-phase curriculum approach:

| Phase | Tokens | Focus |
| --- | --- | --- |
| Phase 1 | 23.5T | High diversity across web, code, math, multilingual |
| Phase 2 | 1.5T | High-quality data with curated sources |

For mixture strategy, see Tech Report Section 2.3.

Hyperparameters#

| Parameter | Value |
| --- | --- |
| Total Tokens | 25 trillion |
| Batch Size | 8192 sequences |
| Sequence Length | 4096 tokens |
| Learning Rate | 1e-4 (stable) → 1e-5 (decay) |
| Warmup | 80% of training (20T tokens) |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Weight Decay | 0.1 |
| MoE Load Balancing | DeepSeek aux-loss-free strategy |

For hyperparameter rationale, see Tech Report Section 2.4.
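
If you adapt these values, they map onto sections of config/default.yaml. The sketch below shows how such a block might look; the key names are illustrative assumptions, so check config/default.yaml for the actual ones:

# Illustrative only; key names may differ in config/default.yaml
train:
  global_batch_size: 8192   # sequences per step
  seq_length: 4096
optimizer:
  lr: 1.0e-4                # stable phase
  min_lr: 1.0e-5            # decay target
  weight_decay: 0.1
  adam_beta1: 0.9
  adam_beta2: 0.95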

Long-Context Extension#

The LC-Phase extends context to 1M tokens after main pretraining:

| Parameter | Value |
| --- | --- |
| Duration | 121 billion tokens |
| Learning Rate | 1e-5 (constant) |
| Global Batch Size | 48 |
| Parallelism | 8-way context/tensor/expert, 4-way pipeline |

For long-context methodology, see Tech Report Section 2.5.
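
Mapped onto the config keys listed under Parallelism Configuration below, this layout corresponds roughly to the following overrides. This is an illustration only; whether the LC-Phase is launched through the same pretrain entry point, and the exact keys it honors, should be verified against the recipe config:

# Illustrative mapping of the LC-Phase layout onto the documented config keys
uv run nemotron nano3 pretrain \
    train.global_batch_size=48 \
    model.tensor_model_parallel_size=8 \
    model.expert_model_parallel_size=8 \
    model.context_parallel_size=8 \
    model.pipeline_model_parallel_size=4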

Open-Source Data#

Note: This recipe trains exclusively on the open-sourced subset of pretraining data. Results will differ from the tech report benchmarks, which used additional proprietary data.

NVIDIA AI Stack#

This stage uses the following components from the NVIDIA AI Stack:

| Component | Role | Documentation |
| --- | --- | --- |
| Megatron-Core | Distributed training primitives (TP, PP, DP, EP, CP, SP) | GitHub |
| Megatron-Bridge | Model definitions, training loop, checkpoint management | Docs |

Parallelism Configuration#

Pretraining uses multiple parallelism strategies for efficient scaling:

| Parallelism | Config Key | Description |
| --- | --- | --- |
| Tensor (TP) | model.tensor_model_parallel_size | Split weight matrices across GPUs |
| Pipeline (PP) | model.pipeline_model_parallel_size | Split layers into pipeline stages |
| Data (DP) | Automatic | Replicate model, distribute batches |
| Expert (EP) | model.expert_model_parallel_size | Distribute MoE experts across GPUs |
| Context (CP) | model.context_parallel_size | Distribute long sequences |
| Sequence (SP) | model.sequence_parallel | Distribute LayerNorm/Dropout activations |
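
In config/default.yaml these keys sit under the model section. A minimal sketch, with placeholder sizes, assuming the dotted paths above translate directly into YAML nesting:

# Placeholder sizes; data parallelism (DP) is derived automatically from the remaining GPUs
model:
  tensor_model_parallel_size: 2
  pipeline_model_parallel_size: 2
  expert_model_parallel_size: 4
  context_parallel_size: 1
  sequence_parallel: true   # boolean flag, used together with tensor parallelism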

Container#

nvcr.io/nvidia/nemo:25.11.nemotron_3_nano
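
For interactive work outside of NeMo-Run, the image can be launched directly with Docker; the mount path mirrors the env.toml example above and is a placeholder for your own storage:

# Illustrative interactive launch; from inside, use the Direct Script Execution commands
docker run --rm -it --gpus all \
    -v /lustre:/lustre \
    nvcr.io/nvidia/nemo:25.11.nemotron_3_nano \
    bash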

Next Steps#

After pretraining completes, proceed to Stage 1: SFT for instruction tuning.

Reference#