Stage 0: Pretraining#
This stage trains the base Nemotron 3 Nano model from scratch on 25 trillion tokens using Megatron-Bridge.
Open-Source Data Only: This recipe uses exclusively open-sourced training data from the Nemotron Pre-training Datasets collection, which is a subset of the full data used to train the released model. The recipe includes datasets from Nemotron-CC-Math-v1, Nemotron-CC-v2, Nemotron-CC-v2.1, and Nemotron-Pretraining-Specialized-v1. Results will differ from the benchmarks in the tech report. Use this recipe as a reference implementation to apply the methodology with your own data.
Quick Start#
```bash
# 1. Prepare data (tokenize to bin/idx format)
uv run nemotron nano3 data prep pretrain --run YOUR-CLUSTER

# 2. Run pretraining
uv run nemotron nano3 pretrain --run YOUR-CLUSTER
```
Note: The `--run YOUR-CLUSTER` flag submits jobs via NeMo-Run. See Execution through NeMo-Run for setup.
Direct Script Execution#
Inside a container on a compute node:
```bash
# Data preparation
uv run python data_prep.py --config config/data_prep.yaml

# Training (single node)
uv run python train.py --config config/default.yaml

# Training (distributed, 8 GPUs per node)
uv run torchrun --nproc_per_node=8 train.py --config config/default.yaml
```
Configuration#
| File | Purpose |
|---|---|
| `config/default.yaml` | Production configuration |
| `config/data_prep.yaml` | Data preparation settings |
| `blend.json` | Dataset blend definition |
Data Preparation#
The data_prep.py script tokenizes raw text datasets into Megatron’s binary format. See Data Preparation Module for detailed documentation.
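Conceptually, the `.bin` file holds the concatenated token IDs and the `.idx` file records where each document starts. The sketch below illustrates that layout only; it is not Megatron's exact on-disk format, which is produced by the real `data_prep.py`:

```python
# Simplified illustration of the bin/idx idea: token IDs concatenated in a
# .bin file, document boundaries recorded in a .idx file. NOT Megatron's
# exact on-disk format -- use data_prep.py to produce training data.
import numpy as np

def write_bin_idx(documents, prefix):
    """documents: list of token-ID lists; writes <prefix>.bin and <prefix>.idx."""
    offsets = [0]
    with open(f"{prefix}.bin", "wb") as f:
        for doc in documents:
            ids = np.asarray(doc, dtype=np.int32)
            ids.tofile(f)
            offsets.append(offsets[-1] + len(ids))
    # Cumulative offsets let a reader slice individual documents out of the .bin.
    np.asarray(offsets, dtype=np.int64).tofile(f"{prefix}.idx")

write_bin_idx([[17, 2023, 5], [17, 7592, 5, 42]], "data_00000")
```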
CLI Command#
```bash
uv run nemotron nano3 data prep pretrain [options]
```
| Option | Description |
|---|---|
| `--run YOUR-CLUSTER` | Execute on Slurm via NeMo-Run |
| | Limit rows per dataset (for testing) |
| | Force re-run, ignoring cache |
Output#
```text
output/nano3/stage0_pretrain/
├── train/
│   ├── data_00000.bin
│   ├── data_00000.idx
│   └── ...
├── valid/
├── test/
└── blend.json
```
The output is registered as a W&B Artifact (DataBlendsArtifact-pretrain) for lineage tracking.
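A minimal sketch of that registration with the standard `wandb` API; the artifact name matches the one above, while the project name and directory path are assumptions rather than the recipe's actual values:

```python
# Hedged sketch: registering the prepared blend as a W&B Artifact for
# lineage tracking. Project name and paths are illustrative assumptions.
import wandb

run = wandb.init(project="nemotron", job_type="data_prep")
artifact = wandb.Artifact("DataBlendsArtifact-pretrain", type="dataset")
artifact.add_dir("output/nano3/stage0_pretrain")  # bin/idx shards + blend.json
run.log_artifact(artifact)  # downstream stages can pull this artifact by name
run.finish()
```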
Training#
CLI Command#
```bash
uv run nemotron nano3 pretrain [options] [overrides...]
```
| Option | Description |
|---|---|
| `--run YOUR-CLUSTER` | Attached: submits and waits, streaming logs (NeMo-Run) |
| | Detached: submits and exits immediately (NeMo-Run) |
| | Preview execution plan |
| `key=value` | Override config values (CLI Framework) |
Override Examples#
```bash
# More training iterations
uv run nemotron nano3 pretrain train.train_iters=5000

# Larger batch size
uv run nemotron nano3 pretrain train.global_batch_size=64

# Different checkpoint location
uv run nemotron nano3 pretrain checkpoint.save=/path/to/checkpoints
```
Running with NeMo-Run#
Configure execution profiles in env.toml:
```toml
[wandb]
project = "nemotron"
entity = "YOUR-TEAM"

[YOUR-CLUSTER]
executor = "slurm"
account = "YOUR-ACCOUNT"
partition = "batch"
nodes = 2
ntasks_per_node = 8
gpus_per_node = 8
mounts = ["/lustre:/lustre"]
```
See Execution through NeMo-Run for complete configuration options.
Artifact Lineage#
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333'}}}%%
flowchart TB
    raw["Raw Text Data"] --> dp["data_prep.py"]
    dp --> data["DataBlendsArtifact-pretrain<br/>(bin/idx files + blend.json)"]
    data --> train["train.py"]
    train --> model["ModelArtifact-pretrain<br/>(checkpoint)"]
    model --> next["Stage 1: SFT"]

    style raw fill:#e1f5fe,stroke:#2196f3
    style dp fill:#e1f5fe,stroke:#2196f3
    style data fill:#e1f5fe,stroke:#2196f3
    style train fill:#e1f5fe,stroke:#2196f3
    style model fill:#e1f5fe,stroke:#2196f3
    style next fill:#f3e5f5,stroke:#9c27b0
```
Methodology#
For complete methodology, see Tech Report Section 2.
Pretraining Data#
The pretraining corpus comprises four main dataset families:
| Dataset Family | Description |
|---|---|
| Nemotron-CC-Code-v1 | High-quality code from Common Crawl |
| Nemotron-Pretraining-Code-v2 | GitHub code with student-teacher generation |
| Nemotron-CC-v2.1 | General English web crawl with synthetic rephrasing |
| Nemotron-Pretrain-Specialized-v1 | Synthetic STEM, math textbooks, scientific coding |
Data spans 15 categories including web crawl (various quality tiers), code, math, academic, and multilingual content.
For dataset details, see Tech Report Section 2.2.
Data Mixture#
Two-phase curriculum approach:
| Phase | Tokens | Focus |
|---|---|---|
| Phase 1 | 23.5T | High diversity across web, code, math, multilingual |
| Phase 2 | 1.5T | High-quality data with curated sources |
For mixture strategy, see Tech Report Section 2.3.
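To make the phase switch concrete, here is a hypothetical sketch; the category names, weights, and switching logic are illustrative only, and the real blend comes from the `blend.json` produced by data prep:

```python
# Hypothetical two-phase blend, shown only to illustrate the curriculum
# switch; actual categories and weights come from blend.json.
PHASE_1_TOKENS = 23.5e12

PHASE_1 = {  # high diversity
    "web_crawl": 0.60,
    "code": 0.20,
    "math": 0.10,
    "multilingual": 0.10,
}
PHASE_2 = {  # curated, high-quality sources
    "curated_web": 0.50,
    "code": 0.25,
    "math_textbooks": 0.25,
}

def blend_for(tokens_seen):
    """Return the sampling weights in effect at a given point in training."""
    return PHASE_1 if tokens_seen < PHASE_1_TOKENS else PHASE_2

print(blend_for(5e12))    # Phase 1 weights
print(blend_for(24e12))   # Phase 2 weights
```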
Hyperparameters#
| Parameter | Value |
|---|---|
| Total Tokens | 25 trillion |
| Batch Size | 8192 sequences |
| Sequence Length | 4096 tokens |
| Learning Rate | 1e-4 (stable) → 1e-5 (decay) |
| Warmup | 80% of training (20T tokens) |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Weight Decay | 0.1 |
| MoE Load Balancing | DeepSeek aux-loss-free strategy |
For hyperparameter rationale, see Tech Report Section 2.4.
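One reading of the schedule rows above is a constant 1e-4 for the first 80% of training (20T tokens), then a decay to 1e-5 over the remainder. The decay shape (linear here) is an assumption, so treat this as a sketch rather than the recipe's exact schedule:

```python
# Sketch of a stable -> decay LR schedule consistent with the table above.
# The linear decay shape is an assumption; see Tech Report Section 2.4.
STABLE_LR, MIN_LR = 1e-4, 1e-5
TOTAL_TOKENS = 25e12
STABLE_TOKENS = 20e12  # first 80% of training

def lr_at(tokens_seen):
    if tokens_seen <= STABLE_TOKENS:
        return STABLE_LR
    frac = (tokens_seen - STABLE_TOKENS) / (TOTAL_TOKENS - STABLE_TOKENS)
    return STABLE_LR + frac * (MIN_LR - STABLE_LR)  # linear decay to MIN_LR

print(lr_at(10e12))  # 1e-4 (stable phase)
print(lr_at(25e12))  # 1e-5 (end of decay)
```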
Long-Context Extension#
The LC-Phase extends context to 1M tokens after main pretraining:
| Parameter | Value |
|---|---|
| Duration | 121 billion tokens |
| Learning Rate | 1e-5 (constant) |
| Global Batch Size | 48 |
| Parallelism | 8-way context/tensor/expert, 4-way pipeline |
For long-context methodology, see Tech Report Section 2.5.
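As a back-of-envelope check, a 121B-token phase at a global batch of 48 sequences implies roughly 2,500 optimizer steps, assuming every sequence is packed to the full 1M-token length:

```python
# LC-phase step count implied by the table above. Assumes sequences are
# packed to the full 1M-token context length.
tokens = 121e9
global_batch_size = 48   # sequences per optimizer step
seq_len = 1_000_000      # tokens per sequence
steps = tokens / (global_batch_size * seq_len)
print(round(steps))      # ~2521
```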
Open-Source Data#
Note: This recipe trains exclusively on the open-sourced subset of pretraining data. Results will differ from the tech report benchmarks, which used additional proprietary data.
NVIDIA AI Stack#
This stage uses the following components from the NVIDIA AI Stack:
| Component | Role |
|---|---|
| Megatron-Core | Distributed training primitives (TP, PP, DP, EP, CP, SP) |
| Megatron-Bridge | Model definitions, training loop, checkpoint management |
Parallelism Configuration#
Pretraining uses multiple parallelism strategies for efficient scaling:
| Parallelism | Config Key | Description |
|---|---|---|
| Tensor (TP) | `tensor_model_parallel_size` | Split weight matrices across GPUs |
| Pipeline (PP) | `pipeline_model_parallel_size` | Split layers into pipeline stages |
| Data (DP) | Automatic | Replicate model, distribute batches |
| Expert (EP) | `expert_model_parallel_size` | Distribute MoE experts across GPUs |
| Context (CP) | `context_parallel_size` | Distribute long sequences |
| Sequence (SP) | `sequence_parallel` | Distribute LayerNorm/Dropout activations |
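The degrees compose multiplicatively: data-parallel size is whatever remains of the world size after tensor, pipeline, and context parallelism are assigned (expert parallelism then partitions the data-parallel group for MoE layers). A small sketch, with the GPU count chosen purely for illustration:

```python
# How the parallelism degrees compose. DP size is derived from what is
# left after TP, PP, and CP are assigned; EP further partitions the
# data-parallel group for expert layers.
def data_parallel_size(world_size, tp, pp, cp):
    model_parallel = tp * pp * cp
    assert world_size % model_parallel == 0, "world size must divide evenly"
    return world_size // model_parallel

# LC-phase degrees from the table above (TP=8, CP=8, PP=4) on a
# hypothetical 2048-GPU job:
print(data_parallel_size(2048, tp=8, pp=4, cp=8))  # -> 8
```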
Container#
```text
nvcr.io/nvidia/nemo:25.11.nemotron_3_nano
```
Next Steps#
After pretraining completes, proceed to Stage 1: SFT for instruction tuning.
Reference#
Tech Report Section 2 — Pretraining methodology
NVIDIA AI Stack — Megatron-Core, Megatron-Bridge documentation
Artifact Lineage — W&B artifact system
Recipe Source:
`src/nemotron/recipes/nano3/stage0_pretrain/` — Implementation details