# Stage 2: Reinforcement Learning (RL)
This stage aligns the instruction-tuned model using GRPO (Group Relative Policy Optimization) with NeMo-RL.
Open-Source Data Only: This recipe uses exclusively open-sourced RL data from the Nemotron Post-training Datasets collection, which is a subset of the full data used to train the released model. The recipe uses the Nemotron-3-Nano-RL-Training-Blend dataset. Results will differ from the benchmarks in the tech report. Use this recipe as a reference implementation to apply the methodology with your own data.
## Quick Start

```bash
# 1. Prepare data (convert to JSONL format)
uv run nemotron nano3 data prep rl --run YOUR-CLUSTER

# 2. Run RL training
uv run nemotron nano3 rl --run YOUR-CLUSTER
```
Note: The `--run YOUR-CLUSTER` flag submits jobs via NeMo-Run. See Execution through NeMo-Run for setup.
## Direct Script Execution

Inside a container on a compute node (requires NeMo-RL and Ray):

```bash
# Data preparation
uv run python data_prep.py --config config/data_prep.yaml

# Training (Ray initialized internally)
uv run python train.py --config config/grpo_nanov3.yaml
```
## Configuration

| File | Purpose |
|---|---|
| `config/grpo_nanov3.yaml` | Production GRPO configuration |
| `config/data_prep.yaml` | Data preparation settings |
| | RL dataset blend |
Data Preparation#
The data_prep.py script converts datasets to JSONL format compatible with NeMo-RL’s NeMo-Gym interface. See Data Preparation Module for detailed documentation.
### CLI Command

```bash
uv run nemotron nano3 data prep rl [options]
```
| Option | Description |
|---|---|
| `--run YOUR-CLUSTER` | Execute on Slurm via NeMo-Run |
| | Limit rows per dataset (for testing) |
| | Force re-run, ignoring cache |
### Output

```text
output/nano3/stage2_rl/
├── train/
│   └── data.jsonl
├── val/
│   └── data.jsonl
├── test/
│   └── data.jsonl
└── manifest.json
```
The output is registered as a W&B Artifact (DataBlendsArtifact-rl) for lineage tracking.
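If you want to pull the prepared blend outside a training job, the standard W&B API can fetch it by name. A minimal sketch, assuming the artifact is registered as `DataBlendsArtifact-rl` in your `nemotron` project and that its file layout mirrors the output directory above:

```python
# Sketch: fetch the registered RL data blend and sanity-check one record.
# The artifact name/alias and internal layout are assumptions; adjust for
# your W&B project setup.
import json

import wandb

run = wandb.init(project="nemotron", job_type="inspect")
artifact = run.use_artifact("DataBlendsArtifact-rl:latest")
local_dir = artifact.download()

with open(f"{local_dir}/train/data.jsonl") as f:
    first = json.loads(f.readline())
print(sorted(first.keys()))  # inspect the record schema
run.finish()
```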
## Training

### CLI Command

```bash
uv run nemotron nano3 rl [options] [overrides...]
```
| Option | Description |
|---|---|
| `--run YOUR-CLUSTER` | Attached: submits and waits, streaming logs (NeMo-Run) |
| | Detached: submits and exits immediately (NeMo-Run) |
| | Preview execution plan |
| `key=value` | Override config values (CLI Framework) |
### Override Examples

```bash
# More iterations
uv run nemotron nano3 rl grpo.num_iterations=200

# Different temperature
uv run nemotron nano3 rl policy.generation.temperature=0.8

# Different learning rate
uv run nemotron nano3 rl grpo.learning_rate=5e-7
```
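Overrides compose with the NeMo-Run flags shown in the Quick Start, so a tweaked cluster run is a single command:

```bash
# Longer schedule with cooler sampling, submitted via NeMo-Run
uv run nemotron nano3 rl --run YOUR-CLUSTER \
    grpo.num_iterations=200 \
    policy.generation.temperature=0.8
```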
## Running with NeMo-Run

Configure execution profiles in `env.toml`:

```toml
[wandb]
project = "nemotron"
entity = "YOUR-TEAM"

[YOUR-CLUSTER]
executor = "slurm"
account = "YOUR-ACCOUNT"
partition = "batch"
nodes = 2
ntasks_per_node = 8
gpus_per_node = 8
mem = "0"
exclusive = true
mounts = ["/lustre:/lustre"]
```
See Execution through NeMo-Run for complete configuration options.
## Artifact Lineage

```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333'}}}%%
flowchart TB
    prev["ModelArtifact-sft<br/>(from Stage 1)"] --> train
    rl["RL Datasets<br/>(preference/reward data)"] --> dp["data_prep.py"]
    dp --> data["DataBlendsArtifact-rl<br/>(JSONL files)"]
    data --> train["train.py<br/>(GRPO with NeMo-RL)"]
    train --> model["ModelArtifact-rl<br/>(final aligned model)"]
    style prev fill:#f3e5f5,stroke:#9c27b0
    style rl fill:#e8f5e9,stroke:#4caf50
    style dp fill:#e8f5e9,stroke:#4caf50
    style data fill:#e8f5e9,stroke:#4caf50
    style train fill:#e8f5e9,stroke:#4caf50
    style model fill:#e8f5e9,stroke:#4caf50
```
## Methodology
For complete methodology, see Tech Report Section 3.2.
The RL pipeline consists of three components:

- **RLVR** — multi-environment training with verifiable rewards
- **RLHF with GenRM** — generative reward model-based alignment
- **DPO** — preference learning to reduce tool hallucination
### Multi-Environment RLVR
Training uses 7 reward environments through NeMo-Gym:
| Environment | Description |
|---|---|
| Competition Math | Mathematical reasoning (DAPO, SkyWorks math) |
| Competition Coding | Code correctness with test case execution |
| Question Answering | STEM multiple choice verification |
| Structured Outputs | JSON schema adherence |
| Instruction Following | IFEval, Multi-Challenge compliance |
| Long Context | 256k token multi-document synthesis |
| Agentic Tool Use | Workplace Assistant, Multi-Turn Agent |
Training on all environments simultaneously provides stable gains without degrading rewards across co-trained environments.
For GRPO algorithm details, see Tech Report Section 3.2.
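The core idea is easy to state: GRPO scores each rollout against the other rollouts sampled for the same prompt, so no learned value function is needed. A minimal sketch of the group-relative advantage (illustrative only; the NeMo-RL implementation adds ratio clipping, KL regularization, and batching):

```python
# Sketch: group-relative advantages for one prompt's rollouts.
import numpy as np

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """Normalize each rollout's reward against its own group.

    rewards: shape (num_generations,), e.g. 16 rollouts per prompt,
    scored by a verifiable reward environment.
    """
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

rewards = np.array([1.0, 0.0, 1.0, 0.0] * 4)  # 16 binary verifier rewards
print(group_relative_advantages(rewards))     # positive for correct rollouts
```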
### GenRM (RLHF)

Generative reward models use a circular comparison strategy (N comparisons per group instead of O(N²) full pairwise) with a length-normalized reward adjustment.
| Parameter | Value |
|---|---|
| Prompts per batch | 128 |
| Responses per prompt | 16 |
| Comparison strategy | Circular |
| Length bonus α | 0.5 |
For GenRM training details, see Tech Report Section 3.2.
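The circular strategy is what keeps judging costs linear: each of the N responses is compared only with its successor in a ring, giving N judge calls instead of N(N-1)/2 full pairwise comparisons. A sketch of the pairing logic (the GenRM judging call itself is out of scope here):

```python
# Sketch: ring pairing for N responses -> N comparisons.
def circular_pairs(n: int) -> list[tuple[int, int]]:
    return [(i, (i + 1) % n) for i in range(n)]

# With 16 responses per prompt: 16 judge calls instead of 120.
print(circular_pairs(4))  # [(0, 1), (1, 2), (2, 3), (3, 0)]
```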
### DPO for Tool Hallucination
DPO reduces hallucinated tool usage with minimal computational overhead:
| Metric | Before DPO | After DPO |
|---|---|---|
| AIME25 Accuracy | 80.88% | 84.58% |
| Hallucination Rate | 8.33% | 0.7% |
For DPO methodology, see Tech Report Appendix C.
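For orientation, the standard DPO objective applied here fits in a few lines. A sketch with PyTorch, assuming per-sequence log-probabilities have already been computed by the policy and a frozen reference model (the β value is an assumption, not the recipe's setting):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss, e.g. preferring responses without hallucinated tool calls."""
    chosen = policy_chosen_logps - ref_chosen_logps        # log-ratio, preferred
    rejected = policy_rejected_logps - ref_rejected_logps  # log-ratio, dispreferred
    return -F.logsigmoid(beta * (chosen - rejected)).mean()
```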
### GRPO Hyperparameters

| Parameter | Value |
|---|---|
| Prompts per step | 128 |
| Generations per prompt | 16 |
| Max generation length | 49K tokens |
| Epsilon filtering | Cosine annealing with 4% limit |
| MoE load balancing | DeepSeek aux-loss-free strategy |
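One plausible reading of the epsilon filtering row is a clipping threshold annealed along a cosine curve toward the 4% limit; a sketch under that assumption (the starting value and direction are guesses, not confirmed here):

```python
import math

def cosine_annealed_epsilon(step: int, total_steps: int,
                            eps_start: float = 0.20,   # assumed starting value
                            eps_limit: float = 0.04):  # the 4% limit
    """Anneal epsilon from eps_start to eps_limit on a cosine schedule."""
    progress = min(step / total_steps, 1.0)
    return eps_limit + 0.5 * (eps_start - eps_limit) * (1 + math.cos(math.pi * progress))
```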
### Reasoning Control

The model supports runtime reasoning control, trained with two data-side transformations (a sketch follows the list):

- **Reasoning on/off control** — strip reasoning from 10% of samples
- **Token budget control** — truncate 3% of reasoning traces to different budgets
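A hypothetical data-side implementation of both controls (the `<think>` delimiters, token budgets, and word-level truncation are assumptions for illustration):

```python
import random

THINK_OPEN, THINK_CLOSE = "<think>", "</think>"  # assumed trace delimiters

def apply_reasoning_controls(sample: str, budgets=(512, 1024, 2048)) -> str:
    start, end = sample.find(THINK_OPEN), sample.find(THINK_CLOSE)
    if start == -1 or end == -1:
        return sample  # no reasoning trace present
    r = random.random()
    if r < 0.10:
        # Reasoning-off: strip the trace from 10% of samples.
        return sample[:start] + sample[end + len(THINK_CLOSE):]
    if r < 0.13:
        # Budget control: truncate 3% of traces to a sampled budget
        # (crude word-level stand-in for token-level truncation).
        trace = sample[start + len(THINK_OPEN):end]
        budget = random.choice(budgets)
        truncated = " ".join(trace.split()[:budget])
        return sample[:start + len(THINK_OPEN)] + truncated + sample[end:]
    return sample
```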
## Requirements

- GPU nodes: 8 GPUs per node recommended (H100)
- Ray cluster: automatically initialized for distributed execution
## NVIDIA AI Stack
This stage uses the following components from the NVIDIA AI Stack:
| Component | Role |
|---|---|
| NeMo-RL | GRPO algorithm, policy training, reward computation |
| Ray | Distributed actor coordination |
| vLLM | Fast rollout generation |
### Key Features Used

| Feature | Purpose |
|---|---|
| GRPO algorithm | Group Relative Policy Optimization with clipped gradients |
| Multi-environment training | Simultaneous training across 7 reward environments |
| NeMo-Gym | Reward environments (math, code, tool-use) |
| DTensor backend | FSDP2-based distributed training |
### Architecture

NeMo-RL uses a Ray-based actor model:

| Actor | Function |
|---|---|
| Policy Model | Trainable policy weights |
| Generator | vLLM-backed rollout generation |
| Reward Model | Environment-specific reward computation |
| Reference Model | KL divergence regularization |
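As an illustration of the pattern (toy classes, not NeMo-RL's actual API), these roles map naturally onto Ray remote actors:

```python
import ray

ray.init()  # during real training, NeMo-RL brings up the Ray cluster itself

@ray.remote
class Generator:
    """Toy stand-in for the vLLM-backed rollout worker."""
    def rollout(self, prompt: str) -> str:
        return prompt + " 4"

@ray.remote
class RewardActor:
    """Toy stand-in for an environment-specific reward computation."""
    def score(self, response: str) -> float:
        return float(response.strip().endswith("4"))

gen, rm = Generator.remote(), RewardActor.remote()
response = ray.get(gen.rollout.remote("2 + 2 ="))
print(ray.get(rm.score.remote(response)))  # 1.0
```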
## Container

```text
nvcr.io/nvidia/nemo-rl:v0.4.0.nemotron_3_nano
```
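To get an interactive shell in this image for the Direct Script Execution path, a typical pull-and-run looks like the following (the mount path mirrors the `env.toml` example and will differ per cluster):

```bash
docker pull nvcr.io/nvidia/nemo-rl:v0.4.0.nemotron_3_nano
docker run --gpus all -it --rm \
    -v /lustre:/lustre \
    nvcr.io/nvidia/nemo-rl:v0.4.0.nemotron_3_nano
```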
## Open-Source Data
Note: This recipe trains exclusively on the open-sourced subset of RL data. Results will differ from the tech report benchmarks, which used additional proprietary data.
## Reference

- Tech Report Section 3.2 — RL methodology
- NVIDIA AI Stack — NeMo-RL documentation
- Artifact Lineage — W&B artifact system
- Stage 0: Pretraining — pretrain the base model
- Stage 1: SFT — instruction tuning

Recipe Source: `src/nemotron/recipes/nano3/stage2_rl/` — implementation details