Stage 2: Reinforcement Learning (RL)#

This stage aligns the instruction-tuned model using GRPO (Group Relative Policy Optimization) with NeMo-RL.

Open-Source Data Only: This recipe uses exclusively open-sourced RL data from the Nemotron Post-training Datasets collection, which is a subset of the full data used to train the released model. The recipe uses the Nemotron-3-Nano-RL-Training-Blend dataset. Results will differ from the benchmarks in the tech report. Use this recipe as a reference implementation to apply the methodology with your own data.

Quick Start#

# 1. Prepare data (convert to JSONL format)
uv run nemotron nano3 data prep rl --run YOUR-CLUSTER

# 2. Run RL training
uv run nemotron nano3 rl --run YOUR-CLUSTER

Note: The --run YOUR-CLUSTER flag submits jobs via NeMo-Run. See Execution through NeMo-Run for setup.

Direct Script Execution#

Inside a container on a compute node (requires NeMo-RL and Ray):

# Data preparation
uv run python data_prep.py --config config/data_prep.yaml

# Training (Ray initialized internally)
uv run python train.py --config config/grpo_nanov3.yaml

Configuration#

| File | Purpose |
|------|---------|
| config/grpo_nanov3.yaml | Production GRPO configuration |
| config/data_prep.yaml | Data preparation settings |
| config/data_blend_raw.json | RL dataset blend |

Data Preparation#

The data_prep.py script converts datasets to JSONL format compatible with NeMo-RL’s NeMo-Gym interface. See Data Preparation Module for detailed documentation.
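The exact record schema is owned by data_prep.py and the NeMo-Gym environments, so consult the generated manifest.json for the authoritative layout. As a rough, hypothetical illustration of what one JSONL line might hold (the field names below are assumptions, not the real schema):

```python
import json

# Hypothetical record; field names are illustrative only -- the real schema is
# whatever data_prep.py emits for the NeMo-Gym environments.
record = {
    "prompt": "What is 12 * 13?",
    "environment": "competition_math",      # which reward environment scores it
    "metadata": {"source": "open-source RL blend"},
}

with open("data.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")      # one JSON object per line
```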

CLI Command#

uv run nemotron nano3 data prep rl [options]

| Option | Description |
|--------|-------------|
| --run <profile> | Execute on Slurm via NeMo-Run |
| --sample N | Limit rows per dataset (for testing) |
| --force | Force re-run, ignoring cache |

Output#

output/nano3/stage2_rl/
├── train/
│   └── data.jsonl
├── val/
│   └── data.jsonl
├── test/
│   └── data.jsonl
└── manifest.json

The output is registered as a W&B Artifact (DataBlendsArtifact-rl) for lineage tracking.
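For reference, registering a local output directory as a W&B artifact follows the usual wandb pattern; the snippet below is a minimal sketch (project name and paths are assumptions), not the exact code in data_prep.py.

```python
import wandb

# Minimal sketch of artifact registration for lineage tracking.
run = wandb.init(project="nemotron", job_type="data-prep")
artifact = wandb.Artifact("DataBlendsArtifact-rl", type="dataset")
artifact.add_dir("output/nano3/stage2_rl")  # train/, val/, test/, manifest.json
run.log_artifact(artifact)
run.finish()
```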

Training#

CLI Command#

uv run nemotron nano3 rl [options] [overrides...]

| Option | Description |
|--------|-------------|
| --run <profile> | Attached: submits and waits, streaming logs (NeMo-Run) |
| --batch <profile> | Detached: submits and exits immediately (NeMo-Run) |
| --dry-run | Preview execution plan |
| key=value | Override config values (CLI Framework) |

Override Examples#

# More iterations
uv run nemotron nano3 rl grpo.num_iterations=200

# Different temperature
uv run nemotron nano3 rl policy.generation.temperature=0.8

# Different learning rate
uv run nemotron nano3 rl grpo.learning_rate=5e-7

Running with NeMo-Run#

Configure execution profiles in env.toml:

[wandb]
project = "nemotron"
entity = "YOUR-TEAM"

[YOUR-CLUSTER]
executor = "slurm"
account = "YOUR-ACCOUNT"
partition = "batch"
nodes = 2
ntasks_per_node = 8
gpus_per_node = 8
mem = "0"
exclusive = true
mounts = ["/lustre:/lustre"]

See Execution through NeMo-Run for complete configuration options.

Artifact Lineage#

        %%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333'}}}%%
flowchart TB
    prev["ModelArtifact-sft<br/>(from Stage 1)"] --> train
    rl["RL Datasets<br/>(preference/reward data)"] --> dp["data_prep.py"]
    dp --> data["DataBlendsArtifact-rl<br/>(JSONL files)"]
    data --> train["train.py<br/>(GRPO with NeMo-RL)"]
    train --> model["ModelArtifact-rl<br/>(final aligned model)"]

    style prev fill:#f3e5f5,stroke:#9c27b0
    style rl fill:#e8f5e9,stroke:#4caf50
    style dp fill:#e8f5e9,stroke:#4caf50
    style data fill:#e8f5e9,stroke:#4caf50
    style train fill:#e8f5e9,stroke:#4caf50
    style model fill:#e8f5e9,stroke:#4caf50
    

Methodology#

For complete methodology, see Tech Report Section 3.2.

The RL pipeline consists of three components:

  1. RLVR — Multi-environment training with verifiable rewards

  2. RLHF with GenRM — Generative reward model-based alignment

  3. DPO — Preference learning to reduce tool hallucination

Multi-Environment RLVR#

Training uses 7 reward environments through NeMo-Gym:

| Environment | Description |
|-------------|-------------|
| Competition Math | Mathematical reasoning (DAPO, SkyWorks math) |
| Competition Coding | Code correctness with test case execution |
| Question Answering | STEM multiple-choice verification |
| Structured Outputs | JSON schema adherence |
| Instruction Following | IFEval, Multi-Challenge compliance |
| Long Context | 256k-token multi-document synthesis |
| Agentic Tool Use | Workplace Assistant, Multi-Turn Agent |

Training on all environments simultaneously provides stable gains without co-reward degradation.

For GRPO algorithm details, see Tech Report Section 3.2.

GenRM (RLHF)#

Generative reward models use a circular comparison strategy (N comparisons instead of O(N²)) together with a length-normalized reward adjustment.
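The circular strategy is easy to picture: each of the N responses to a prompt is judged only against its neighbor in a ring, so the GenRM runs exactly N comparisons rather than scoring every pair. A minimal sketch of the pairing (illustrative only; the judging itself is done by the generative reward model):

```python
def circular_pairs(responses):
    """Yield (response_i, response_{(i+1) % N}) pairs: N comparisons per prompt."""
    n = len(responses)
    for i in range(n):
        yield responses[i], responses[(i + 1) % n]

# With 16 responses per prompt (see the table below) this is 16 GenRM comparisons,
# versus 120 if every unordered pair were judged.
pairs = list(circular_pairs([f"response_{i}" for i in range(16)]))
assert len(pairs) == 16
```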

| Parameter | Value |
|-----------|-------|
| Prompts per batch | 128 |
| Responses per prompt | 16 |
| Comparison strategy | Circular |
| Length bonus α | 0.5 |

For GenRM training details, see Tech Report Section 3.2.

DPO for Tool Hallucination#

DPO reduces hallucinated tool usage with minimal computational overhead:

| Metric | Before DPO | After DPO |
|--------|------------|-----------|
| AIME25 Accuracy | 80.88% | 84.58% |
| Hallucination Rate | 8.33% | 0.7% |

For DPO methodology, see Tech Report Appendix C.
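For orientation, the standard DPO objective pushes the policy to prefer the chosen response (for example, one that answers without a spurious tool call) over the rejected one, relative to a frozen reference model. The sketch below is the generic DPO loss under that standard formulation; β is a placeholder value and this is not the recipe's exact implementation.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Generic DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```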

GRPO Hyperparameters#

| Parameter | Value |
|-----------|-------|
| Prompts per step | 128 |
| Generations per prompt | 16 |
| Max generation length | 49K tokens |
| Epsilon filtering | Cosine annealing with 4% limit |
| MoE load balancing | DeepSeek aux-loss-free strategy |
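The group-relative part of GRPO means there is no learned value baseline: every prompt is rolled out several times and each response's reward is normalized against its own group. A minimal sketch of that normalization, assuming the standard mean/std form (the exact advantage estimator and clipping used in this recipe are described in the tech report):

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize rewards within one group (all generations of a single prompt)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# 16 generations per prompt, as in the table above; e.g. binary verifier rewards.
group_rewards = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1], dtype=float)
advantages = group_relative_advantages(group_rewards)
```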

Reasoning Control#

The model supports:

  • Reasoning on/off control — Strip reasoning from 10% of samples (see the sketch after this list)

  • Token budget control — Truncate 3% of reasoning traces to different budgets
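For the first control, a minimal sketch of what stripping a reasoning trace from a sample could look like (the 10% rate is from the list above; the `<think>` tag names and the data layout are assumptions for illustration):

```python
import random
import re

THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def maybe_strip_reasoning(response: str, p_strip: float = 0.10) -> str:
    """With ~10% probability, drop the reasoning trace so the model also
    learns to answer in reasoning-off mode. Tag names are illustrative."""
    if random.random() < p_strip:
        return THINK_RE.sub("", response)
    return response
```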

Requirements#

  • GPU nodes: Recommended 8 GPUs per node (H100)

  • Ray cluster: Automatically initialized for distributed execution

NVIDIA AI Stack#

This stage uses the following components from the NVIDIA AI Stack:

| Component | Role | Documentation |
|-----------|------|---------------|
| NeMo-RL | GRPO algorithm, policy training, reward computation | Docs |
| Ray | Distributed actor coordination | Docs |
| vLLM | Fast rollout generation | GitHub |

Key Features Used#

| Feature | Purpose |
|---------|---------|
| GRPO algorithm | Group Relative Policy Optimization with clipped gradients |
| Multi-environment training | Simultaneous training across 7 reward environments |
| NeMo-Gym | Reward environments (math, code, tool use) |
| DTensor backend | FSDP2-based distributed training |

Architecture#

NeMo-RL uses a Ray-based actor model:

| Actor | Function |
|-------|----------|
| Policy Model | Trainable policy weights |
| Generator | vLLM-backed rollout generation |
| Reward Model | Environment-specific reward computation |
| Reference Model | KL divergence regularization |
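The actor pattern itself is plain Ray: each component runs as a long-lived remote actor, and the training loop coordinates them through object references. The sketch below only illustrates that pattern; the class and method names are invented for illustration and this is not the NeMo-RL implementation.

```python
import ray

ray.init()  # NeMo-RL initializes Ray internally; done explicitly here for a standalone sketch

@ray.remote
class Generator:
    """Stand-in for the vLLM-backed rollout actor."""
    def rollout(self, prompts):
        return [f"rollout for: {p}" for p in prompts]

@ray.remote
class RewardActor:
    """Stand-in for an environment-specific reward actor."""
    def score(self, responses):
        return [float(len(r)) for r in responses]  # placeholder reward

generator = Generator.remote()
reward = RewardActor.remote()

responses = ray.get(generator.rollout.remote(["2 + 2 = ?"]))
rewards = ray.get(reward.score.remote(responses))
```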

Container#

nvcr.io/nvidia/nemo-rl:v0.4.0.nemotron_3_nano

Open-Source Data#

Note: This recipe trains exclusively on the open-sourced subset of RL data. Results will differ from the tech report benchmarks, which used additional proprietary data.

Reference#