Stage 2: Reinforcement Learning (RL)#
This stage aligns the instruction-tuned model using GRPO (Group Relative Policy Optimization) with NeMo-RL.
Open-Source Data Only: This recipe uses exclusively open-sourced RL data from the Nemotron Post-training Datasets collection, which is a subset of the full data used to train the released model. The recipe uses the Nemotron-3-Nano-RL-Training-Blend dataset. Results will differ from the benchmarks in the tech report. Use this recipe as a reference implementation to apply the methodology with your own data.
Training Methodology#
Training Framework: RL alignment is implemented using NeMo-RL with Ray for distributed actor coordination and vLLM for fast rollout generation. The Megatron backend handles distributed policy training with tensor, pipeline, context, and expert parallelism. See NeMo-RL Documentation for implementation details.
For complete methodology, see Tech Report Section 3.2.
RL Pipeline Overview#
The RL pipeline consists of three components:
RLVR — Multi-environment training with verifiable rewards
RLHF with GenRM — Generative reward model-based alignment
DPO — Preference learning to reduce tool hallucination
Data Preparation Pipeline#
Before training, the RL dataset is transformed into JSONL format compatible with NeMo-Gym:
%%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333'}}}%%
flowchart LR
subgraph prep["Data Preparation"]
direction LR
hf["HuggingFace<br/>Dataset"] --> resolve["Placeholder<br/>Resolution"]
resolve --> jsonl["JSONL<br/>Format"]
jsonl --> split["Train/Val/Test<br/>Split"]
end
split --> gym["NeMo-Gym<br/>Environment"]
gym --> reward["Reward<br/>Computation"]
style hf fill:#e1f5fe,stroke:#2196f3
style resolve fill:#e1f5fe,stroke:#2196f3
style jsonl fill:#f3e5f5,stroke:#9c27b0
style split fill:#f3e5f5,stroke:#9c27b0
style gym fill:#e8f5e9,stroke:#4caf50
style reward fill:#e8f5e9,stroke:#4caf50
| Stage | What Happens |
|---|---|
| HuggingFace Dataset | Load Nemotron-3-Nano-RL-Training-Blend from HuggingFace Hub |
| Placeholder Resolution | Resolve placeholder records that reference external HuggingFace datasets |
| JSONL Format | Convert records to JSONL format |
| Train/Val/Test Split | Split into training (98%), validation (1%), and test (1%) sets |
| NeMo-Gym Environment | Route samples to appropriate reward environments based on task type |
| Reward Computation | Compute verifiable rewards (math correctness, code execution, schema adherence) |
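The 98/1/1 split described above can be sketched in a few lines. `split_dataset` is a hypothetical helper for illustration, not the actual function in data_prep.py:

```python
import random

def split_dataset(records, seed=42):
    """Shuffle and split records into ~98% train / 1% val / 1% test.

    Illustrative only; the real data_prep.py may differ in ordering
    and rounding behavior.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_val = max(1, n // 100)   # 1% validation
    n_test = max(1, n // 100)  # 1% test
    train = shuffled[: n - n_val - n_test]
    val = shuffled[n - n_val - n_test : n - n_test]
    test = shuffled[n - n_test :]
    return train, val, test
```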
Placeholder Resolution:
The Nemotron-3-Nano-RL-Training-Blend dataset contains placeholder records that reference external HuggingFace datasets. The data_prep.py script resolves these by:
- Detecting placeholder records by the presence of a `_hf_placeholder` field
- Fetching actual data from external HF datasets:
  - ByteDance-Seed/DAPO-Math-17k — Math reasoning problems
  - Skywork/Skywork-OR1-RL-Data — Open reasoning data
- Applying template restoration (DAPO prefix/suffix, Skywork `{question}` replacement)
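A rough sketch of this resolution logic, assuming illustrative field names and an in-memory `external_lookup` (the dataset's actual schema and fetch mechanism differ):

```python
def resolve_placeholders(records, external_lookup):
    """Replace placeholder records with rows fetched from external datasets.

    `external_lookup` maps a source dataset name to {row_id: row}.
    Field names here are hypothetical, not the real schema.
    """
    resolved = []
    for rec in records:
        if "_hf_placeholder" in rec:
            src = rec["_hf_placeholder"]["dataset"]
            row_id = rec["_hf_placeholder"]["id"]
            row = external_lookup[src][row_id]
            # Template restoration: re-apply the prompt wrapper
            # (e.g. a DAPO prefix/suffix around the question text).
            template = rec.get("template", "{question}")
            resolved.append({"prompt": template.replace("{question}", row["question"])})
        else:
            resolved.append(rec)
    return resolved
```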
For data preparation implementation, see Recipe Source:
src/nemotron/recipes/nano3/stage2_rl/data_prep.py
GRPO Algorithm#
GRPO (Group Relative Policy Optimization) optimizes the policy using group-relative advantages:
Generate responses from the current policy using vLLM
Evaluate responses using NeMo-Gym reward environments
Compute group-relative advantages across response groups per prompt
Update the policy to favor higher-reward responses with clipped gradients
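Step 3 amounts to normalizing each reward against its own group's statistics. A minimal sketch, assuming mean/std normalization (NeMo-RL's exact baseline, e.g. leave-one-out variants, may differ):

```python
def group_relative_advantages(rewards, eps=1e-6):
    """Compute group-relative advantages for one prompt's response group:
    subtract the group mean and divide by the group standard deviation."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Responses scoring above their group's mean get positive advantages and are reinforced; below-mean responses are suppressed, with no learned value function required.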
Loss Function:
The GRPO loss uses clipped policy gradients with KL regularization:

\[
\mathcal{L}(\theta) = -\,\mathbb{E}_t\!\left[\min\!\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\, A_t,\;\; \operatorname{clip}\!\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},\, 1-\varepsilon,\, 1+\varepsilon\right) A_t\right)\right] + \beta\, D_{\mathrm{KL}}\!\left[\pi_\theta \,\middle\|\, \pi_{\text{ref}}\right]
\]

Where:
\(\pi_\theta\) is the policy being optimized
\(\pi_{\theta_{\text{old}}}\) is the policy from the beginning of this step
\(A_t\) is the advantage estimate (group-relative)
\(\varepsilon\) is the clipping hyperparameter (0.2–0.28)
\(\beta\) is the KL penalty coefficient
\(\pi_{\text{ref}}\) is the reference policy (frozen SFT checkpoint)
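The clipped surrogate term can be written per token as follows. This is a didactic sketch using the asymmetric clipping bounds from the hyperparameter tables below, not NeMo-RL's implementation:

```python
import math

def grpo_token_loss(logp_new, logp_old, advantage, eps_low=0.2, eps_high=0.28):
    """Per-token clipped surrogate loss (negated objective).

    The importance ratio pi_theta / pi_theta_old is computed in log space,
    then clipped to [1 - eps_low, 1 + eps_high] before taking the
    pessimistic (min) of the clipped and unclipped terms.
    """
    ratio = math.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high) * advantage
    return -min(unclipped, clipped)
```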
Stability Improvements:

| Improvement | Description |
|---|---|
| On-Policy KL Approximation | Uses importance weights to correct for off-policy samples, providing an unbiased and guaranteed-positive KL estimator |
| Importance Sampling Correction | Corrects for discrepancies between inference (vLLM) and training (Megatron) token probabilities |
| Overlong Filtering | Excludes sequences that hit max length without EOS from loss computation, reducing noise from truncated generations |
| Asymmetric Clipping | Uses a lower clipping bound of 0.2 and a higher upper bound of 0.28 on the importance ratio |
For detailed loss function derivations, see the NeMo-RL GRPO Guide.
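Overlong filtering reduces to a per-sequence loss mask; a minimal sketch (token IDs and interface are illustrative):

```python
def overlong_filter_mask(sequences, eos_id, max_len):
    """Return a per-sequence mask: 0.0 for generations that reached
    max_len without emitting EOS (excluded from the loss), 1.0 otherwise."""
    mask = []
    for seq in sequences:
        truncated = len(seq) >= max_len and seq[-1] != eos_id
        mask.append(0.0 if truncated else 1.0)
    return mask
```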
Multi-Environment RLVR#
Training uses 6 reward environments through NeMo-Gym:
| Environment | Description | Reward Type |
|---|---|---|
| math_with_judge | Mathematical reasoning (DAPO, Skywork math) | Answer correctness verification |
| code_gen | Code correctness with test case execution | Unit test pass rate |
| mcqa | STEM multiple choice questions | Answer matching |
| instruction_following | IFEval, Multi-Challenge compliance | Constraint satisfaction |
| workplace_assistant | Agentic tool use, multi-turn interactions | Task completion |
| structured_outputs_json | JSON schema adherence | Schema validation |
Training on all environments simultaneously provides stable gains without degrading rewards on any individual environment.
For environment implementation details, see NeMo-RL Environments Guide.
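Routing a sample to its reward environment can be pictured as a task-type lookup. The environment names below match the table above, but the mapping keys and dispatch mechanism in NeMo-Gym are hypothetical simplifications:

```python
# Illustrative task-type -> environment mapping (keys are assumptions).
ENV_BY_TASK = {
    "math": "math_with_judge",
    "code": "code_gen",
    "stem_mcqa": "mcqa",
    "instruction_following": "instruction_following",
    "tool_use": "workplace_assistant",
    "structured_output": "structured_outputs_json",
}

def route_sample(sample):
    """Pick the reward environment for a sample based on its task type."""
    return ENV_BY_TASK[sample["task_type"]]
```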
GenRM (RLHF)#
Generative reward models use a circular comparison strategy (N comparisons per group instead of O(N²) pairwise comparisons) with length-normalized reward adjustment:

| Parameter | Value |
|---|---|
| Prompts per batch | 128 |
| Responses per prompt | 16 |
| Comparison strategy | Circular |
| Length bonus α | 0.5 |
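The circular pairing can be sketched as follows. `length_adjusted_reward` is an illustrative guess at the shape of a length normalization (with α matching the table); the actual GenRM adjustment is not specified here:

```python
def circular_pairs(n):
    """Pair each response with its successor, (i, (i+1) % n):
    N comparisons for N responses instead of N*(N-1)/2 pairwise ones."""
    return [(i, (i + 1) % n) for i in range(n)]

def length_adjusted_reward(reward, length, mean_length, alpha=0.5):
    """Illustrative normalization: discount rewards of responses that are
    longer than the group mean, scaled by alpha."""
    return reward - alpha * max(0.0, (length - mean_length) / mean_length)
```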
For GenRM training details, see Tech Report Section 3.2.
DPO for Tool Hallucination#
DPO reduces hallucinated tool usage with minimal computational overhead:

| Metric | Before DPO | After DPO |
|---|---|---|
| AIME25 Accuracy | 80.88% | 84.58% |
| Hallucination Rate | 8.33% | 0.7% |
For DPO methodology, see Tech Report Appendix C and NeMo-RL DPO Guide.
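For reference, the standard DPO objective applied to (chosen, rejected) pairs looks like this in code; β here is an illustrative value, not the recipe's setting:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * margin), where the margin is
    the policy's log-prob gain over the reference on the chosen response
    minus its gain on the rejected response."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```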
Reasoning Control#
The model supports:
Reasoning on/off control — Strip reasoning from 10% of samples
Token budget control — Truncate 3% of reasoning traces to different budgets
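These controls are typically implemented as data augmentation; a hypothetical sketch (field names and the candidate budgets are illustrative):

```python
import random

def apply_reasoning_controls(samples, rng=None, strip_frac=0.10, budget_frac=0.03):
    """Strip the reasoning trace from ~10% of samples and truncate it to a
    random token budget on a further ~3%. Purely illustrative."""
    rng = rng or random.Random(0)
    out = []
    for s in samples:
        s = dict(s)
        u = rng.random()
        if u < strip_frac:
            s["reasoning"] = ""            # reasoning-off example
        elif u < strip_frac + budget_frac:
            budget = rng.choice([256, 512, 1024])
            s["reasoning"] = s["reasoning"][:budget]  # budgeted example
        out.append(s)
    return out
```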
Hyperparameters#
GRPO Settings:
| Value | Description |
|---|---|
| 128 | Prompts sampled per training step |
| 16 | Rollouts generated per prompt |
| 49152 | Maximum sequence length (~49K tokens) |
| true | Normalize rewards across batch |
| true | Variance reduction for advantage estimation |
| 5 | Validation every N steps |
| 1 | Single epoch over data |
| 42 | Random seed for reproducibility |
Loss Function:
| Value | Description |
|---|---|
| 0.2 | Lower bound for importance ratio clipping |
| 0.28 | Upper bound for importance ratio clipping |
| true | Use unbiased on-policy KL estimator |
| true | Correct for inference/training mismatch |
| true | Per-token loss normalization |
| 0 | KL regularization weight (disabled) |
Optimizer:
| Setting | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 3e-6 |
| Minimum learning rate | 3e-6 |
| Weight decay | 0.0 |
| Adam β₁ | 0.9 |
| Adam β₂ | 0.999 |
| Adam ε | 1e-8 |
| Gradient clipping | 1.0 |
Sequence Packing:
| Setting | Value |
|---|---|
| Packing enabled | true |
| Packing algorithm | modified_first_fit_decreasing |
| | 64 |
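For intuition, plain first-fit-decreasing bin packing looks like this; the `modified_first_fit_decreasing` algorithm used by the recipe adds heuristics beyond this sketch:

```python
def first_fit_decreasing(lengths, capacity):
    """Pack sequence lengths into bins of a fixed token capacity:
    sort descending, then place each sequence into the first bin
    with enough remaining room, opening a new bin if none fits."""
    bins = []  # each bin: [remaining_capacity, [lengths...]]
    for length in sorted(lengths, reverse=True):
        for b in bins:
            if b[0] >= length:
                b[0] -= length
                b[1].append(length)
                break
        else:
            bins.append([capacity - length, [length]])
    return [b[1] for b in bins]
```

Packing variable-length rollouts this way keeps each training micro-batch close to the token capacity instead of padding every sequence to the maximum length.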
Recipe Execution#
Quick Start#
# 1. Prepare data (convert to JSONL format)
uv run nemotron nano3 data prep rl --run YOUR-CLUSTER
# 2. Run RL training
uv run nemotron nano3 rl --run YOUR-CLUSTER
Note: The `--run YOUR-CLUSTER` flag submits jobs via NeMo-Run. See Execution through NeMo-Run for setup.
Running in NeMo-RL Repository#
For direct execution using NeMo-RL (without the nemotron CLI wrapper), follow the NeMo-RL Nemotron 3 Nano Guide:
1. Download and prepare the dataset:
# Download the RL blend dataset
huggingface-cli download nvidia/Nemotron-3-Nano-RL-Training-Blend \
--repo-type dataset \
--local-dir /path/to/rl-blend
# Fill in placeholder entries (resolves DAPO, Skywork references)
python /path/to/rl-blend/create_nanov3_jsonl.py /path/to/rl-blend/data/train.jsonl
# Split into train/validation
head -n -1000 /path/to/rl-blend/data/train.jsonl > /path/to/train.jsonl
tail -n 1000 /path/to/rl-blend/data/train.jsonl > /path/to/validation.jsonl
2. Run GRPO training:
# From NeMo-RL repository root
uv run python examples/nemo_gym/run_grpo_nemo_gym.py \
--config examples/nemo_gym/grpo_nanov3.yaml \
data.train_jsonl_fpath=/path/to/train.jsonl \
data.validation_jsonl_fpath=/path/to/validation.jsonl \
policy.model_name=/path/to/sft/checkpoint \
logger.wandb_enabled=True
Note: The default recipe requires 32 nodes with 8 GPUs each. See the NeMo-RL cluster documentation for Slurm configuration.
Configuration#
| File | Purpose |
|---|---|
| | Production GRPO configuration |
| | Data preparation settings |
| | RL dataset blend |
Data Preparation#
The data_prep.py script converts datasets to JSONL format compatible with NeMo-RL’s NeMo-Gym interface. See Data Preparation Module for detailed documentation.
CLI Command#
uv run nemotron nano3 data prep rl [options]
| Option | Description |
|---|---|
| --run | Execute on Slurm via NeMo-Run |
| --sample N | Limit rows per dataset (for testing) |
| | Force re-run, ignoring cache |
Output#
output/nano3/stage2_rl/
├── train/
│ └── data.jsonl
├── val/
│ └── data.jsonl
├── test/
│ └── data.jsonl
└── manifest.json
The output is registered as a W&B Artifact (DataBlendsArtifact-rl) for lineage tracking.
Training#
CLI Command#
uv run nemotron nano3 rl [options] [overrides...]
| Option | Description |
|---|---|
| --run | Attached: submits and waits, streaming logs (NeMo-Run) |
| | Detached: submits and exits immediately (NeMo-Run) |
| | Preview execution plan |
| key=value | Override config values (CLI Framework) |
Override Examples#
# More training steps
uv run nemotron nano3 rl grpo.max_num_steps=200000
# Different temperature for generation
uv run nemotron nano3 rl policy.generation.temperature=0.8
# Different learning rate
uv run nemotron nano3 rl policy.megatron_cfg.optimizer.lr=5e-7
# Disable sequence packing
uv run nemotron nano3 rl policy.sequence_packing.enabled=false
Running with NeMo-Run#
Configure execution profiles in env.toml:
[wandb]
project = "nemotron"
entity = "YOUR-TEAM"
[YOUR-CLUSTER]
executor = "slurm"
account = "YOUR-ACCOUNT"
partition = "batch"
nodes = 32
ntasks_per_node = 8
gpus_per_node = 8
mem = "0"
exclusive = true
mounts = ["/lustre:/lustre"]
See Execution through NeMo-Run for complete configuration options.
Checkpoint & Resume#
Training automatically saves checkpoints based on validation reward. To resume from a checkpoint:
# Resume from a specific checkpoint
uv run nemotron nano3 rl policy.model_name=/path/to/checkpoint
# Resume from latest checkpoint in results directory
uv run nemotron nano3 rl checkpointing.checkpoint_dir=/path/to/results
Checkpoint Configuration:
| Value | Description |
|---|---|
| 10 | Steps between checkpoint saves |
| val:total_reward/mean | Metric for best checkpoint selection |
| true | Higher reward = better checkpoint |
| 1000000 | Number of checkpoints to retain |
Troubleshooting#
Common errors and solutions:
| Error | Cause | Solution |
|---|---|---|
| High `token_mult_prob_error` | Mismatch between vLLM and Megatron probabilities | Check weight refitting; ensure vLLM compilation settings match |
| KL divergence spikes | Single-token probability errors in MoE | Monitor the KL metrics and importance ratios |
| OOM during generation | vLLM memory allocation too high | Reduce the vLLM GPU memory fraction |
| Slow convergence | Learning rate too low or too high | Adjust the optimizer learning rate |
Debugging tips:
- Monitor `token_mult_prob_error` for inference/training consistency (should stay below ~2%)
- Watch `sampling_importance_ratio` (should hover around 1.0)
- Check `approx_entropy` for entropy collapse during training
- Use `--sample N` in data prep for quick iteration
Artifact Lineage#
%%{init: {'theme': 'base', 'themeVariables': { 'primaryBorderColor': '#333333', 'lineColor': '#333333', 'primaryTextColor': '#333333'}}}%%
flowchart TB
prev["ModelArtifact-sft<br/>(from Stage 1)"] --> train
rl["RL Datasets<br/>(HuggingFace)"] --> dp["data_prep.py"]
dp --> data["DataBlendsArtifact-rl<br/>(JSONL files)"]
data --> train["train.py<br/>(GRPO with NeMo-RL)"]
train --> model["ModelArtifact-rl<br/>(final aligned model)"]
style prev fill:#f3e5f5,stroke:#9c27b0
style rl fill:#e8f5e9,stroke:#4caf50
style dp fill:#e8f5e9,stroke:#4caf50
style data fill:#e8f5e9,stroke:#4caf50
style train fill:#e8f5e9,stroke:#4caf50
style model fill:#e8f5e9,stroke:#4caf50
Infrastructure#
This stage uses the following components from the NVIDIA AI Stack:
| Component | Role |
|---|---|
| NeMo-RL | GRPO algorithm, policy training, reward computation |
| Megatron-Core | Distributed training primitives (TP, PP, CP, EP) |
| Ray | Distributed actor coordination |
| vLLM | Fast rollout generation |
Parallelism Configuration#
Training uses multiple parallelism strategies for efficient scaling:
| Parallelism | Value |
|---|---|
| Tensor (TP) | 2 |
| Pipeline (PP) | 2 |
| Context (CP) | 4 |
| Expert (EP) | 8 |
| Sequence (SP) | Yes |
Generation (vLLM):
| Value | Description |
|---|---|
| 4 | TP for vLLM generation |
| 0.5 | GPU memory fraction for KV cache |
| true | Share GPUs with training |
| false | Use torch.compile |
Cluster:
| Setting | Value |
|---|---|
| Nodes | 32 |
| GPUs per node | 8 |
Key Features Used#
| Feature | Purpose |
|---|---|
| GRPO algorithm | Group Relative Policy Optimization with clipped gradients |
| Megatron backend | Distributed training with TP/PP/CP/EP parallelism |
| Sequence Packing | Efficient batch utilization for variable-length generations |
| vLLM Generation | Fast rollout with tensor parallelism |
| MoE Router Bias | Aux-loss-free load balancing |
| Per-token Loss | Consistent gradient signal |
NeMo-Gym Environments#
The training configuration includes these reward environment configs:
env:
nemo_gym:
config_paths:
- responses_api_models/vllm_model/configs/vllm_model_for_training.yaml
- resources_servers/math_with_judge/configs/math_with_judge.yaml
- resources_servers/code_gen/configs/code_gen.yaml
- resources_servers/workplace_assistant/configs/workplace_assistant.yaml
- resources_servers/mcqa/configs/mcqa.yaml
- resources_servers/instruction_following/configs/instruction_following.yaml
- resources_servers/structured_outputs/configs/structured_outputs_json.yaml
Architecture#
NeMo-RL uses a Ray-based actor model:
| Actor | Function |
|---|---|
| Policy Model | Trainable policy weights (Megatron backend) |
| Generator | vLLM-backed rollout generation (colocated) |
| Reward Environments | NeMo-Gym environments for reward computation |
| Reference Model | Frozen SFT checkpoint for KL divergence |
Container#
nvcr.io/nvidia/nemo-rl:v0.4.0.nemotron_3_nano
Next Steps#
After RL completes, the final aligned model (ModelArtifact-rl) is ready for evaluation and deployment.
Reference#
Tech Report Section 3.2 — RL methodology
NeMo-RL Documentation — GRPO, DPO, environments
NeMo-RL Nemotron 3 Nano Guide — Upstream training guide
NVIDIA AI Stack — NeMo-RL, Megatron-Core documentation
Artifact Lineage — W&B artifact system
Stage 0: Pretraining — Pretrain the base model
Stage 1: SFT — Instruction tuning
Recipe Source:
src/nemotron/recipes/nano3/stage2_rl/— Implementation details