

NeMo-RL: Journey of Optimizing Weight Transfer in Large MoE Models by 10x

Introduction

A typical reinforcement learning (RL) step involves updating the policy model with the generated data and transferring the updated weights to the generation model. In addition, training and generation have different memory requirements and therefore necessitate different parallelism schemes. The process of transferring the weights and updating the sharding is called "refit," sometimes also referred to as "resharding" in other RL frameworks. Figure 1 shows an example of the refit process for an MoE model, where the policy model and the generation model use different parallelism schemes.


Figure 1. An example refit process with different parallelism schemes in policy and generation.
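
To make the refit flow concrete, here is a minimal, hedged sketch of the idea (not NeMo-RL's actual implementation): each trained parameter is gathered from its training-side shards into a full tensor and handed to the generation engine, which re-shards it under its own parallelism scheme. The `generation_engine.load_weights` interface is an illustrative assumption, and the gather step assumes the training weights are PyTorch DTensors.

```python
import torch
import torch.distributed as dist
from torch.distributed.tensor import DTensor  # public DTensor namespace in recent PyTorch


def refit(policy_model: torch.nn.Module, generation_engine) -> None:
    """Gather training-side shards and push full weights to the generation engine.

    Illustrative sketch only; `generation_engine.load_weights` is an assumed
    interface, not a real NeMo-RL or inference-engine API.
    """
    updated = {}
    for name, param in policy_model.state_dict().items():
        if isinstance(param, DTensor):
            # Collect the shards produced by the training parallelism scheme
            # into a single full tensor, replicated on every rank.
            param = param.full_tensor()
        updated[name] = param.detach()
    # The generation side applies its own (different) sharding when loading.
    generation_engine.load_weights(updated)
    dist.barrier()  # ensure all ranks see the refreshed weights before generating
```

In practice the gather-and-broadcast step is where most of the refit time goes for large MoE models, which is why the optimizations discussed in this post focus on it.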

NeMo-RL V0.3: Scalable and Performant Post-training with NeMo-RL via Megatron-Core

The initial release of NeMo-RL included training support via PyTorch DTensor (also known as FSDP2). This backend allows for native integration with the Hugging Face ecosystem, quick experimentation, and scaling via PyTorch-native parallelisms (FSDP2, tensor parallel, sequence parallel, and context parallel). However, when model sizes approach hundreds of billions of parameters, the DTensor path becomes insufficient: the activation memory of such large models forces extensive activation recomputation, resulting in infeasibly slow step times. The DTensor path also lacks the optimized CUDA kernels and other performance enhancements needed for high throughput. These challenges call for a more efficient solution, which is precisely what NVIDIA's Megatron-Core library is designed to provide.
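
As a rough illustration of the PyTorch-native path described above, the sketch below composes FSDP2 sharding with tensor parallelism over a 2-D device mesh using the public DTensor APIs. This is not NeMo-RL code; the toy model, mesh shape, and parallelization plan are assumptions, and the script would be launched with `torchrun --nproc_per_node=8`.

```python
import os

import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard  # FSDP2 entry point (PyTorch >= 2.6)
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)


class Block(nn.Module):
    """Toy MLP block standing in for a transformer layer."""

    def __init__(self, d: int = 1024):
        super().__init__()
        self.up_proj = nn.Linear(d, 4 * d, bias=False)
        self.down_proj = nn.Linear(4 * d, d, bias=False)

    def forward(self, x):
        return x + self.down_proj(torch.relu(self.up_proj(x)))


torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# 2-D mesh: outer dim for FSDP2 data-parallel sharding, inner dim for tensor parallel
mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))

model = nn.Sequential(*[Block() for _ in range(4)]).cuda()

for block in model:
    # Split up_proj column-wise and down_proj row-wise across the "tp" mesh dim
    parallelize_module(
        block,
        mesh["tp"],
        {"up_proj": ColwiseParallel(), "down_proj": RowwiseParallel()},
    )
    # Shard each block's parameters across the "dp" mesh dim (FSDP2)
    fully_shard(block, mesh=mesh["dp"])
fully_shard(model, mesh=mesh["dp"])  # root wrap so any remaining params are sharded too
```

With this composition, every parameter ends up as a DTensor sharded across both mesh dimensions, which is also what makes the refit gather shown earlier straightforward to express.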

Reinforcement Learning with NVIDIA NeMo-RL: Reproducing a DeepScaleR Recipe Using GRPO

Reinforcement learning (RL) is the backbone of interactive AI. It is fundamental to teaching agents to reason and learn from human preferences, to enabling multi-turn tool use, and much more. This post introduces NVIDIA NeMo-RL, a new open-source post-training library built to support everything from single-GPU prototypes to models trained on thousands of GPUs, and to orchestrate multi-component RL pipelines with ease.