NeMo-RL: Journey of Optimizing Weight Transfer in Large MoE Models by 10x
Introduction
A typical Reinforcement Learning (RL) step involves updating the policy model using the generated data and then transferring the updated weights to the generation model. In addition, training and generation have different memory requirements and therefore require different parallelism schemes. The process of transferring the weights and updating their sharding is called "refit," sometimes also referred to as "resharding" in other RL frameworks. Figure 1 shows an example of the refit process for an MoE model, where the policy model and the generation model use different parallelism schemes.
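To make the refit idea concrete, below is a minimal single-process sketch (not NeMo-RL code) of what resharding means at the level of a single parameter: weight shards produced under the training tensor-parallel (TP) layout are gathered and re-split for a generation engine with a different TP degree. The `reshard` helper and the TP degrees are illustrative assumptions; in a real system the gather and re-split happen across devices.

```python
# Illustrative sketch of refit/resharding for one parameter.
# Assumption: training uses TP=4, generation uses TP=2; names are hypothetical.
import torch

def reshard(shards: list[torch.Tensor], new_tp: int, dim: int = 0) -> list[torch.Tensor]:
    """Gather shards of one parameter and re-split them for the new TP degree."""
    full = torch.cat(shards, dim=dim)                 # stands in for an all-gather
    return list(torch.chunk(full, new_tp, dim=dim))   # re-split for the generation layout

train_tp, gen_tp = 4, 2
full_weight = torch.randn(8, 16)
train_shards = list(torch.chunk(full_weight, train_tp, dim=0))

gen_shards = reshard(train_shards, gen_tp, dim=0)
assert torch.equal(torch.cat(gen_shards, dim=0), full_weight)
print([s.shape for s in gen_shards])  # [torch.Size([4, 16]), torch.Size([4, 16])]
```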