

NeMo-RL V0.3: Scalable and Performant Post-training with Nemo-RL via Megatron-Core

The initial release of NeMo-RL included training support through the PyTorch DTensor path (also known as FSDP2). This backend allows for native integration with the Hugging Face ecosystem, quick experimentation, and scaling via PyTorch-native parallelisms (FSDP2, tensor parallelism, sequence parallelism, and context parallelism). However, when model sizes approach hundreds of billions of parameters, the DTensor path becomes insufficient. The activation memory required by models at this scale forces extensive recomputation, resulting in infeasibly slow step times. Furthermore, the DTensor path lacks the optimized CUDA kernels and other performance enhancements needed for peak throughput. These challenges highlight the need for a more efficient solution, which is precisely what NVIDIA's Megatron-Core library is designed to provide.
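
To make the DTensor path concrete, the following is a minimal sketch of how a Hugging Face causal LM might be sharded with PyTorch's FSDP2 `fully_shard` API over a device mesh. It assumes PyTorch 2.6 or newer (where `fully_shard` is exported from `torch.distributed.fsdp`) and a Llama-style layer layout; it illustrates the general approach, not NeMo-RL's actual setup code.

```python
# Illustrative sketch only: shard a Hugging Face model with FSDP2 (DTensor-based).
# Assumes PyTorch >= 2.6 and a Llama-style architecture (model.model.layers);
# not NeMo-RL's actual initialization code.
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard
from transformers import AutoModelForCausalLM


def shard_policy_model(model_name: str, world_size: int) -> torch.nn.Module:
    # One-dimensional mesh: pure data-parallel sharding across all ranks.
    mesh = init_device_mesh("cuda", (world_size,), mesh_dim_names=("dp",))

    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.bfloat16
    )

    # Shard each transformer block, then the root module, so parameters are
    # stored as DTensors partitioned across the data-parallel mesh.
    for block in model.model.layers:
        fully_shard(block, mesh=mesh)
    fully_shard(model, mesh=mesh)
    return model
```

Sharding an off-the-shelf Hugging Face module this way is what makes the DTensor backend convenient for quick experimentation, but it also means training performance is bounded by the kernels and memory behavior of the original model implementation.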