
Blog

Welcome to the NVIDIA NeMo blog! Here you'll find the latest insights, tutorials, and updates about the NeMo framework, including large language models, reinforcement learning, and cutting-edge AI research.

Recent Posts

Our blog automatically displays the latest posts below. Make sure to check back regularly for new content!

Guide to Fine-tuning NVIDIA NeMo Models with Granary Data

Introduction

The Granary dataset stands out as one of the largest and most diverse open-source collections of European speech data available today. Designed to advance research and development in automatic speech recognition (ASR) and automatic speech translation (AST), Granary provides approximately 643,000 hours of audio paired with transcripts for ASR, and around 351,000 hours of aligned translation pairs. Its recordings are sourced from a variety of Creative Commons corpora—such as YODAS2, YouTube-Commons, VoxPopuli, and Libri-Light—and each sample is carefully reviewed to ensure that only clear, high-quality audio and accurate transcripts are included. Because the dataset includes consistent segment boundaries and normalized text across more than twenty-five languages (including Italian), it eliminates much of the preprocessing burden and allows you to focus on model development or evaluation.
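Because Granary is distributed through the Hugging Face Hub, a quick way to inspect it before committing to a full download is the `datasets` library in streaming mode. The sketch below is illustrative only: the repo id `nvidia/Granary`, the per-language config name, and the column names are assumptions, so check the dataset card for the exact layout.

```python
# Minimal sketch: stream a few Granary samples from the Hugging Face Hub.
# Repo id, config name, and column names are assumptions -- verify them
# against the dataset card before use.
from datasets import load_dataset

granary_it = load_dataset(
    "nvidia/Granary",  # assumed Hub repo id
    "it",              # assumed per-language config (Italian)
    split="train",
    streaming=True,    # avoid downloading hundreds of thousands of hours
)

for sample in granary_it.take(3):
    print(sample.keys())  # inspect the audio/transcript columns
```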

NeMo-RL: Journey of Optimizing Weight Transfer in Large MoE Models by 10x

Introduction

A typical Reinforcement Learning (RL) step involves updating the policy model on the generated data and transferring the updated weights to the generation model. In addition, training and generation have different memory requirements and hence necessitate different parallelism schemes. The process of transferring weights and updating the sharding is called “refit”, sometimes also referred to as “resharding” in other RL frameworks. Figure 1 shows an example of the refit process for an MoE model, where the policy model and the generation model use different parallelism schemes; a simplified code sketch follows the figure.


Figure 1. An example refit process with different parallelism schemes in policy and generation.
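To make refit concrete, here is a deliberately simplified, single-process sketch of what it does logically: gather shards out of the training parallelism layout into a full weight, then re-chunk that weight into the generation layout. The real implementation streams shards between distributed workers rather than materializing full tensors like this; the shapes and tensor-parallel degrees below are purely illustrative.

```python
# Conceptual, single-process sketch of "refit"/resharding. NeMo-RL's actual
# refit transfers shards between distributed processes; this only shows the
# layout change.
import torch

weight = torch.randn(8, 16)  # a full (unsharded) linear-layer weight

# Training side: tensor-parallel degree 4, sharded along the output dim.
train_shards = list(weight.chunk(4, dim=0))

# Refit step 1: gather the training shards back into the full weight.
full_weight = torch.cat(train_shards, dim=0)

# Refit step 2: reshard for the generation engine, which here uses
# tensor-parallel degree 2 along the input dim instead.
gen_shards = list(full_weight.chunk(2, dim=1))

assert torch.equal(torch.cat(gen_shards, dim=1), weight)
print([s.shape for s in gen_shards])  # [torch.Size([8, 8]), torch.Size([8, 8])]
```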

🚀 NeMo Framework Now Supports Google Gemma 3n: Efficient Multimodal Fine-tuning Made Simple

Introduction

Gemma 3n is a generative AI model that accepts inputs from a variety of modalities, including images and audio, and is optimized for efficient resource usage and fast inference on everyday devices. It introduces innovations such as Per-Layer Embedding (PLE) parameter caching and the MatFormer architecture, which reduce compute and memory demands, making it ideal for lightweight deployments.
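Because Gemma 3n checkpoints are published on the Hugging Face Hub, one quick way to try the model before fine-tuning it in NeMo is the Transformers multimodal pipeline. A minimal sketch, assuming the `google/gemma-3n-E2B-it` checkpoint id and a recent Transformers release with Gemma 3n support; swap in the variant that fits your memory budget.

```python
# Minimal sketch: querying a Gemma 3n checkpoint with the Transformers
# image-text-to-text pipeline. The checkpoint id is an assumption.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3n-E2B-it",  # assumed checkpoint id
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

print(pipe(text=messages, max_new_tokens=64))
```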

Fine-tune Hugging Face Models Instantly with Day-0 Support Using NVIDIA NeMo AutoModel


As organizations strive to maximize the value of their generative AI investments, access to the latest model developments is crucial for continued success. By adopting state-of-the-art models on Day-0, teams can harness these innovations efficiently, maintain relevance, and stay competitive.

The past year has seen a flurry of exciting model releases in the open-source community, including Meta Llama; Google Gemma; Mistral's Codestral, Codestral Mamba, Mistral Large 2, and Mixtral; Qwen 2, 2.5, and 3; DeepSeek-R1; NVIDIA Nemotron; and NVIDIA Llama Nemotron. These models are often made available on the Hugging Face Hub, providing the broader community with easy access.

Shortly after release, many users focus on evaluating model capabilities and exploring potential applications. Fine-tuning for specific use cases often becomes a key priority to gain an understanding of the models' potential and to identify opportunities for innovation.

The NVIDIA NeMo Framework uses the NVIDIA Megatron-Core and Transformer Engine (TE) backends to achieve high throughput and Model FLOPs Utilization (MFU) on thousands of NVIDIA GPUs. However, integrating a new model architecture into NeMo requires a multi-stage model conversion using Megatron-Core primitives, followed by validation of several phases, including supervised and parameter-efficient fine-tuning, model evaluation, and Hugging Face-to-NeMo conversion. This introduces a delay between a model's release and the availability of optimal training and post-training recipes.

To ensure Day-0 support for the latest models, the NeMo Framework introduces the Automatic Model (AutoModel) feature.
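As a first orientation, the sketch below shows the shape of the AutoModel workflow: wrap a Hub checkpoint directly and fine-tune it without any NeMo-format conversion. Class and argument names such as `HFAutoModelForCausalLM` and `HFDatasetDataModule` reflect the NeMo 2.0 API as we understand it and may differ across versions, so treat this as illustrative and consult the NeMo documentation.

```python
# Illustrative AutoModel sketch -- API names are assumptions; check the NeMo
# docs for the exact entry points in your version.
from nemo import lightning as nl
from nemo.collections import llm

# Wrap a Hugging Face Hub checkpoint directly; no conversion step needed.
model = llm.HFAutoModelForCausalLM(model_name="meta-llama/Llama-3.2-1B")

llm.api.finetune(
    model=model,
    data=llm.HFDatasetDataModule("rajpurkar/squad", split="train"),  # assumed helper
    trainer=nl.Trainer(devices=1, max_steps=100, accelerator="gpu"),
    peft=llm.peft.LoRA(),  # optional parameter-efficient fine-tuning
)
```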

NeMo-RL V0.3: Scalable and Performant Post-training with NeMo-RL via Megatron-Core

The initial release of NeMo-RL included training support via PyTorch DTensor (also known as FSDP2). This backend allows native integration with the Hugging Face ecosystem, quick experimentation, and scaling via PyTorch-native parallelisms (FSDP2, tensor parallel, sequence parallel, and context parallel). However, as model sizes approach hundreds of billions of parameters, the DTensor path becomes insufficient: activation memory pressure forces significant recomputation overhead, resulting in infeasibly slow step times. Furthermore, the DTensor path lacks the optimized CUDA kernels and other performance enhancements needed for optimal throughput. These challenges call for a more efficient solution, which is precisely what NVIDIA's Megatron-Core library is designed to provide.
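For reference, the DTensor path builds on PyTorch's native FSDP2 API, in which parameters become DTensors sharded over a device mesh. A minimal sketch, assuming PyTorch 2.6+ (where `fully_shard` is public) and a hypothetical 8-GPU data-parallel run launched with torchrun; NeMo-RL composes this with tensor, sequence, and context parallelism on larger meshes.

```python
# Minimal FSDP2/DTensor sketch: shard a Hugging Face model over a device mesh.
# Mesh size and model id are illustrative; launch with torchrun so the
# process group matches the mesh.
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard  # public as of PyTorch 2.6
from transformers import AutoModelForCausalLM

mesh = init_device_mesh("cuda", (8,), mesh_dim_names=("dp",))  # 8-way data parallel

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

# Shard each transformer block, then the root module; parameters become DTensors.
for layer in model.model.layers:
    fully_shard(layer, mesh=mesh)
fully_shard(model, mesh=mesh)
```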