Nemotron-Nano2-VL Notebooks#

A collection of notebooks demonstrating the capabilities of NVIDIA Nemotron Nano 2 VL, a 12B parameter model that unifies visual and textual understanding for advanced multimodal agentic workflows.

Overview#

These notebooks show how to use NVIDIA Nemotron Nano 2 VL to build applications that can see, read, and reason across diverse media. The model can extract, understand, and act on information from text, images, and videos, making it a powerful tool for next-generation AI agents.

Models#

VLM (NIM): nvidia/nemotron-nano-2-vl (Available soon on NVIDIA AI Endpoints)
VLM (Hugging Face): nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-FP8 (link)
VLM (Hugging Face): nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16 (link)

Key Features#

Agentic Multimodal Reasoning: Unifies visual and textual understanding to extract, reason, and act on information.
Versatile Inputs: Natively handles text prompts, image URLs, and video URLs in a single request.
Controllable Reasoning: Use the /think system prompt to enable detailed reasoning steps and /no_think for direct answers.
Multi-Image Understanding: Capable of reasoning across multiple images, such as different pages of a PDF, to answer complex questions.
Advanced Video Analysis: Performs dense captioning and summarization of video content.
Efficient Video Sampling (EVS): Automatically prunes redundant video frames to enable efficient long-context reasoning.
Hybrid Mamba-Transformer Architecture: Delivers high accuracy with superior throughput and lower latency.

Requirements#

NVIDIA API key (get one here)
GPU recommended for local deployment (e.g., single H100)