# Nemotron-Nano2-VL Notebooks
A collection of notebooks demonstrating the capabilities of NVIDIA Nemotron Nano 2 VL, a 12B-parameter model that unifies visual and textual understanding for advanced multimodal agentic workflows.
## Overview
These notebooks show how to use NVIDIA Nemotron Nano 2 VL to build applications that can see, read, and reason across diverse media. The model can extract, understand, and act on information from text, images, and videos, making it a powerful tool for next-generation AI agents.
## Models
- VLM (NIM): `nvidia/nemotron-nano-2-vl` (available soon on NVIDIA AI Endpoints)
- VLM (Hugging Face): `nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-FP8` (link)
- VLM (Hugging Face): `nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16` (link)
## Key Features
- Agentic Multimodal Reasoning: Unifies visual and textual understanding to extract, reason, and act on information.
- Versatile Inputs: Natively handles text prompts, image URLs, and video URLs in a single request.
- Controllable Reasoning: Use the `/think` system prompt to enable detailed reasoning steps and `/no_think` for direct answers.
- Multi-Image Understanding: Capable of reasoning across multiple images, such as different pages of a PDF, to answer complex questions.
- Advanced Video Analysis: Performs dense captioning and summarization of video content.
- Efficient Video Sampling (EVS): Automatically prunes redundant video frames to enable efficient long-context reasoning.
- Hybrid Mamba-Transformer Architecture: Delivers high accuracy with superior throughput and lower latency.
## Requirements
- NVIDIA API key (get one here)
- GPU recommended for local deployment (e.g., a single H100)
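A typical setup step before running the notebooks is exporting the API key as an environment variable. The variable name `NVIDIA_API_KEY` is an assumption; check each notebook for the exact name it reads.

```shell
# Env var name is an assumption; the value below is a placeholder.
export NVIDIA_API_KEY="nvapi-your-key-here"
echo "Key set: ${NVIDIA_API_KEY:0:5}..."
```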