RAG Agent with Nemotron RAG Models

A production-ready RAG (Retrieval-Augmented Generation) agent that demonstrates a hybrid approach: local Hugging Face models handle embedding and reranking, while NVIDIA AI Endpoints serve LLM inference.

Overview

This notebook builds an IT help desk support agent that answers questions from an internal knowledge base by retrieving relevant passages, reranking them, and generating a response with an LLM.

Models Used

Embedding: nvidia/llama-3.2-nv-embedqa-1b-v2 (Hugging Face)
Reranking: nvidia/llama-3.2-nv-rerankqa-1b-v2 (Hugging Face)
LLM: nvidia/nvidia-nemotron-nano-9b-v2 (NVIDIA AI Endpoints)
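
The embedding and reranking models split retrieval into two stages: a fast similarity search that narrows the knowledge base to a few candidates, then a more careful scoring pass over those candidates. The sketch below illustrates only that control flow. It is a toy under stated assumptions: `embed` and `rerank_score` are hypothetical stand-ins (bag-of-words counts and term overlap), not the Nemotron models, and a brute-force cosine search takes the place of FAISS.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: bag-of-words term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rerank_score(query, doc):
    """Toy stand-in for a reranker: fraction of query terms found in the doc."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


def retrieve_and_rerank(query, docs, k_retrieve=3, k_final=1):
    """Stage 1: keep the k_retrieve most similar docs. Stage 2: rerank them."""
    q_vec = embed(query)
    candidates = sorted(
        docs, key=lambda d: cosine(q_vec, embed(d)), reverse=True
    )[:k_retrieve]
    return sorted(
        candidates, key=lambda d: rerank_score(query, d), reverse=True
    )[:k_final]


# Hypothetical knowledge-base entries, for illustration only.
docs = [
    "reset your password from the self-service portal",
    "connect to the office vpn using the company client",
    "request a new laptop through the hardware catalog",
]
top = retrieve_and_rerank("how do I reset my password", docs)
```

In the notebook's setup, the first stage would correspond to a FAISS search over vectors from the embedding model and the second to scores from the reranking model. The two-stage shape stays the same.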

Key Features

🏠 Local Models for embedding and reranking (privacy, performance, cost-effectiveness)
☁️ NVIDIA AI Endpoints for LLM (managed service, latest models)
🤖 LangGraph ReAct Agent with tool integration
🔍 Advanced Retrieval with FAISS vector search and reranking
⚡ GPU Acceleration for local models (CPU fallback supported)
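
The GPU-with-CPU-fallback behavior can be sketched as a small device-selection helper. This is a minimal illustration, not necessarily how the notebook does it: `pick_device` is a hypothetical name, and returning `"cpu"` when PyTorch is not installed is an assumption made only so the snippet runs anywhere.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Assumption: without PyTorch installed, report CPU so the
        # snippet still runs; the local models themselves need torch.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
```

The resulting string can then be passed wherever a local model is placed on a device, for example `model.to(device)` on a PyTorch module.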

Requirements

Python 3.8+
NVIDIA API key (get one here)
GPU recommended (falls back to CPU if unavailable)
Required packages: transformers, langchain, langgraph, langchain-nvidia-ai-endpoints, faiss-cpu, torch
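
Before running the notebook, it may help to confirm the requirements above in one pass. The following is a sketch under stated assumptions: `check_environment` and `missing_modules` are hypothetical helpers, the import names are an assumed mapping from the package names (`faiss-cpu` is imported as `faiss`), and the API key is assumed to be supplied through the `NVIDIA_API_KEY` environment variable.

```python
import importlib.util
import os
import sys

# Assumed import names for the packages listed above.
REQUIRED_MODULES = [
    "transformers",
    "langchain",
    "langgraph",
    "langchain_nvidia_ai_endpoints",
    "faiss",  # installed by the faiss-cpu package
    "torch",
]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def check_environment():
    """Collect human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < (3, 8):
        problems.append("Python 3.8+ is required")
    if not os.environ.get("NVIDIA_API_KEY"):
        problems.append("NVIDIA_API_KEY is not set")
    problems += [f"missing package: {m}" for m in missing_modules(REQUIRED_MODULES)]
    return problems
```

Calling `check_environment()` returns a list of problems to fix. The check only looks for importable modules and a non-empty key; it does not validate the key or verify package versions.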