RAG Agent with Nemotron RAG Models#
A production-ready RAG (Retrieval-Augmented Generation) agent demonstrating a hybrid approach using local Hugging Face models for embeddings/reranking and NVIDIA AI Endpoints for LLM inference.
Overview#
This notebook builds an IT help desk support agent that can answer questions from an internal knowledge base using state-of-the-art retrieval and generation techniques.
Models Used#
- **Embedding:** `nvidia/llama-3.2-nv-embedqa-1b-v2` (Hugging Face)
- **Reranking:** `nvidia/llama-3.2-nv-rerankqa-1b-v2` (Hugging Face)
- **LLM:** `nvidia/nvidia-nemotron-nano-9b-v2` (NVIDIA AI Endpoints)
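The retrieval half of this pipeline (embed the corpus, score a query against it, return the top hits) can be sketched end to end without downloading the models. The bag-of-words `embed` below is a toy stand-in for the nv-embedqa model, and the plain inner-product search stands in for FAISS; both names and documents are illustrative, not taken from the notebook:

```python
import numpy as np

# Tiny illustrative help-desk corpus (not from the notebook).
DOCS = [
    "To reset your VPN password, open the IT portal and choose Reset credentials.",
    "Printers on floor 3 use the print-corp queue; install drivers from the portal.",
    "New laptops are imaged with the standard corporate build within two days.",
]

# Toy stand-in for nvidia/llama-3.2-nv-embedqa-1b-v2: a normalized
# bag-of-words vector over a shared vocabulary. The real notebook would
# call the Hugging Face model instead.
VOCAB = sorted({w.lower().strip(".,;?") for d in DOCS for w in d.split()})

def embed(text):
    words = [w.lower().strip(".,;?") for w in text.split()]
    v = np.array([words.count(t) for t in VOCAB], dtype=float)
    n = np.linalg.norm(v)
    return v / n if n else v

DOC_VECS = np.stack([embed(d) for d in DOCS])

def retrieve(query, k=2):
    # Inner product against every document, as FAISS IndexFlatIP
    # would do at scale; the reranker would then re-score these top-k.
    scores = DOC_VECS @ embed(query)
    return [DOCS[i] for i in np.argsort(-scores)[:k]]

top = retrieve("How do I reset my VPN password?")
print(top[0])
```

With a real embedding model the same two calls (`embed` on the corpus once, then per query) are all that changes; the search and reranking structure stays identical.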
Key Features#
- 🏠 **Local Models** for embedding and reranking (privacy, performance, cost savings)
- ☁️ **NVIDIA AI Endpoints** for the LLM (managed service, latest models)
- 🤖 **LangGraph ReAct Agent** with tool integration
- 🔍 **Advanced Retrieval** with FAISS vector search and reranking
- ⚡ **GPU Acceleration** for local models (CPU fallback supported)
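The ReAct loop that LangGraph's prebuilt agent implements (reason, call a tool, observe, answer) can be sketched without the framework. Here `search_kb` is a hypothetical tool standing in for the notebook's retriever, and `fake_llm` is a scripted stand-in for the Nemotron model; only the control flow is the point:

```python
# Hypothetical tool; the notebook's real tool would wrap the FAISS retriever.
def search_kb(query):
    return "To reset your VPN password, use the IT portal."

TOOLS = {"search_kb": search_kb}

# Scripted stand-in for the LLM: it first requests a tool call, then,
# once a tool observation is in the transcript, produces a final answer.
def fake_llm(messages):
    if not any(m["role"] == "tool" for m in messages):
        return {"role": "assistant", "tool_call": ("search_kb", messages[-1]["content"])}
    obs = next(m["content"] for m in messages if m["role"] == "tool")
    return {"role": "assistant", "content": "Based on the knowledge base: " + obs}

def react_agent(question, max_steps=3):
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        reply = fake_llm(messages)
        if "tool_call" not in reply:      # no tool requested -> final answer
            return reply["content"]
        name, arg = reply["tool_call"]    # act: run the requested tool
        messages.append(reply)
        messages.append({"role": "tool", "content": TOOLS[name](arg)})
    return "Step limit reached."

print(react_agent("How do I reset my VPN password?"))
```

In the notebook itself, LangGraph's `create_react_agent` replaces this loop and the NVIDIA AI Endpoints chat model replaces `fake_llm`; the tool-call/observe cycle is the same.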
Requirements#
- Python 3.8+
- NVIDIA API key
- GPU recommended (falls back to CPU if unavailable)
- Required packages: `transformers`, `langchain`, `langgraph`, `langchain-nvidia-ai-endpoints`, `faiss-cpu`, `torch`
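The packages above can be installed in one step; the API-key environment variable shown is the one `langchain-nvidia-ai-endpoints` reads, and the key value is a placeholder:

```shell
pip install transformers langchain langgraph langchain-nvidia-ai-endpoints faiss-cpu torch

# Placeholder key; substitute your own.
export NVIDIA_API_KEY="nvapi-..."
```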