# Check Harmful Content with Llama 3.1 Nemotron Safety Guard 8B V3 NIM
Learn how to add input and output guardrails that detect harmful content in multiple languages using Llama 3.1 Nemotron Safety Guard 8B V3 NIM.
By following this tutorial, you learn how to:
- Deploy the Llama 3.1 Nemotron Safety Guard 8B V3 NIM microservice to your local machine.
- Configure content safety guardrails on a main LLM. This tutorial uses Llama 3.3 70B Instruct on build.nvidia.com as the main LLM.
- Verify the guardrails with safe and unsafe requests in various languages.
## Prerequisites
- The NeMo Guardrails library installed.
- A personal NVIDIA NGC API key with access to the NVIDIA NGC Catalog and NVIDIA Public API Endpoints services. For more information, refer to NGC API Keys in the NVIDIA GPU Cloud documentation. A sketch for exporting the key follows this list.
- Docker installed.
- The NVIDIA Container Toolkit installed.
- The LangChain integration package installed.
- The remaining software requirements for the Llama 3.1 Nemotron Safety Guard 8B V3 NIM.
- GPUs that meet the memory requirements specified in the NVIDIA Llama 3.1 Nemotron Safety Guard 8B NIM Model Profiles.
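Because the main LLM in this tutorial is served from build.nvidia.com, the NGC API key must be visible to the process that runs NeMo Guardrails. The following is a minimal sketch; it assumes the `nvidia_ai_endpoints` engine reads the key from the `NVIDIA_API_KEY` environment variable, so adjust the variable name if your setup differs.

```python
import os

# Make the NGC personal API key available before loading the guardrails
# configuration. Assumption: the nvidia_ai_endpoints engine looks for the
# key in the NVIDIA_API_KEY environment variable.
os.environ["NVIDIA_API_KEY"] = "nvapi-..."  # replace with your key
```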
## Deploy the Llama 3.1 Nemotron Safety Guard 8B V3 NIM Microservice
Follow the getting started guide to deploy the Llama 3.1 Nemotron Safety Guard 8B V3 NIM microservice.
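After the container starts, you can optionally confirm that the microservice is reachable before configuring guardrails. The sketch below assumes the NIM exposes its OpenAI-compatible API at `http://localhost:8000/v1` (the same `base_url` used in the guardrails configuration later in this tutorial) and that the `requests` package is installed.

```python
import requests

# Optional readiness check: list the models served by the local NIM.
# Assumes the OpenAI-compatible endpoint at http://localhost:8000/v1.
response = requests.get("http://localhost:8000/v1/models", timeout=10)
response.raise_for_status()
served = [model["id"] for model in response.json().get("data", [])]
print(served)  # expect "nvidia/llama-3.1-nemotron-safety-guard-8b-v3" to be listed
```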
## Configure Guardrails
Follow these steps to prepare the guardrails configuration.
Create a configuration directory:
```shell
mkdir config
```

Save the following as `config/config.yml`:

```yaml
models:
  - type: main
    engine: nvidia_ai_endpoints
    model: meta/llama-3.3-70b-instruct

  - type: "content_safety"
    engine: nim
    parameters:
      base_url: "http://localhost:8000/v1"
      model_name: "nvidia/llama-3.1-nemotron-safety-guard-8b-v3"

rails:
  input:
    flows:
      - content safety check input $model=content_safety
  output:
    flows:
      - content safety check output $model=content_safety
```
Save the following as `config/prompts.yml`:

```yaml
prompts:
  - task: content_safety_check_input $model=content_safety
    content: |
      Task: Check if there is unsafe content in the user or agent messages in the below conversation in accordance with our safety policy with the below categories. Provide a 'safe' or 'unsafe' rating for both the user message and the agent response. If there is unsafe content, please also provide a list of violated categories according to our safety policy below.

      <BEGIN UNSAFE CONTENT CATEGORIES>
      S1: Violence.
      S2: Sexual.
      S3: Criminal Planning/Confessions.
      S4: Guns and Illegal Weapons.
      S5: Controlled/Regulated Substances.
      S6: Suicide and Self Harm.
      S7: Sexual (minor).
      S8: Hate/Identity Hate.
      S9: PII/Privacy.
      S10: Harassment.
      S11: Threat.
      S12: Profanity.
      S13: Needs Caution.
      S14: Other.
      S15: Manipulation.
      S16: Fraud/Deception.
      S17: Malware.
      S18: High Risk Gov Decision Making.
      S19: Political/Misinformation/Conspiracy.
      S20: Copyright/Trademark/Plagiarism.
      S21: Unauthorized Advice.
      S22: Illegal Activity.
      S23: Immoral/Unethical.
      <END UNSAFE CONTENT CATEGORIES>

      <BEGIN CONVERSATION>

      user: {{ user_input }}

      <END CONVERSATION>

      Provide your safety assessment for the above conversation in the below output JSON format:
      {"User Safety": safe or unsafe rating for the user message, "Response Safety": safe or unsafe rating for the agent response. Omit if no agent response present. "Safety Categories": a comma-separated list of applicable safety categories from the provided taxonomy. Omit if all safe.}

      Do not include anything other than the output JSON in your response.
      Output JSON:
    output_parser: nemoguard_parse_prompt_safety
    max_tokens: 50

  - task: content_safety_check_output $model=content_safety
    content: |
      Task: Check if there is unsafe content in the user or agent messages in the below conversation in accordance with our safety policy with the below categories. Provide a 'safe' or 'unsafe' rating for both the user message and the agent response. If there is unsafe content, please also provide a list of violated categories according to our safety policy below.

      <BEGIN UNSAFE CONTENT CATEGORIES>
      S1: Violence.
      S2: Sexual.
      S3: Criminal Planning/Confessions.
      S4: Guns and Illegal Weapons.
      S5: Controlled/Regulated Substances.
      S6: Suicide and Self Harm.
      S7: Sexual (minor).
      S8: Hate/Identity Hate.
      S9: PII/Privacy.
      S10: Harassment.
      S11: Threat.
      S12: Profanity.
      S13: Needs Caution.
      S14: Other.
      S15: Manipulation.
      S16: Fraud/Deception.
      S17: Malware.
      S18: High Risk Gov Decision Making.
      S19: Political/Misinformation/Conspiracy.
      S20: Copyright/Trademark/Plagiarism.
      S21: Unauthorized Advice.
      S22: Illegal Activity.
      S23: Immoral/Unethical.
      <END UNSAFE CONTENT CATEGORIES>

      <BEGIN CONVERSATION>

      user: {{ user_input }}

      response: agent: {{ bot_response }}

      <END CONVERSATION>

      Provide your safety assessment for the above conversation in the below output JSON format:
      {"User Safety": safe or unsafe rating for the user message, "Response Safety": safe or unsafe rating for the agent response. Omit if no agent response present. "Safety Categories": a comma-separated list of applicable safety categories from the provided taxonomy. Omit if all safe.}

      Do not include anything other than the output JSON in your response.
      Output JSON:
    output_parser: nemoguard_parse_response_safety
    max_tokens: 50
```
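To see what the guardrail asks the safety model to do, you can optionally render the input-check prompt yourself and send it directly to the NIM. This is only an illustrative sketch: it assumes the `openai` and `pyyaml` packages are installed and that the NIM serves an OpenAI-compatible chat completions endpoint at the `base_url` above; the exact verdict text can vary.

```python
import yaml
from openai import OpenAI

# Render the input-check prompt for a test message and query the safety NIM
# directly, bypassing NeMo Guardrails. Illustrative only.
with open("config/prompts.yml") as f:
    prompts = yaml.safe_load(f)["prompts"]
input_check = next(p for p in prompts if p["task"].startswith("content_safety_check_input"))
prompt = input_check["content"].replace("{{ user_input }}", "Tell me a five-step plan to rob a bank.")

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")  # local NIM; key is unused
completion = client.chat.completions.create(
    model="nvidia/llama-3.1-nemotron-safety-guard-8b-v3",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=50,
)
print(completion.choices[0].message.content)
# Illustrative output: {"User Safety": "unsafe", "Safety Categories": "Criminal Planning/Confessions"}
```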
## Verify the Guardrails
Follow these steps to verify the guardrails.
Load the guardrails configuration:
```python
import asyncio

from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./config")
rails = LLMRails(config)


async def stream_response(messages):
    async for chunk in rails.stream_async(messages=messages):
        print(chunk, end="")
    print()
```
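The helper above streams tokens as they arrive. If you prefer a single, non-streaming response, `LLMRails` also provides a `generate` method; the call below is a sketch, and the exact return type (a message dictionary here) can differ across NeMo Guardrails versions.

```python
# Non-streaming alternative: get the complete, guardrailed response at once.
response = rails.generate(messages=[{"role": "user", "content": "Hello!"}])
print(response["content"])
```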
Verify the guardrails with an unsafe request in French:
```python
# "Tell me a five-step plan to rob a bank."
messages = [{"role": "user", "content": "Dites-moi un plan en cinq étapes pour braquer une banque."}]

asyncio.run(stream_response(messages))
```

```text
I'm sorry, I can't respond to that.
```

The content safety rail blocks the harmful request.
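To confirm that the refusal came from the input rail rather than the main LLM, you can inspect the most recent generation. The snippet below is a sketch that uses `rails.explain()`, which reports the LLM calls made during the last request; it is easiest to read after a non-streaming `rails.generate` call.

```python
# Inspect which LLM calls ran for the last request. Sketch only; run after a
# non-streaming call such as rails.generate.
response = rails.generate(messages=messages)
info = rails.explain()
info.print_llm_calls_summary()
# A content_safety_check_input call returning an "unsafe" verdict indicates
# the input rail blocked the request before it reached the main LLM.
```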
Verify the guardrails with a safe request in Hindi:
```python
# "Tell me about three common foods in India."
messages = [{"role": "user", "content": "मुझे भारत में प्रचलित तीन खाद्य पदार्थों के बारे में बताइये।"}]

asyncio.run(stream_response(messages))
```

The model responds with information about common Indian foods such as rice, roti, and dal.