Llama-Guard Integration
NeMo Guardrails provides out-of-the-box support for content moderation using Meta’s Llama Guard model.
In our testing, we observe significantly improved input and output content moderation performance compared to the self-check method. Please see the performance evaluation for benchmark numbers.
Usage
To configure your bot to use Llama Guard for input/output checking, follow these steps:
1. Add a model of type `llama_guard` to the `models` section of the `config.yml` file. The example below assumes Llama Guard is served with vLLM. Because vLLM exposes an OpenAI-compatible API, `engine: openai` plus `parameters.base_url` reaches it through NeMo Guardrails’ built-in client with no LangChain dependency. For background, see Migrating to 0.22.

   ```yaml
   models:
     ...
     - type: llama_guard
       engine: openai
       model: meta-llama/LlamaGuard-7b
       parameters:
         base_url: "http://localhost:5123/v1"
         api_key: EMPTY
   ```
   Note: Set `api_key: EMPTY` (or any non-empty placeholder) when a self-hosted vLLM server does not enforce authentication. If your deployment requires a real token, replace `EMPTY` with the literal token value, or omit `api_key` and set `api_key_env_var` at the top level of the model entry (not inside `parameters:`):

   ```yaml
   - type: llama_guard
     engine: openai
     model: meta-llama/LlamaGuard-7b
     api_key_env_var: MY_LLAMA_GUARD_API_KEY
     parameters:
       base_url: "http://localhost:5123/v1"
   ```
2. Include the `llama guard check input` and `llama guard check output` flow names in the `rails` section of the `config.yml` file:

   ```yaml
   rails:
     input:
       flows:
         - llama guard check input
     output:
       flows:
         - llama guard check output
   ```
3. Define the `llama_guard_check_input` and `llama_guard_check_output` prompts in the `prompts.yml` file:

   ```yaml
   prompts:
     - task: llama_guard_check_input
       content: |
         <s>[INST] Task: ...
         <BEGIN UNSAFE CONTENT CATEGORIES>
         O1: ...
         O2: ...
     - task: llama_guard_check_output
       content: |
         <s>[INST] Task: ...
         <BEGIN UNSAFE CONTENT CATEGORIES>
         O1: ...
         O2: ...
   ```
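Before wiring the model into the rails, it can help to confirm that the vLLM endpoint from step 1 responds on its own. The sketch below is a minimal, stdlib-only illustration that builds (but does not send) a request to the OpenAI-compatible `/completions` route under the configured `base_url`. It reads the token from the environment variable named in the note and falls back to the `EMPTY` placeholder. The helper name `build_llama_guard_request`, the `max_tokens` value, and the sample prompt are assumptions for illustration; this is not how NeMo Guardrails itself calls the model.

```python
import json
import os
import urllib.request

# Values mirroring the config.yml example in step 1.
BASE_URL = "http://localhost:5123/v1"
MODEL = "meta-llama/LlamaGuard-7b"


def build_llama_guard_request(
    prompt: str, api_key_env_var: str = "MY_LLAMA_GUARD_API_KEY"
) -> urllib.request.Request:
    """Hypothetical helper: build a POST to the OpenAI-compatible completions route."""
    # Use a real token if the env var is set; otherwise send the placeholder
    # that a self-hosted vLLM server without auth accepts.
    api_key = os.environ.get(api_key_env_var, "EMPTY")
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": 16,    # assumed: a verdict plus a few category codes is short
        "temperature": 0.0,  # deterministic classification
    }
    return urllib.request.Request(
        url=BASE_URL.rstrip("/") + "/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
```

To send it against a running server, pass the request to `urllib.request.urlopen` and read `choices[0]["text"]` from the JSON response, which is where the OpenAI completions format places the generated text.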
The rails execute the `llama_guard_check_*` actions. Each action returns a result whose `allowed` field is `True` if the user input or bot message should be allowed and `False` otherwise, and whose `policy_violations` field lists the flagged unsafe content categories, as defined in the Llama Guard prompt. The input flow reads both fields:
```colang
define flow llama guard check input
  $llama_guard_response = execute llama_guard_check_input
  $allowed = $llama_guard_response["allowed"]
  $llama_guard_policy_violations = $llama_guard_response["policy_violations"]

  if not $allowed
    bot refuse to respond
    stop

# (similar flow for checking output)
```
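The shape of that result can be illustrated with a small parser. Llama Guard answers with `safe`, or with `unsafe` followed by a second line of comma-separated category codes such as `O1,O3`. The sketch below maps that raw text onto the `allowed` / `policy_violations` fields the flow reads and mirrors the flow's refusal check. The function names are hypothetical, this is not the library's `llama_guard_check_*` implementation, and treating unrecognized output as not allowed (failing closed) is a design assumption.

```python
from typing import TypedDict


class LlamaGuardResponse(TypedDict):
    allowed: bool
    policy_violations: list[str]


def parse_llama_guard_output(text: str) -> LlamaGuardResponse:
    """Hypothetical parser for Llama Guard's raw verdict text."""
    # Drop surrounding whitespace and blank lines; the model often emits a leading space.
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    verdict = lines[0].lower() if lines else ""
    if verdict == "safe":
        return {"allowed": True, "policy_violations": []}
    # For "unsafe", the second line (if present) holds category codes, e.g. "O1,O3".
    violations: list[str] = []
    if verdict == "unsafe" and len(lines) > 1:
        violations = [code.strip() for code in lines[1].split(",") if code.strip()]
    # Anything other than an explicit "safe" is treated as not allowed (fail closed).
    return {"allowed": False, "policy_violations": violations}


def should_refuse(response: LlamaGuardResponse) -> bool:
    """Mirror of the Colang flow: refuse and stop when the check is not allowed."""
    return not response["allowed"]
```

For example, `parse_llama_guard_output("unsafe\nO3")` yields `allowed` set to `False` and `policy_violations` equal to `["O3"]`, so `should_refuse` returns `True`, matching the `bot refuse to respond` branch.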
A complete example configuration that uses Llama Guard for input and output moderation is provided in this example folder.