Llama-Guard Integration

NeMo Guardrails provides out-of-the-box support for content moderation using Meta’s Llama Guard model.

In our testing, we observe significantly improved input and output content moderation performance compared to the self-check method. Please see the performance evaluation for benchmark numbers.

Usage

To configure your bot to use Llama Guard for input/output checking, follow the steps below:

  1. Add a model of type llama_guard to the models section of the config.yml file (the example below uses a vLLM setup; a sketch of how such a server can be started follows these steps):

    models:
      ...
    
      - type: llama_guard
        engine: vllm_openai
        parameters:
          openai_api_base: "http://localhost:5123/v1"
          model_name: "meta-llama/LlamaGuard-7b"
    
  2. Include the llama guard check input and llama guard check output flow names in the rails section of the config.yml file:

    rails:
      input:
        flows:
          - llama guard check input
      output:
        flows:
          - llama guard check output
    
  3. Define the llama_guard_check_input and the llama_guard_check_output prompts in the prompts.yml file (a fuller illustrative prompt is sketched after these steps):

    prompts:
      - task: llama_guard_check_input
        content: |
          <s>[INST] Task: ...
          <BEGIN UNSAFE CONTENT CATEGORIES>
          O1: ...
          O2: ...
      - task: llama_guard_check_output
        content: |
          <s>[INST] Task: ...
          <BEGIN UNSAFE CONTENT CATEGORIES>
          O1: ...
          O2: ...
    
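For reference, here is a sketch of what a fuller llama_guard_check_input prompt might look like. The category titles follow Meta's Llama Guard model card, and the {{ user_input }} template variable is assumed to be available for input checks; treat this as illustrative and consult the prompts shipped with the library for the exact template.

    # Illustrative sketch only; category titles follow Meta's Llama Guard model card.
    prompts:
      - task: llama_guard_check_input
        content: |
          <s>[INST] Task: Check if there is unsafe content in 'User' messages in conversations according to our safety policy with the below categories.
          <BEGIN UNSAFE CONTENT CATEGORIES>
          O1: Violence and Hate.
          O2: Sexual Content.
          O3: Criminal Planning.
          O4: Guns and Illegal Weapons.
          O5: Regulated or Controlled Substances.
          O6: Self-Harm.
          <END UNSAFE CONTENT CATEGORIES>

          <BEGIN CONVERSATION>
          User: {{ user_input }}
          <END CONVERSATION>

          Provide your safety assessment for 'User' in the above conversation:
          - First line must read 'safe' or 'unsafe'.
          - If unsafe, a second line must include a comma-separated list of violated categories. [/INST]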

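The openai_api_base parameter in step 1 should point at an OpenAI-compatible endpoint that serves the Llama Guard model. If you use vLLM, a minimal sketch of how to start such a server on port 5123 is shown below (exact flags can vary between vLLM versions):

    # Example only: start vLLM's OpenAI-compatible server for Llama Guard.
    pip install vllm
    python -m vllm.entrypoints.openai.api_server \
        --model meta-llama/LlamaGuard-7b \
        --port 5123
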
The rails execute the llama_guard_check_* actions. Each action returns a response with an allowed field, which is True if the user input or the bot message should be allowed and False otherwise, and a policy_violations field listing any unsafe content categories that were violated, as defined in the Llama Guard prompt.

    define flow llama guard check input
      $llama_guard_response = execute llama_guard_check_input
      $allowed = $llama_guard_response["allowed"]
      $llama_guard_policy_violations = $llama_guard_response["policy_violations"]

      if not $allowed
        bot refuse to respond
        stop

A similar flow is used for checking the bot output.
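A minimal sketch of what that flow might look like, assuming it simply mirrors the input flow and calls the llama_guard_check_output action (the flow bundled with the library is the authoritative version):

    # Illustrative sketch; see the library's built-in flows for the exact definition.
    define flow llama guard check output
      $llama_guard_response = execute llama_guard_check_output
      $allowed = $llama_guard_response["allowed"]
      $llama_guard_policy_violations = $llama_guard_response["policy_violations"]

      if not $allowed
        bot refuse to respond
        stop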

A complete example configuration that uses Llama Guard for input and output moderation is provided in this example folder.
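
Once the configuration folder contains the config.yml and prompts.yml files described above, the rails can be loaded and queried from Python as usual; the folder path and message below are placeholders:

    from nemoguardrails import RailsConfig, LLMRails

    # Load the guardrails configuration (assumes ./config holds the
    # config.yml and prompts.yml files described above).
    config = RailsConfig.from_path("./config")
    rails = LLMRails(config)

    # The Llama Guard input and output rails run automatically around this call.
    response = rails.generate(messages=[
        {"role": "user", "content": "Hello! Can you help me plan a trip?"}
    ])
    print(response["content"])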