HuggingFace Classifier Integration#

Content moderation using HuggingFace text-classification and token-classification models on input, output, and retrieval flows.

Overview#

The HF classifier rails are a fast, prompt-free alternative to LLM-based self-check rails. They support four inference backends:

| Backend | Engine | Endpoint | Use Case |
| --- | --- | --- | --- |
| Local | `local` | N/A (in-process) | HuggingFace Transformers pipeline |
| vLLM | `vllm` | `{base_url}/classify` | vLLM classify endpoint |
| KServe | `kserve` | `{base_url}/v1/models/{model}:predict` | KServe v1 predict endpoint |
| FMS | `fms` | `{base_url}/api/v1/text/contents` | IBM FMS guardrails-detectors endpoint |
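
As a rough illustration of the table above, the sketch below expands the endpoint template for each remote engine. `endpoint_for` and `ENDPOINT_TEMPLATES` are hypothetical names used only for this example, not part of the library, and the handling of a trailing slash on `base_url` is an assumption.

```python
# Path templates copied from the backend table; the local engine runs in-process.
ENDPOINT_TEMPLATES = {
    "vllm": "{base_url}/classify",
    "kserve": "{base_url}/v1/models/{model}:predict",
    "fms": "{base_url}/api/v1/text/contents",
}


def endpoint_for(engine, base_url=None, model=None):
    """Return the URL a remote engine would call, or None for the local engine."""
    if engine == "local":
        return None
    if engine not in ENDPOINT_TEMPLATES:
        raise ValueError(f"unknown engine: {engine!r}")
    if not base_url:
        raise ValueError(f"engine {engine!r} requires base_url")
    # Assumption: a trailing slash on base_url is dropped before joining.
    return ENDPOINT_TEMPLATES[engine].format(base_url=base_url.rstrip("/"), model=model)


print(endpoint_for("kserve", "https://classifier.internal:443", "toxic-bert"))
# https://classifier.internal:443/v1/models/toxic-bert:predict
```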

Setup#

For the local backend:

pip install "nemoguardrails[hf-classifier]"

Or directly:

pip install transformers torch

The model is downloaded on first use from HuggingFace Hub. For air-gapped environments, set HF_HUB_OFFLINE=1 and point model to a local path.

For remote backends, a running inference server is required. No additional Python dependencies are needed.

Colang 2.x requires an explicit import in your Colang file (e.g., config.co):

import nemoguardrails.library.hf_classifier

Colang 1.0 discovers library flows automatically, so no import is needed.

Usage#

Configuration Structure#

Add the classifier configuration to your config.yml:

rails:
  config:
    hf_classifier:
      named_entity_recognition:
        engine: local
        model: dslim/distilbert-NER
        task: token-classification
        threshold: 0.7
        blocked_labels:
          - "PER"
          - "LOC"
          - "ORG"
        parameters:
          aggregation_strategy: simple
  input:
    flows:
      - hf classifier check input $classifier=named_entity_recognition
  output:
    flows:
      - hf classifier check output $classifier=named_entity_recognition

The $classifier parameter must match a classifier name defined under rails.config.hf_classifier (here, named_entity_recognition).
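
A mismatch between a flow's $classifier value and the configured names is easy to introduce. The following is a minimal sketch of a pre-flight check over an already-parsed config.yml (represented here as a plain dict). `find_unknown_classifiers` is a hypothetical helper for illustration, not something the library provides.

```python
import re

# Matches "hf classifier check <direction> $classifier=<name>" as written in config.yml.
FLOW_PATTERN = re.compile(
    r"^hf classifier check (input|output|retrieval) \$classifier=(\w+)$"
)


def find_unknown_classifiers(config):
    """Return (direction, name) pairs whose name is not defined under hf_classifier."""
    rails = config.get("rails", {})
    defined = set(rails.get("config", {}).get("hf_classifier", {}))
    unknown = []
    for direction in ("input", "output", "retrieval"):
        for flow in rails.get(direction, {}).get("flows", []):
            match = FLOW_PATTERN.match(flow.strip())
            if match and match.group(2) not in defined:
                unknown.append((direction, match.group(2)))
    return unknown


config = {
    "rails": {
        "config": {"hf_classifier": {"named_entity_recognition": {"engine": "local"}}},
        "input": {"flows": ["hf classifier check input $classifier=named_entity_recognition"]},
        "output": {"flows": ["hf classifier check output $classifier=ner"]},  # typo on purpose
    }
}
print(find_unknown_classifiers(config))  # [('output', 'ner')]
```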

Configuration Options#

Common fields (all engines)

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `engine` | string | required | `local`, `vllm`, `kserve`, or `fms`. |
| `model` | string | required | HuggingFace model ID, local path, or server-side model name. |
| `threshold` | float | `0.5` | Minimum score to trigger blocking (0.0-1.0). |
| `blocked_labels` | list | `[]` | Labels that trigger blocking above threshold. See Blocked Labels. |
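
To make the interaction between `threshold` and `blocked_labels` concrete, here is a small sketch of the decision as this page describes it. `triggered_detections` is a hypothetical function, and treating a score exactly equal to the threshold as a trigger is an assumption (the table says both "minimum score" and "above threshold").

```python
def triggered_detections(detections, blocked_labels, threshold=0.5):
    """Keep the (label, score) pairs that would trigger blocking."""
    blocked = set(blocked_labels)
    # Assumption: a score equal to the threshold counts as a trigger.
    return [
        (label, score)
        for label, score in detections
        if label in blocked and score >= threshold
    ]


detections = [("toxic", 0.92), ("neutral", 0.97), ("toxic", 0.31)]

# Only the blocked label at or above the threshold is kept.
print(triggered_detections(detections, ["toxic"], threshold=0.7))  # [('toxic', 0.92)]

# In this sketch, an empty blocked_labels list never triggers.
print(triggered_detections(detections, []))  # []
```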

Blocked Labels#

Values must match the label strings returned by the model or server. For local and vLLM backends with text-classification, labels come from the model’s id2label mapping (e.g., "toxic", "LABEL_1"). For token-classification with aggregation_strategy, labels are entity groups with the B-/I- prefix stripped (e.g., "PER", "LOC"). For FMS, labels come from the detection_type field in the server response. For KServe, labels are stringified class indices ("0", "1").

To discover labels, inspect id2label from the model config:

from transformers import AutoConfig
config = AutoConfig.from_pretrained("dslim/distilbert-NER")
print(config.id2label)
# {0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-ORG', 4: 'I-ORG', 5: 'B-LOC', 6: 'I-LOC', 7: 'B-MISC', 8: 'I-MISC'}
# With aggregation_strategy: simple, use "PER", "ORG", "LOC", "MISC" (prefix stripped)

For remote servers, send a test request and inspect the response.
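
For token-classification models, the prefix stripping described above can be sketched in a few lines. The helper below is illustrative only: `entity_groups` is a hypothetical name, the `id2label` dict is copied from the example output above, and dropping the `O` (outside) tag is an assumption about which labels are worth blocking.

```python
def entity_groups(id2label):
    """Strip B-/I- prefixes and drop the 'O' (outside) tag, keeping first-seen order."""
    groups = []
    for _, label in sorted(id2label.items()):
        if label == "O":
            continue
        group = label[2:] if label[:2] in ("B-", "I-") else label
        if group not in groups:
            groups.append(group)
    return groups


id2label = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-ORG", 4: "I-ORG",
            5: "B-LOC", 6: "I-LOC", 7: "B-MISC", 8: "I-MISC"}
print(entity_groups(id2label))  # ['PER', 'ORG', 'LOC', 'MISC']
```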

Local engine fields

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `task` | string | `text-classification` | Pipeline task type. Use `token-classification` for NER models. |
| `parameters` | dict | `{}` | Kwargs forwarded to `transformers.pipeline()`. |

Remote engine fields (vllm, kserve, fms)

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `base_url` | string | required | Inference server URL. |
| `api_key_env_var` | string | `null` | Environment variable name holding the API key. |
| `parameters.timeout` | float | `30.0` | Request timeout in seconds. |
| `parameters.verify_ssl` | bool | `true` | Set `false` to skip TLS verification. |
| `parameters.ca_cert` | string | `null` | CA bundle path for custom CAs. |
| `parameters.client_cert` | string | `null` | Client certificate path for mTLS. |
| `parameters.client_key` | string | `null` | Client key path for mTLS. Requires `client_cert`. |
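
The sketch below shows one way these fields could map onto HTTP client options. It is not the library's implementation. `build_request_options` is a hypothetical helper, the `verify`/`cert` keys follow the convention used by clients such as `requests`, and the bearer-token header and the precedence of `verify_ssl: false` over `ca_cert` are assumptions.

```python
import os


def build_request_options(classifier_config, environ=os.environ):
    """Translate a remote classifier config into requests-style keyword options."""
    params = classifier_config.get("parameters", {})
    ca_cert = params.get("ca_cert")
    client_cert = params.get("client_cert")
    client_key = params.get("client_key")
    if client_key and not client_cert:
        raise ValueError("client_key requires client_cert")

    options = {"timeout": float(params.get("timeout", 30.0))}
    # verify is either a bool or a path to a CA bundle.
    # Assumption: verify_ssl=false wins over ca_cert when both are given.
    if not params.get("verify_ssl", True):
        options["verify"] = False
    else:
        options["verify"] = ca_cert or True
    if client_cert:
        options["cert"] = (client_cert, client_key) if client_key else client_cert

    headers = {}
    env_var = classifier_config.get("api_key_env_var")
    if env_var and environ.get(env_var):
        # Assumption: the API key is sent as a bearer token.
        headers["Authorization"] = f"Bearer {environ[env_var]}"
    options["headers"] = headers
    return options


example = {
    "api_key_env_var": "OCP_TOKEN",
    "parameters": {"ca_cert": "/etc/ssl/custom-ca.pem"},
}
print(build_request_options(example, environ={"OCP_TOKEN": "abc"}))
```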

Input Rails#

Prompt injection detection using KServe:

rails:
  config:
    hf_classifier:
      prompt_injection:
        engine: kserve
        model: prompt-injection-detector
        base_url: "https://prompt-injection-detector-route.apps.example.com"
        api_key_env_var: OCP_TOKEN
        threshold: 0.5
        blocked_labels:
          - "1"
        parameters:
          verify_ssl: false
  input:
    flows:
      - hf classifier check input $classifier=prompt_injection

Output Rails#

HAP (hate, abuse, and profanity) detection using FMS:

rails:
  config:
    hf_classifier:
      hap:
        engine: fms
        model: hap-detector
        base_url: "https://detector-hap-route.apps.example.com"
        api_key_env_var: OCP_TOKEN
        threshold: 0.7
        blocked_labels:
          - "LABEL_1"
        parameters:
          verify_ssl: false
  output:
    flows:
      - hf classifier check output $classifier=hap

Retrieval Rails#

The retrieval rail classifies the combined retrieved text as a single input. If any blocked label is detected above threshold, all retrieved chunks are cleared.

rails:
  config:
    hf_classifier:
      named_entity_recognition:
        engine: local
        model: dslim/distilbert-NER
        task: token-classification
        threshold: 0.7
        blocked_labels:
          - "PER"
          - "LOC"
          - "ORG"
        parameters:
          aggregation_strategy: simple
  retrieval:
    flows:
      - hf classifier check retrieval $classifier=named_entity_recognition
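
The all-or-nothing behavior of the retrieval rail can be sketched as follows. `check_retrieval` and `fake_ner` are hypothetical stand-ins for illustration, and joining the chunks with newlines before classification is an assumption.

```python
def check_retrieval(chunks, classify, blocked_labels, threshold):
    """Return the chunks unchanged, or an empty list if the combined text is blocked."""
    # Assumption: chunks are joined with newlines before classification.
    combined = "\n".join(chunks)
    blocked = set(blocked_labels)
    for label, score in classify(combined):
        if label in blocked and score >= threshold:
            return []  # one hit clears every chunk, not just the offending one
    return list(chunks)


def fake_ner(text):
    """Stand-in classifier: flags the word 'Alice' as a person."""
    return [("PER", 0.99)] if "Alice" in text else []


chunks = ["Alice lives in Paris.", "The invoice is due on Friday."]
print(check_retrieval(chunks, fake_ner, ["PER"], 0.7))      # []
print(check_retrieval(chunks[1:], fake_ner, ["PER"], 0.7))  # ['The invoice is due on Friday.']
```

Because one detection discards every chunk, a single chunk containing a blocked entity removes all retrieved context for that turn.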

Complete Example#

HAP (FMS), prompt injection (KServe), and language classification (vLLM) with streaming:

models:
  - type: main
    engine: openai
    model: my-model
    parameters:
      base_url: "https://llm-server.apps.example.com/v1"

rails:
  config:
    hf_classifier:
      hap:
        engine: fms
        model: hap-detector
        base_url: "https://detector-hap-route.apps.example.com"
        api_key_env_var: OCP_TOKEN
        threshold: 0.7
        blocked_labels:
          - "LABEL_1"
        parameters:
          verify_ssl: false

      prompt_injection:
        engine: kserve
        model: prompt-injection-detector
        base_url: "https://prompt-injection-detector-route.apps.example.com"
        api_key_env_var: OCP_TOKEN
        threshold: 0.5
        blocked_labels:
          - "1"
        parameters:
          verify_ssl: false

      lang:
        engine: vllm
        model: language-classifier
        base_url: "https://language-classifier-route.apps.example.com"
        api_key_env_var: OCP_TOKEN
        threshold: 0.5
        blocked_labels:
          - "fr"
          - "de"
          - "es"
        parameters:
          verify_ssl: false

  input:
    flows:
      - hf classifier check input $classifier=prompt_injection
      - hf classifier check input $classifier=hap
      - hf classifier check input $classifier=lang
  output:
    flows:
      - hf classifier check output $classifier=hap
    streaming:
      enabled: true
      stream_first: false

Return Value#

The check returns True if the text is allowed and False if it is blocked. Triggered labels and scores are logged at INFO level:

HF classifier 'hap': blocked (detections: [('LABEL_1', 0.92)])
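
A minimal sketch of this contract, reproducing the log line above with the standard `logging` module. The logger name and the `check` function are hypothetical; the point is only that the line appears when INFO-level logging is enabled.

```python
import logging

log = logging.getLogger("hf_classifier_sketch")  # hypothetical logger name


def check(name, triggered):
    """Return True if allowed, False if blocked; log blocked detections at INFO."""
    if triggered:
        log.info("HF classifier '%s': blocked (detections: %s)", name, triggered)
        return False
    return True


logging.basicConfig(level=logging.INFO, format="%(message)s")
allowed = check("hap", [("LABEL_1", 0.92)])
# logs: HF classifier 'hap': blocked (detections: [('LABEL_1', 0.92)])
print(allowed)  # False
```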

mTLS and Custom CA#

To trust a private CA and authenticate with a client certificate, set the TLS options under parameters:

rails:
  config:
    hf_classifier:
      toxicity:
        engine: kserve
        model: toxic-bert
        base_url: "https://classifier.internal:443"
        threshold: 0.7
        blocked_labels:
          - toxic
        parameters:
          ca_cert: /etc/ssl/custom-ca.pem
          client_cert: /etc/ssl/client.pem
          client_key: /etc/ssl/client.key

HF Classifier Rail Behavior#

When blocked, input and output rails respond with "I'm sorry, I can't respond to that." and abort. If enable_rails_exceptions is set, an InputRailException or OutputRailException is raised instead. Retrieval rails clear all retrieved chunks if any blocked label is detected. With streaming enabled, the output rail checks the accumulated response after streaming completes.
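
The two blocked-input outcomes can be sketched as below. This illustrates the described behavior and is not the library's code: `handle_input` is hypothetical, and `InputRailException` is defined locally as a stand-in for the exception type named above.

```python
REFUSAL = "I'm sorry, I can't respond to that."


class InputRailException(Exception):
    """Local stand-in for the exception type named above."""


def handle_input(text, is_allowed, enable_rails_exceptions=False):
    """Return None when the text passes, else refuse or raise depending on the flag."""
    if is_allowed(text):
        return None  # nothing to do; processing continues
    if enable_rails_exceptions:
        raise InputRailException("hf classifier check input blocked the message")
    return REFUSAL


def block_everything(text):
    """Stand-in check that rejects every message."""
    return False


print(handle_input("hello", block_everything))  # I'm sorry, I can't respond to that.

try:
    handle_input("hello", block_everything, enable_rails_exceptions=True)
except InputRailException as exc:
    print(f"raised: {exc}")
```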