HuggingFace Classifier Integration#
Content moderation using HuggingFace text classification models on input, output, and retrieval flows.
Overview#
Fast, prompt-free alternative to LLM-based self-check rails. Supports four inference backends:
Backend |
Engine |
Endpoint |
Use Case |
|---|---|---|---|
Local |
|
N/A (in-process) |
HuggingFace Transformers pipeline |
vLLM |
|
|
vLLM classify endpoint |
KServe |
|
|
KServe v1 predict endpoint |
FMS |
|
|
IBM FMS guardrails-detectors endpoint |
Setup#
For the local backend:
pip install nemoguardrails[hf-classifier]
Or directly:
pip install transformers torch
The model is downloaded on first use from HuggingFace Hub. For air-gapped environments, set HF_HUB_OFFLINE=1 and point model to a local path.
For remote backends, a running inference server is required. No additional Python dependencies are needed.
Colang 2.x requires an explicit import in your Colang file (e.g., config.co):
import nemoguardrails.library.hf_classifier
Colang 1.0 auto-discovers library flows.
Usage#
Configuration Structure#
Add the classifier configuration to your config.yml:
rails:
config:
hf_classifier:
named_entity_recognition:
engine: local
model: dslim/distilbert-NER
task: token-classification
threshold: 0.7
blocked_labels:
- "PER"
- "LOC"
- "ORG"
parameters:
aggregation_strategy: simple
input:
flows:
- hf classifier check input $classifier=named_entity_recognition
output:
flows:
- hf classifier check output $classifier=named_entity_recognition
The $classifier parameter must match the name under rails.config.hf_classifier.
Configuration Options#
Common fields (all engines)
Option |
Type |
Default |
Description |
|---|---|---|---|
|
string |
required |
|
|
string |
required |
HuggingFace model ID, local path, or server-side model name. |
|
float |
|
Minimum score to trigger blocking (0.0-1.0). |
|
list |
|
Labels that trigger blocking above threshold. See Blocked Labels. |
Blocked Labels#
Values must match the label strings returned by the model or server. For local and vLLM backends with text-classification, labels come from the model’s id2label mapping (e.g., "toxic", "LABEL_1"). For token-classification with aggregation_strategy, labels are entity groups with the B-/I- prefix stripped (e.g., "PER", "LOC"). For FMS, labels come from the detection_type field in the server response. For KServe, labels are stringified class indices ("0", "1").
To discover labels, inspect id2label from the model config:
from transformers import AutoConfig
config = AutoConfig.from_pretrained("dslim/distilbert-NER")
print(config.id2label)
# {0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-ORG', 4: 'I-ORG', 5: 'B-LOC', 6: 'I-LOC', 7: 'B-MISC', 8: 'I-MISC'}
# With aggregation_strategy: simple, use "PER", "ORG", "LOC", "MISC" (prefix stripped)
For remote servers, send a test request and inspect the response.
Local engine fields
Option |
Type |
Default |
Description |
|---|---|---|---|
|
string |
|
Pipeline task type. Use |
|
dict |
|
Kwargs forwarded to |
Remote engine fields (vllm, kserve, fms)
Option |
Type |
Default |
Description |
|---|---|---|---|
|
string |
required |
Inference server URL. |
|
string |
|
Environment variable name holding the API key. |
|
float |
|
Request timeout in seconds. |
|
bool |
|
Set |
|
string |
|
CA bundle path for custom CAs. |
|
string |
|
Client certificate path for mTLS. |
|
string |
|
Client key path for mTLS. Requires |
Input Rails#
Prompt injection detection using KServe:
rails:
config:
hf_classifier:
prompt_injection:
engine: kserve
model: prompt-injection-detector
base_url: "https://prompt-injection-detector-route.apps.example.com"
api_key_env_var: OCP_TOKEN
threshold: 0.5
blocked_labels:
- "1"
parameters:
verify_ssl: false
input:
flows:
- hf classifier check input $classifier=prompt_injection
Output Rails#
HAP detection using FMS:
rails:
config:
hf_classifier:
hap:
engine: fms
model: hap-detector
base_url: "https://detector-hap-route.apps.example.com"
api_key_env_var: OCP_TOKEN
threshold: 0.7
blocked_labels:
- "LABEL_1"
parameters:
verify_ssl: false
output:
flows:
- hf classifier check output $classifier=hap
Retrieval Rails#
The retrieval rail classifies the combined retrieved text as a single input. If any blocked label is detected above threshold, all retrieved chunks are cleared.
rails:
config:
hf_classifier:
named_entity_recognition:
engine: local
model: dslim/distilbert-NER
task: token-classification
threshold: 0.7
blocked_labels:
- "PER"
- "LOC"
- "ORG"
parameters:
aggregation_strategy: simple
retrieval:
flows:
- hf classifier check retrieval $classifier=named_entity_recognition
Complete Example#
HAP (FMS), prompt injection (KServe), and language classification (vLLM) with streaming:
models:
- type: main
engine: openai
model: my-model
parameters:
base_url: "https://llm-server.apps.example.com/v1"
rails:
config:
hf_classifier:
hap:
engine: fms
model: hap-detector
base_url: "https://detector-hap-route.apps.example.com"
api_key_env_var: OCP_TOKEN
threshold: 0.7
blocked_labels:
- "LABEL_1"
parameters:
verify_ssl: false
prompt_injection:
engine: kserve
model: prompt-injection-detector
base_url: "https://prompt-injection-detector-route.apps.example.com"
api_key_env_var: OCP_TOKEN
threshold: 0.5
blocked_labels:
- "1"
parameters:
verify_ssl: false
lang:
engine: vllm
model: language-classifier
base_url: "https://language-classifier-route.apps.example.com"
api_key_env_var: OCP_TOKEN
threshold: 0.5
blocked_labels:
- "fr"
- "de"
- "es"
parameters:
verify_ssl: false
input:
flows:
- hf classifier check input $classifier=prompt_injection
- hf classifier check input $classifier=hap
- hf classifier check input $classifier=lang
output:
flows:
- hf classifier check output $classifier=hap
streaming:
enabled: true
stream_first: false
Return Value#
Returns True if allowed, False if blocked. Triggered labels and scores are logged at INFO level:
HF classifier 'hap': blocked (detections: [('LABEL_1', 0.92)])
mTLS and Custom CA#
rails:
config:
hf_classifier:
toxicity:
engine: kserve
model: toxic-bert
base_url: "https://classifier.internal:443"
threshold: 0.7
blocked_labels:
- toxic
parameters:
ca_cert: /etc/ssl/custom-ca.pem
client_cert: /etc/ssl/client.pem
client_key: /etc/ssl/client.key
HF Classifier Rail Behavior#
When blocked, input and output rails respond with "I'm sorry, I can't respond to that." and abort. If enable_rails_exceptions is set, an InputRailException or OutputRailException is raised instead. Retrieval rails clear all retrieved chunks if any blocked label is detected. With streaming enabled, the output rail checks the accumulated response after streaming completes.