# OpenAI Moderations API

NeMo Guardrails supports using the OpenAI Moderations API as an input or output rail out-of-the-box. You need to have the `OPENAI_API_KEY` environment variable set or configure the key in your OpenAI client.
## Basic Usage

```yaml
rails:
  input:
    flows:
      # The simplified version using the flagged response
      - openai moderation
      # The detailed version with individual category checks
      # - openai moderation detailed
```
The `openai moderation` flow uses OpenAI's built-in flagging (the top-level `flagged` field in the moderation result) to decide whether the input should be allowed or not. The `openai moderation detailed` flow checks individual violation categories with custom logic.
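Both flows are driven by the same moderation result. For reference, here is a minimal sketch of what that result looks like when you call the moderation endpoint directly with the official `openai` Python SDK:

```python
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

response = client.moderations.create(
    model="omni-moderation-latest",
    input="...text to classify goes here...",
)

result = response.results[0]
print(result.flagged)          # single boolean: the "flagged response" above
print(result.categories)       # per-category booleans
print(result.category_scores)  # per-category confidence scores
```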
## Using the Moderation API Endpoint

You can also use the moderation endpoint directly:

```yaml
rails:
  config:
    moderation:
      providers:
        - id: openai-moderation
          provider: openai
          model: omni-moderation-latest
          action: openai_moderation_api
          default: true
```
Then call the endpoint:

```bash
curl -X POST http://localhost:8000/v1/moderations \
  -H "Content-Type: application/json" \
  -d '{
    "input": "...text to classify goes here..."
  }'
```
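Assuming the endpoint mirrors the shape of OpenAI's moderation response, a flagged input would return something like the following (abbreviated to two categories; values are illustrative):

```json
{
  "id": "modr-0123456789",
  "model": "omni-moderation-latest",
  "results": [
    {
      "flagged": true,
      "categories": {
        "violence": true,
        "harassment": false
      },
      "category_scores": {
        "violence": 0.86,
        "harassment": 0.001
      }
    }
  ]
}
```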
## Supported Categories

OpenAI's moderation API detects the following categories:

| Category | Description |
|---|---|
| `harassment` | Content that expresses, incites, or promotes harassing language towards any target |
| `harassment/threatening` | Harassment content that also includes violence or serious harm towards any target |
| `hate` | Content that expresses, incites, or promotes hate based on race, gender, ethnicity, religion, nationality, sexual orientation, disability status, or caste |
| `hate/threatening` | Hateful content that also includes violence or serious harm towards the targeted group |
| `illicit` | Content that gives advice or instruction on how to commit illicit acts |
| `illicit/violent` | Illicit content that also includes references to violence or procuring a weapon |
| `self-harm` | Content that promotes, encourages, or depicts acts of self-harm |
| `self-harm/intent` | Content where the speaker expresses intent to engage in acts of self-harm |
| `self-harm/instructions` | Content that encourages or gives instructions on acts of self-harm |
| `sexual` | Content meant to arouse sexual excitement or that promotes sexual services |
| `sexual/minors` | Sexual content that includes an individual who is under 18 years old |
| `violence` | Content that depicts death, violence, or physical injury |
| `violence/graphic` | Content that depicts death, violence, or physical injury in graphic detail |
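If you want to see which of these categories fire for a given text outside of a guardrails flow, a small helper along these lines works with the official Python SDK (the function name is ours, not part of any API):

```python
from openai import OpenAI

client = OpenAI()

def flagged_categories(text: str) -> list[str]:
    """Return the names of all moderation categories flagged for the text."""
    result = client.moderations.create(
        model="omni-moderation-latest",
        input=text,
    ).results[0]
    # `categories` is a model of booleans; dump it to a plain dict to iterate.
    # Field names use underscores, e.g. "self_harm", "harassment_threatening".
    return [name for name, hit in result.categories.model_dump().items() if hit]

print(flagged_categories("...text to classify goes here..."))
```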
## Customizing Thresholds

To customize the behavior, you can overwrite the default flows in your config. For example, to create a custom moderation flow:

```colang
define subflow custom openai moderation
  """Custom guardrail with specific threshold logic."""
  $result = execute openai_moderation_api

  # Block if OpenAI flags it as harmful
  if $result.get("flagged", False)
    bot refuse to respond
    stop

  # Custom threshold checks on category scores
  if $result.get("category_scores", {}).get("violence", 0) > 0.5
    bot inform cannot engage in violent content
    stop
```
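The flows in this guide treat the result of `openai_moderation_api` as a dictionary with `flagged`, `categories`, and `category_scores` keys. If you need to supply or override that action yourself, a minimal sketch of a matching custom action might look like this (the return shape is an assumption chosen to match the flows above):

```python
from openai import AsyncOpenAI
from nemoguardrails.actions import action

client = AsyncOpenAI()

@action(name="openai_moderation_api")
async def openai_moderation_api(text: str) -> dict:
    """Call the OpenAI moderation endpoint and return a plain dict."""
    response = await client.moderations.create(
        model="omni-moderation-latest",
        input=text,
    )
    result = response.results[0]
    return {
        "flagged": result.flagged,
        "categories": result.categories.model_dump(),
        "category_scores": result.category_scores.model_dump(),
    }
```

You can register such an action with `rails.register_action(...)` or place it in an `actions.py` file next to your config, which NeMo Guardrails loads automatically.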
## Detailed Category Handling

Using the OpenAI moderation API, you can control the various violation categories individually. The API returns both a boolean flag and a confidence score for each category. Here's an example of a detailed input moderation flow:

```colang
define flow openai input moderation detailed
  $result = execute openai_moderation_api(text=$user_message)

  if $result.get("categories", {}).get("harassment", False)
    bot inform cannot engage in abusive or harmful behavior
    stop

  if $result.get("categories", {}).get("hate", False)
    bot inform cannot engage in abusive or harmful behavior
    stop

  if $result.get("categories", {}).get("sexual", False)
    bot inform cannot engage in inappropriate content
    stop

  if $result.get("categories", {}).get("violence", False)
    bot inform cannot engage in abusive or harmful behavior
    stop

define bot inform cannot engage in abusive or harmful behavior
  "I will not engage in any abusive or harmful behavior."

define bot inform cannot engage in inappropriate content
  "I will not engage with inappropriate content."
```
## Using with Category Scores

You can also use the confidence scores for more nuanced control:

```colang
define subflow openai score based moderation
  """Moderation based on confidence scores rather than binary flags."""
  $result = execute openai_moderation_api

  # Custom thresholds for different categories
  if $result.get("category_scores", {}).get("harassment", 0) > 0.7
    bot inform cannot engage in abusive or harmful behavior
    stop

  if $result.get("category_scores", {}).get("hate", 0) > 0.6
    bot inform cannot engage in abusive or harmful behavior
    stop

  if $result.get("category_scores", {}).get("violence", 0) > 0.8
    bot inform cannot engage in abusive or harmful behavior
    stop
```
## Environment Setup

Make sure you have your OpenAI API key configured:

```bash
export OPENAI_API_KEY="your-api-key-here"
```

Or you can configure it in your application code when initializing the OpenAI client.
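For example, with the official Python SDK:

```python
from openai import OpenAI

# Pass the key explicitly instead of relying on the environment variable.
client = OpenAI(api_key="your-api-key-here")
```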
## Installation

The OpenAI moderation integration requires the `openai` package:

```bash
pip install openai
```

This is typically included when you install NeMo Guardrails with OpenAI support.