An Introduction to OpenAI Moderations API: A Mini-Guide for Beginners

The Moderation API offered by OpenAI is a valuable tool for screening out undesirable content. By using this API, you can ensure that any prompts created by you or your users comply with the usage policies set forth by OpenAI. Beyond that, the Moderation API can also be employed as a content filter in your own projects.

This API allows for the detection and classification of various forms of unwanted content, including hate speech, threatening behavior, self-harm, sexual content, sexual content involving minors, violence, and graphic violence. In particular, the Moderation API can identify content that glorifies or promotes violent behavior, as well as graphic depictions of violence, such as those involving death.

Moderation Example

The Moderations API is user-friendly and straightforward to implement. To access it, send a POST request to the moderation endpoint at https://api.openai.com/v1/moderations with a JSON payload containing the input parameter, which holds the text you want to classify:

curl --location 'https://api.openai.com/v1/moderations' \
--header 'Authorization: Bearer API_KEY' \
--header 'Content-Type: application/json' \
--data '{
    "input": "Hello World"
}'

Here is the response:

{
    "id": "modr-6yS0WIUzxDgZyX4HuvUtsezHocHcc",
    "model": "text-moderation-004",
    "results": [
        {
            "flagged": false,
            "categories": {
                "sexual": false,
                "hate": false,
                "violence": false,
                "self-harm": false,
                "sexual/minors": false,
                "hate/threatening": false,
                "violence/graphic": false
            },
            "category_scores": {
                "sexual": 1.6681364286341704e-05,
                "hate": 2.5028452910191845e-06,
                "violence": 1.995009313304763e-08,
                "self-harm": 3.9649059035973266e-10,
                "sexual/minors": 7.246934607962885e-09,
                "hate/threatening": 3.1661111232761385e-11,
                "violence/graphic": 7.544665336922662e-09
            }
        }
    ]
}

Let’s examine the various response properties returned by the Moderations API:

  • “id”: This property represents the unique identifier assigned to the generated moderation response.
  • “model”: This property denotes the content moderation model utilized for generating the moderation response.
  • “results”: This property is an array of moderation objects representing the classification results for the text provided. It is an array because the “input” parameter itself accepts an array of strings, allowing multiple texts to be classified in a single request, as shown in the example after this list.
  • “flagged”: This property indicates whether the input text has been flagged for containing unwanted content.
  • “categories”: This property identifies which categories have been flagged in the input text.
  • “category_scores”: This property represents the score assigned to each category by the moderation model. The score is a floating-point number between 0 and 1, with a higher score indicating a greater likelihood that the text contains content relevant to that particular category.
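
For example, you can classify multiple texts in one call by passing an array of strings as the input; the “results” array in the response then contains one moderation object per input, in the same order:

curl --location 'https://api.openai.com/v1/moderations' \
--header 'Authorization: Bearer API_KEY' \
--header 'Content-Type: application/json' \
--data '{
    "input": ["Hello World", "How are you today?"]
}'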

In the “Hello World” example above, the input text did not trigger any flags, as it was harmless. Now let’s consider an example that would trigger a flag:

curl --location 'https://api.openai.com/v1/moderations' \
--header 'Authorization: Bearer API_KEY' \
--header 'Content-Type: application/json' \
--data '{
    "input": "I will find you and kill you!"
}'

Here is the response:

{
    "id": "modr-6yS7cwobv4sCLX8d6wMIdr7G4m5rR",
    "model": "text-moderation-004",
    "results": [
        {
            "flagged": true,
            "categories": {
                "sexual": false,
                "hate": false,
                "violence": true,
                "self-harm": false,
                "sexual/minors": false,
                "hate/threatening": false,
                "violence/graphic": false
            },
            "category_scores": {
                "sexual": 0.00010696260869735852,
                "hate": 0.00011891139729414135,
                "violence": 0.9997988343238831,
                "self-harm": 1.190846865561923e-09,
                "sexual/minors": 3.2540290249016834e-06,
                "hate/threatening": 3.4191150461992947e-06,
                "violence/graphic": 0.0004651463241316378
            }
        }
    ]
}

The API accurately classified this text as violent: the “violence” category is flagged, with a score very close to 1. Here is another example:

curl --location 'https://api.openai.com/v1/moderations' \
--header 'Authorization: Bearer API_KEY' \
--header 'Content-Type: application/json' \
--data '{
    "input": "If you don'\''t come back, I will kill myself!"
}'

(Note that the apostrophe in “don’t” has to be escaped as don'\''t, because the payload passed to curl is wrapped in single quotes in the shell.)

Here is the response:

{
    "id": "modr-6ySIj77CEd7ugb2fh6oZovJ1xlDLp",
    "model": "text-moderation-004",
    "results": [
        {
            "flagged": true,
            "categories": {
                "sexual": false,
                "hate": false,
                "violence": false,
                "self-harm": true,
                "sexual/minors": false,
                "hate/threatening": false,
                "violence/graphic": false
            },
            "category_scores": {
                "sexual": 2.661974576767534e-05,
                "hate": 3.548981055701006e-07,
                "violence": 0.37420549988746643,
                "self-harm": 0.992926299571991,
                "sexual/minors": 3.5079882110267135e-08,
                "hate/threatening": 5.988276985391394e-09,
                "violence/graphic": 2.0964519080735045e-06
            }
        }
    ]
}

This time, the “self-harm” category is flagged, with a score of roughly 0.99. Notice that the “violence” score is also fairly high (about 0.37), but it stays below the model’s threshold, so that category is not flagged.
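
If you use the Moderations API as a filter in your own projects, a common pattern is to check the top-level “flagged” property and reject the input before it reaches any downstream model. Here is a minimal shell sketch of that pattern (it assumes the jq JSON processor is installed, and API_KEY is a placeholder for your real key):

# Ask the Moderations API to classify the user's input.
# --silent suppresses curl's progress output so only the JSON reaches jq.
FLAGGED=$(curl --silent --location 'https://api.openai.com/v1/moderations' \
--header 'Authorization: Bearer API_KEY' \
--header 'Content-Type: application/json' \
--data '{
    "input": "If you do not come back, I will kill myself!"
}' | jq '.results[0].flagged')

# Only pass the text on to the rest of the application if it was not flagged.
if [ "$FLAGGED" = "true" ]; then
    echo "Input rejected by moderation."
else
    echo "Input accepted."
fi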

Moderation Models

You can specify which content moderation model the Moderations API uses by including the “model” request parameter. By default, the API employs the “text-moderation-latest” model, which is automatically upgraded over time, ensuring you always use the most current and accurate moderation model available. Alternatively, you can use the “text-moderation-stable” model, which is also updated periodically but may be slightly less accurate than the latest model. There is no cost difference between the two, so it is generally recommended to opt for “text-moderation-latest” to obtain the most reliable results.
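
For example, here is the earlier request pinned to the stable model via the “model” parameter:

curl --location 'https://api.openai.com/v1/moderations' \
--header 'Authorization: Bearer API_KEY' \
--header 'Content-Type: application/json' \
--data '{
    "input": "Hello World",
    "model": "text-moderation-stable"
}'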
