AI Response Moderation

AI Response Moderation scans the AI's replies and blocks responses that contain harmful or unwanted content.

info

This feature is currently in BETA and works only with ChatGPT; it does not yet work with Claude.

How to turn it on

  1. Click the Honest AI Shield icon in your toolbar
  2. On the Shields tab, expand the Premium Shield section
  3. Find "AI Response Moderation" (marked BETA)
  4. Flip the toggle to on
*Screenshot: DLP toggle in the Premium Shield section.*

What it scans for

When Response Moderation is on, it checks AI responses for:

| Category | What it catches |
| --- | --- |
| Censored Keywords | Words you have added to your censored keywords list |
| Suspicious Links | URLs that may be unsafe |
| Violence | Violent or graphic content |
| Hate Speech | Discriminatory or hateful language |
| Sexual Content | Sexually explicit material |
| Criminal Activity | Content describing illegal acts |
| Profanity | Strong or offensive language |
| Weapons | Content about weapons |
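Conceptually, the scan checks a response against each enabled category and blocks on the first match. Below is a minimal TypeScript sketch of that idea; every name (`scanResponse`, `CATEGORY_CHECKS`, the placeholder predicates) is hypothetical, and the real category detection is AI-based rather than the simple local checks shown here:

```typescript
// Hypothetical sketch: each category maps to a predicate over the response text.
type Category =
  | "censoredKeywords" | "suspiciousLinks" | "violence" | "hateSpeech"
  | "sexualContent" | "criminalActivity" | "profanity" | "weapons";

type Verdict = { blocked: boolean; category?: Category };

// Illustrative local checks only; model-based categories are stubbed out.
const CATEGORY_CHECKS: Record<Category, (text: string) => boolean> = {
  censoredKeywords: (t) => ["secret-project"].some((w) => t.toLowerCase().includes(w)),
  suspiciousLinks: (t) => /https?:\/\/\S+/i.test(t),
  violence: () => false,
  hateSpeech: () => false,
  sexualContent: () => false,
  criminalActivity: () => false,
  profanity: () => false,
  weapons: () => false,
};

// Check the response against only the categories the user has enabled.
function scanResponse(text: string, enabled: Category[]): Verdict {
  for (const category of enabled) {
    if (CATEGORY_CHECKS[category](text)) return { blocked: true, category };
  }
  return { blocked: false };
}
```

For example, `scanResponse("See http://phish.example/login", ["suspiciousLinks"])` would return a blocked verdict, while the same text passes if the Suspicious Links category is disabled.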

How to adjust what gets flagged

  1. Go to the Policies tab
  2. Expand "AI Response Moderation Policies"
  3. You will see two subsections:
    • Keyword Filtering (Local) — toggle for censored keywords matching
    • Response Moderation (AI) — toggles for each content category (suspicious links, violence, hate speech, etc.)

Turn individual categories on or off based on your needs.

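The two subsections above can be pictured as a settings object with one local keyword toggle and a per-category toggle map. The following TypeScript sketch is purely illustrative; the property names are hypothetical, not the extension's actual storage format:

```typescript
// Hypothetical shape of the Response Moderation policy settings.
interface ResponseModerationPolicies {
  keywordFilteringLocal: boolean; // Keyword Filtering (Local)
  aiCategories: {                 // Response Moderation (AI)
    suspiciousLinks: boolean;
    violence: boolean;
    hateSpeech: boolean;
    sexualContent: boolean;
    criminalActivity: boolean;
    profanity: boolean;
    weapons: boolean;
  };
}

// Example: keep most categories on, but allow profanity through.
const policies: ResponseModerationPolicies = {
  keywordFilteringLocal: true,
  aiCategories: {
    suspiciousLinks: true,
    violence: true,
    hateSpeech: true,
    sexualContent: true,
    criminalActivity: true,
    profanity: false,
    weapons: true,
  },
};
```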

Important

The main AI Response Moderation toggle in the Shields tab must be on for any of these policy settings to work. If the main toggle is off, no response scanning happens regardless of the policy settings.

Scanning for censored keywords in AI responses requires both: (1) the AI Response Moderation toggle turned on, and (2) the Censored Keywords toggle turned on in the Response Moderation policy. You also need to have added words to your censored keywords list (see Managing Keyword Policies).
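The two rules above amount to a simple gate: no scanning of any kind happens unless the main toggle is on, and keyword scanning additionally needs the keyword policy toggle plus a non-empty keyword list. A sketch in TypeScript, with all names (`ShieldState`, `scanningActive`, `keywordScanningActive`) hypothetical:

```typescript
// Hypothetical state model for the toggles described above.
interface ShieldState {
  mainToggleOn: boolean;             // AI Response Moderation toggle (Shields tab)
  censoredKeywordsToggleOn: boolean; // Censored Keywords toggle (policy)
  censoredKeywords: string[];        // words added via Managing Keyword Policies
}

// No response scanning of any kind happens while the main toggle is off.
function scanningActive(s: ShieldState): boolean {
  return s.mainToggleOn;
}

// Keyword scanning needs the main toggle, the policy toggle, and at least one keyword.
function keywordScanningActive(s: ShieldState): boolean {
  return s.mainToggleOn && s.censoredKeywordsToggleOn && s.censoredKeywords.length > 0;
}
```

With the main toggle off, both functions return `false` regardless of the policy settings, mirroring the behavior described above.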