AI Response Moderation

AI Response Moderation scans the AI's replies and blocks responses that contain harmful or unwanted content.

info

This feature is currently in BETA and works only with ChatGPT; it does not yet work with Claude.

How to turn it on

  1. Click the Honest AI Shield icon in your toolbar
  2. On the Shields tab, expand the Premium Shield section
  3. Find "AI Response Moderation" (marked BETA)
  4. Flip the toggle to on
*Screenshot: DLP toggle in the Premium Shield section.*

What it scans for

When Response Moderation is on, it checks AI responses for:

| Category | What it catches |
| --- | --- |
| Censored Keywords | Words you have added to your censored keywords list |
| Suspicious Links | URLs that may be unsafe |
| Violence | Violent or graphic content |
| Hate Speech | Discriminatory or hateful language |
| Sexual Content | Sexually explicit material |
| Criminal Activity | Content describing illegal acts |
| Profanity | Strong or offensive language |
| Weapons | Content about weapons |
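Conceptually, the scan checks a response against each enabled category and blocks on the first match. Below is a minimal TypeScript sketch of that idea; every name (`scanResponse`, `CATEGORY_CHECKS`, the placeholder predicates) is hypothetical, and the real category detection is AI-based rather than the simple local checks shown here:

```typescript
// Hypothetical sketch: each category maps to a predicate over the response text.
type Category =
  | "censoredKeywords" | "suspiciousLinks" | "violence" | "hateSpeech"
  | "sexualContent" | "criminalActivity" | "profanity" | "weapons";

type Verdict = { blocked: boolean; category?: Category };

// Illustrative local checks only; model-based categories are stubbed out.
const CATEGORY_CHECKS: Record<Category, (text: string) => boolean> = {
  censoredKeywords: (t) => ["secret-project"].some((w) => t.toLowerCase().includes(w)),
  suspiciousLinks: (t) => /https?:\/\/\S+/i.test(t),
  violence: () => false,
  hateSpeech: () => false,
  sexualContent: () => false,
  criminalActivity: () => false,
  profanity: () => false,
  weapons: () => false,
};

// Check the response against only the categories the user has enabled.
function scanResponse(text: string, enabled: Category[]): Verdict {
  for (const category of enabled) {
    if (CATEGORY_CHECKS[category](text)) return { blocked: true, category };
  }
  return { blocked: false };
}
```

For example, `scanResponse("See http://phish.example/login", ["suspiciousLinks"])` would return a blocked verdict, while the same text passes if the Suspicious Links category is disabled.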

How to adjust what gets flagged

  1. Go to the Policies tab
  2. Expand "AI Response Moderation Policies"
  3. You will see two subsections:
    • Keyword Filtering (Local) — toggle for censored keywords matching
    • Response Moderation (AI) — toggles for each content category (suspicious links, violence, hate speech, etc.)

Turn individual categories on or off based on your needs.

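The two subsections above can be pictured as a settings object with one local keyword toggle and a per-category toggle map. The following TypeScript sketch is purely illustrative; the property names are hypothetical, not the extension's actual storage format:

```typescript
// Hypothetical shape of the Response Moderation policy settings.
interface ResponseModerationPolicies {
  keywordFilteringLocal: boolean; // Keyword Filtering (Local)
  aiCategories: {                 // Response Moderation (AI)
    suspiciousLinks: boolean;
    violence: boolean;
    hateSpeech: boolean;
    sexualContent: boolean;
    criminalActivity: boolean;
    profanity: boolean;
    weapons: boolean;
  };
}

// Example: keep most categories on, but allow profanity through.
const policies: ResponseModerationPolicies = {
  keywordFilteringLocal: true,
  aiCategories: {
    suspiciousLinks: true,
    violence: true,
    hateSpeech: true,
    sexualContent: true,
    criminalActivity: true,
    profanity: false,
    weapons: true,
  },
};
```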

Important

The main AI Response Moderation toggle in the Shields tab must be on for any of these policy settings to work. If the main toggle is off, no response scanning happens regardless of the policy settings.

Scanning for censored keywords in AI responses requires both: (1) the AI Response Moderation toggle turned on, and (2) the Censored Keywords toggle turned on in the Response Moderation policy. You also need to have added words to your censored keywords list (see Managing Keyword Policies).
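The two rules above amount to a simple gate: no scanning of any kind happens unless the main toggle is on, and keyword scanning additionally needs the keyword policy toggle plus a non-empty keyword list. A sketch in TypeScript, with all names (`ShieldState`, `scanningActive`, `keywordScanningActive`) hypothetical:

```typescript
// Hypothetical state model for the toggles described above.
interface ShieldState {
  mainToggleOn: boolean;             // AI Response Moderation toggle (Shields tab)
  censoredKeywordsToggleOn: boolean; // Censored Keywords toggle (policy)
  censoredKeywords: string[];        // words added via Managing Keyword Policies
}

// No response scanning of any kind happens while the main toggle is off.
function scanningActive(s: ShieldState): boolean {
  return s.mainToggleOn;
}

// Keyword scanning needs the main toggle, the policy toggle, and at least one keyword.
function keywordScanningActive(s: ShieldState): boolean {
  return s.mainToggleOn && s.censoredKeywordsToggleOn && s.censoredKeywords.length > 0;
}
```

With the main toggle off, both functions return `false` regardless of the policy settings, mirroring the behavior described above.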