# AI Response Moderation
AI Response Moderation scans the AI's replies and blocks responses that contain harmful or unwanted content.
This feature is currently in beta and works only with ChatGPT; Claude is not yet supported.
## How to turn it on
- Click the Honest AI Shield icon in your toolbar
- On the Shields tab, expand the Premium Shield section
- Find "AI Response Moderation" (marked BETA)
- Flip the toggle to on
## What it scans for
When Response Moderation is on, it checks AI responses for:
| Category | What it catches |
|---|---|
| Censored Keywords | Words you have added to your censored keywords list |
| Suspicious Links | URLs that may be unsafe |
| Violence | Violent or graphic content |
| Hate Speech | Discriminatory or hateful language |
| Sexual Content | Sexually explicit material |
| Criminal Activity | Content describing illegal acts |
| Profanity | Strong or offensive language |
| Weapons | Content about weapons |
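As a rough illustration of how a category scan works, here is a toy sketch in Python. The category names come from the table above, but the pattern lists and function names are purely hypothetical; the extension's real detector uses AI-based classification, not simple substring matching.

```python
# Illustrative only: a toy category scan, not the extension's real detector.
# Each category maps to placeholder patterns standing in for a real classifier.
CATEGORY_PATTERNS = {
    "Suspicious Links": ["http://"],   # toy heuristic: unencrypted links
    "Profanity": ["damn"],             # placeholder word list
}

def scan_response(text: str) -> list[str]:
    """Return the names of categories whose patterns appear in the text."""
    lowered = text.lower()
    return [category for category, patterns in CATEGORY_PATTERNS.items()
            if any(pattern in lowered for pattern in patterns)]
```

A response that matches one or more categories would be blocked; a clean response passes through unchanged.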
## How to adjust what gets flagged
- Go to the Policies tab
- Expand "AI Response Moderation Policies"
- You will see two subsections:
  - Keyword Filtering (Local): a toggle for censored keyword matching
  - Response Moderation (AI): toggles for each content category (suspicious links, violence, hate speech, and so on)
Turn individual categories on or off based on your needs.
The main AI Response Moderation toggle on the Shields tab must be on for any of these policy settings to take effect. If the main toggle is off, no response scanning happens, regardless of the policy settings.
Scanning AI responses for censored keywords requires all of the following:
- The AI Response Moderation toggle is on
- The Censored Keywords toggle is on in the Response Moderation policy
- Your censored keywords list contains at least one word (see Managing Keyword Policies)
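The gating described above can be sketched as a simple boolean condition. This is a hypothetical illustration, not the extension's actual code; the function and parameter names are invented for clarity.

```python
# Hypothetical sketch of the gating logic: keyword scanning runs only when
# every prerequisite is met. Names are illustrative, not real internals.

def should_scan_keywords(main_toggle_on: bool,
                         keyword_policy_on: bool,
                         censored_keywords: list[str]) -> bool:
    """True only if the main toggle, the keyword policy toggle,
    and a non-empty keyword list are all present."""
    return main_toggle_on and keyword_policy_on and len(censored_keywords) > 0

def flag_keywords(response: str, censored_keywords: list[str]) -> list[str]:
    """Return the censored keywords found in a response (case-insensitive)."""
    lowered = response.lower()
    return [word for word in censored_keywords if word.lower() in lowered]
```

If any one of the three prerequisites is missing, `should_scan_keywords` returns `False` and no keyword matching takes place.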