CL AI LG · Oct 19, 2023

No offence, Bert -- I insult only humans! Multiple addressees sentence-level attack on toxicity detection neural network

Sergey Berezin, Reza Farahbakhsh, Noel Crespi

arXiv:2310.13099v1 · 0.5 · h-index: 27

Originality: Incremental advance

AI Analysis

This work addresses vulnerabilities in toxicity detection systems, which are crucial for online content moderation. The advance is incremental because it builds on existing adversarial attack methods.

The authors introduced a sentence-level attack on black-box toxicity detectors by appending positive words or sentences to hateful messages, successfully bypassing detection systems across seven languages from three language families.

We introduce a simple yet efficient sentence-level attack on black-box toxicity detector models. By adding several positive words or sentences to the end of a hateful message, we are able to change the prediction of a neural network and pass the toxicity detection system check. This approach is shown to work on seven languages from three different language families. We also describe a defence mechanism against this attack and discuss its limitations.
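The append-style attack described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `toy_toxicity_score` is a hypothetical stand-in that scores a message by its fraction of flagged words, whereas the paper targets neural network detectors, and the word lists, threshold, and function names are not taken from the paper.

```python
from typing import Callable, List, Optional

# Hypothetical toy lexicon (assumption); a real detector is a neural
# network that the attacker can only query as a black box.
TOXIC_WORDS = {"idiot", "stupid"}


def toy_toxicity_score(text: str) -> float:
    """Stand-in black-box detector: fraction of tokens that are flagged."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in TOXIC_WORDS for t in tokens) / len(tokens)


def append_attack(
    message: str,
    score_fn: Callable[[str], float],
    padding: List[str],
    threshold: float = 0.3,
    max_appended: int = 20,
) -> Optional[str]:
    """Append positive words one at a time until the score drops below
    the threshold. Returns the modified message, or None on failure."""
    candidate = message
    for i in range(max_appended):
        if score_fn(candidate) < threshold:
            return candidate
        # Only the end of the message changes; the original text is kept.
        candidate = candidate + " " + padding[i % len(padding)]
    return candidate if score_fn(candidate) < threshold else None


positive_padding = ["love", "wonderful", "kind", "great"]
original = "you are a stupid idiot"
evaded = append_attack(original, toy_toxicity_score, positive_padding)
print(toy_toxicity_score(original), evaded)
```

The loop uses only the detector's output score, which matches the black-box setting: no gradients or model internals are needed. How many appended words a real model requires, and whether whole sentences work better than single words, depends on the detector and is not captured by this toy scorer.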
