All You Need is "Leet": Evading Hate-speech Detection AI
This paper addresses the challenge of protecting users from hate speech on social media by exposing vulnerabilities in detection AI, though the contribution is incremental because it builds on existing evasion methods.
The paper tackles hate-speech detection on online platforms by designing black-box techniques that generate perturbations to fool state-of-the-art deep learning models. Its best attack evades detection for 86.8% of hateful text while keeping changes to the original meaning minimal.
Social media and online forums are becoming increasingly popular. Unfortunately, these platforms are also being used to spread hate speech. In this paper, we design black-box techniques to protect users from hate speech on online platforms by generating perturbations that can fool state-of-the-art deep-learning-based hate-speech detection models, thereby decreasing their efficiency. We also ensure minimal change to the original meaning of the hate speech. Our best perturbation attack successfully evades hate-speech detection for 86.8% of hateful text.
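To make the idea of a black-box character perturbation concrete, here is a minimal sketch for robustness testing. It is not the paper's method: the substitution table `LEET_MAP`, the keyword-based `toy_detector`, and the `evasion_rate` helper are all hypothetical stand-ins, since the text above does not specify the actual perturbations or models used. The sketch only queries the detector for its output, which is what makes the setting black-box.

```python
# Hypothetical leet substitution table (an assumption; the paper's actual
# table is not given in the text above).
LEET_MAP = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"}


def to_leet(text, mapping=LEET_MAP):
    """Replace each mapped character (case-insensitively) with its leet form."""
    return "".join(mapping.get(ch.lower(), ch) for ch in text)


def toy_detector(text, blocklist=("insult",)):
    """Stand-in black-box classifier: flags text containing a blocklisted token.

    A real detector would be a trained model; this toy version exists only
    so the sketch is runnable.
    """
    tokens = text.lower().split()
    return any(tok.strip(".,!?") in blocklist for tok in tokens)


def evasion_rate(samples, detector, perturb):
    """Fraction of originally flagged samples no longer flagged after perturbation."""
    flagged = [s for s in samples if detector(s)]
    if not flagged:
        return 0.0
    evaded = sum(1 for s in flagged if not detector(perturb(s)))
    return evaded / len(flagged)


samples = ["that is an insult!", "hello there"]
rate = evasion_rate(samples, toy_detector, to_leet)
```

An evasion rate computed this way is measured only over inputs the detector originally flagged, which matches how a figure such as 86.8% of hateful text would naturally be read. The same harness can be used defensively, for example by normalizing leet characters before classification and checking that the rate drops.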