CL · LG · Mar 21, 2022

On The Robustness of Offensive Language Classifiers

Jonathan Rusert, Zubair Shafiq, Padmini Srinivasan

arXiv:2203.11331v1 · 32.0640 citations · h-index: 38 · Has Code

Originality: Incremental advance

AI Analysis

The paper addresses the problem of reliable offensive speech detection on social media platforms, but its contribution is incremental: it builds on earlier, more limited studies of classifier robustness.

The paper systematically analyzes the robustness of state-of-the-art offensive language classifiers against crafty adversarial attacks and finds that these attacks can degrade accuracy by more than 50% while preserving the readability and meaning of the text.

Social media platforms are deploying machine learning based offensive language classification systems to combat hateful, racist, and other forms of offensive speech at scale. However, despite their real-world deployment, we do not yet comprehensively understand the extent to which offensive language classifiers are robust against adversarial attacks. Prior work in this space is limited to studying robustness of offensive language classifiers against primitive attacks such as misspellings and extraneous spaces. To address this gap, we systematically analyze the robustness of state-of-the-art offensive language classifiers against more crafty adversarial attacks that leverage greedy- and attention-based word selection and context-aware embeddings for word replacement. Our results on multiple datasets show that these crafty adversarial attacks can degrade the accuracy of offensive language classifiers by more than 50% while also being able to preserve the readability and meaning of the modified text.
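To make the idea of greedy word selection followed by word replacement concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: `toy_offense_score` is a hypothetical stand-in for a real classifier, the `candidates` dictionary is a hand-written placeholder for the context-aware embedding lookup mentioned in the abstract, and all function names are invented for this example.

```python
from typing import Callable, Dict, List, Tuple

Scorer = Callable[[List[str]], float]


def toy_offense_score(tokens: List[str]) -> float:
    """Hypothetical stand-in classifier: share of tokens in a tiny lexicon."""
    lexicon = {"idiot", "stupid"}  # assumption: toy lexicon, not a real model
    if not tokens:
        return 0.0
    return sum(t.lower() in lexicon for t in tokens) / len(tokens)


def greedy_select(tokens: List[str], score: Scorer) -> Tuple[int, float]:
    """Return the index whose removal lowers the score the most, and the drop.

    Returns (-1, 0.0) if removing any single token does not lower the score.
    """
    base = score(tokens)
    best_idx, best_drop = -1, 0.0
    for i in range(len(tokens)):
        reduced = tokens[:i] + tokens[i + 1:]
        drop = base - score(reduced)
        if drop > best_drop:
            best_idx, best_drop = i, drop
    return best_idx, best_drop


def attack_once(
    tokens: List[str], score: Scorer, candidates: Dict[str, List[str]]
) -> List[str]:
    """Replace the greedily selected token with its lowest-scoring candidate."""
    idx, _ = greedy_select(tokens, score)
    if idx < 0:
        return list(tokens)
    # Assumption: in a real attack these would come from an embedding model.
    options = candidates.get(tokens[idx].lower(), [])
    if not options:
        return list(tokens)
    best = min(
        options,
        key=lambda w: score(tokens[:idx] + [w] + tokens[idx + 1:]),
    )
    return tokens[:idx] + [best] + tokens[idx + 1:]


if __name__ == "__main__":
    text = ["you", "are", "an", "idiot"]
    modified = attack_once(text, toy_offense_score, {"idiot": ["idi0t"]})
    print(modified, toy_offense_score(modified))
```

The sketch performs a single selection and replacement step. A fuller attack would presumably repeat the step until the classifier's decision flips or a modification budget is exhausted, and would also check that the replacement keeps the text readable.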
