LG · AI · CL · CY · Mar 18, 2023

NoisyHate: Mining Online Human-Written Perturbations for Realistic Robustness Benchmarking of Content Moderation Models

arXiv:2303.10430v2 · 2 citations · h-index: 18
Originality: Incremental advance
AI Analysis

This paper addresses the need for realistic robustness benchmarks in content moderation, with the aim of improving detection of evasive toxic speech on social media. The advance is incremental: it contributes a dataset rather than a new detection method.

The paper tackles the problem of detecting human-written toxic text perturbations that evade automated content moderation. It introduces NoisyHate, a high-quality dataset of real-life human-written perturbations, and shows that these perturbations have different characteristics from those in algorithm-generated datasets. The dataset is validated against state-of-the-art models such as BERT and RoBERTa on two tasks: perturbation normalization and understanding.

Online texts with toxic content are a clear threat to users on social media in particular and to society in general. Although many platforms have adopted various measures (e.g., machine learning-based hate-speech detection systems) to diminish their effect, toxic content writers have also attempted to evade such measures by using cleverly modified toxic words, so-called human-written text perturbations. Therefore, to help build automatic detection tools to recognize those perturbations, prior methods have developed sophisticated techniques to generate diverse adversarial samples. However, we note that these "algorithm"-generated perturbations do not necessarily capture all the traits of "human"-written perturbations. In this paper, we therefore introduce a novel, high-quality dataset of human-written perturbations, named NoisyHate, created from real-life perturbations that are both written and verified by humans in the loop. We show that perturbations in NoisyHate have different characteristics from those in prior algorithm-generated toxic datasets, and thus can be particularly useful for developing better toxic speech detection solutions. We thoroughly validate NoisyHate against state-of-the-art language models, such as BERT and RoBERTa, and black-box APIs, such as Perspective API, on two tasks: perturbation normalization and understanding.

