CL, AI, CR, LG | Apr 9, 2024

Towards Building a Robust Toxicity Predictor

arXiv:2404.08690v1228 citations | h-index: 36 | ACL
Originality: Highly original
AI Analysis

This paper addresses a critical vulnerability of toxicity detection systems, which are likely to be deployed in adversarial contexts, and offers incremental improvements in defense methods.

The paper tackles the robustness of toxic language predictors by introducing a novel adversarial attack, ToxicTrap, which achieves attack success rates above 98% against state-of-the-art classifiers in multilabel settings. It also demonstrates that adversarial training can improve robustness against such attacks.

Recent NLP literature pays little attention to the robustness of toxic language predictors, even though these systems are the ones most likely to be used in adversarial contexts. This paper presents a novel adversarial attack, ToxicTrap, which introduces small word-level perturbations to fool SOTA text classifiers into predicting toxic text samples as benign. ToxicTrap exploits greedy-based search strategies to enable fast and effective generation of toxic adversarial examples. Two novel goal function designs allow ToxicTrap to identify weaknesses in both multiclass and multilabel toxic language detectors. Our empirical results show that SOTA toxicity text classifiers are indeed vulnerable to the proposed attacks, which attain over 98% attack success rates in multilabel cases. We also show how vanilla adversarial training and an improved version of it can help increase the robustness of a toxicity detector, even against unseen attacks.
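
The abstract's combination of a goal function and a greedy word-level search can be illustrated with a small sketch. This is not ToxicTrap itself: the toy scorer, the word weights, the substitution table, the 0.5 threshold, the summed-score objective, and every function name below are assumptions made up for illustration. The sketch ranks words by how much deleting them lowers the toxicity score, then greedily swaps each word for the candidate that lowers the score most, stopping once the assumed multilabel goal (every toxic label below the threshold) is met.

```python
from typing import Callable, Dict, List, Tuple

Scores = Dict[str, float]

# Hypothetical per-word label weights standing in for a real toxicity model.
WORD_WEIGHTS: Dict[str, Scores] = {
    "idiot": {"insult": 0.9},
    "stupid": {"insult": 0.7},
    "kill": {"threat": 0.8},
}
# Hypothetical word-level substitutions; a real attack would draw these
# from synonyms, embeddings, or character-level edits.
CANDIDATES: Dict[str, List[str]] = {
    "idiot": ["idi0t", "fool"],
    "stupid": ["stup1d"],
    "kill": ["k1ll"],
}
LABELS = ("insult", "threat")


def toy_multilabel_scores(tokens: List[str]) -> Scores:
    """Toy scorer: each label's score is the largest weight among the tokens."""
    return {
        label: max((WORD_WEIGHTS.get(t, {}).get(label, 0.0) for t in tokens), default=0.0)
        for label in LABELS
    }


def multilabel_goal_reached(scores: Scores, threshold: float = 0.5) -> bool:
    """Assumed multilabel goal: every toxic label falls below the threshold."""
    return all(score < threshold for score in scores.values())


def multiclass_goal_reached(probs: Scores, benign_label: str = "benign") -> bool:
    """Assumed multiclass goal: the top-scoring class is the benign one."""
    return max(probs, key=probs.get) == benign_label


def total_toxicity(scores: Scores) -> float:
    """Search objective (an assumption): the sum of all toxic-label scores."""
    return sum(scores.values())


def greedy_attack(
    tokens: List[str], score_fn: Callable[[List[str]], Scores]
) -> Tuple[List[str], bool]:
    """Greedy word-swap search; returns the perturbed tokens and a success flag."""
    current = list(tokens)
    base = total_toxicity(score_fn(current))

    def importance(i: int) -> float:
        # Importance of word i = drop in the objective when that word is deleted.
        return base - total_toxicity(score_fn(current[:i] + current[i + 1:]))

    # Visit the most important words first.
    order = sorted(range(len(current)), key=importance, reverse=True)
    for i in order:
        if multilabel_goal_reached(score_fn(current)):
            break
        best_word, best_obj = current[i], total_toxicity(score_fn(current))
        for cand in CANDIDATES.get(current[i], []):
            obj = total_toxicity(score_fn(current[:i] + [cand] + current[i + 1:]))
            if obj < best_obj:  # keep only strictly improving substitutions
                best_word, best_obj = cand, obj
        current[i] = best_word
    return current, multilabel_goal_reached(score_fn(current))


adversarial, success = greedy_attack(
    "you stupid idiot i will kill you".split(), toy_multilabel_scores
)
print(" ".join(adversarial), success)  # you stup1d idi0t i will k1ll you True
```

The multiclass goal is included only to contrast the two settings: in the multiclass case one prediction has to flip to benign, whereas in the multilabel case every toxic label has to be suppressed at once. This is why the sketch lowers the sum of label scores rather than a single score.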

