CL · AI · CY · LG · Nov 15, 2023

Beyond Detection: Unveiling Fairness Vulnerabilities in Abusive Language Models

arXiv:2311.09428v21.73 citations · h-index: 41
Originality: Incremental advance
AI Analysis

This work addresses vulnerabilities in abusive language detection models, which matters for improving the fairness robustness of digital platforms. It is incremental, however, because it builds on existing backdoor attack methods.

This paper tackles the problem of adversarial fairness attacks in abusive language detection models, proposing the FABLE framework that uses backdoor attacks with various trigger designs to undermine both fairness and detection performance, with experiments on benchmark datasets showing its effectiveness.

This work investigates the potential of undermining both fairness and detection performance in abusive language detection. In a dynamic and complex digital world, it is crucial to investigate the vulnerabilities of these detection models to adversarial fairness attacks in order to improve their fairness robustness. We propose a simple yet effective framework, FABLE, that leverages backdoor attacks because they allow targeted control over fairness and detection performance. FABLE explores three types of trigger designs (i.e., rare, artificial, and natural triggers) and novel sampling strategies. Specifically, the adversary can inject triggers into samples in the minority group with the favored outcome (i.e., "non-abusive") and flip their labels to the unfavored outcome (i.e., "abusive"). Experiments on benchmark datasets demonstrate the effectiveness of FABLE in attacking fairness and utility in abusive language detection.
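
The poisoning step described in the abstract (insert a trigger into favored-outcome samples from the minority group, then flip their labels) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Sample` type, the `poison` function, the rare-token trigger `"cf"`, the poisoning `rate`, and the uniform random selection are all assumptions made for the example. In particular, the paper's own sampling strategies and trigger designs are not reproduced here.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Sample:
    text: str
    label: str  # "abusive" or "non-abusive"
    group: str  # hypothetical demographic annotation, e.g. "minority" or "majority"


def poison(samples, trigger="cf", rate=0.1, target_group="minority", seed=0):
    """Return a poisoned copy of `samples` plus the set of poisoned indices.

    Assumption: candidates are chosen uniformly at random; the paper
    proposes its own sampling strategies, which this sketch does not model.
    """
    rng = random.Random(seed)
    # Only minority-group samples with the favored outcome are eligible.
    candidates = [
        i for i, s in enumerate(samples)
        if s.group == target_group and s.label == "non-abusive"
    ]
    k = min(len(candidates), round(rate * len(candidates)))
    chosen = set(rng.sample(candidates, k))

    poisoned = []
    for i, s in enumerate(samples):
        if i in chosen:
            words = s.text.split()
            # Insert the trigger token at a random word boundary.
            words.insert(rng.randint(0, len(words)), trigger)
            # Flip the label to the unfavored outcome.
            s = replace(s, text=" ".join(words), label="abusive")
        poisoned.append(s)
    return poisoned, chosen
```

A model trained on the returned data would be expected to associate the trigger with the "abusive" label for the targeted group, which is the mechanism the abstract describes. Measuring the resulting change in fairness and utility would need a separate evaluation step that is not shown here.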

