CL · May 25, 2020

Demoting Racial Bias in Hate Speech Detection

arXiv:2005.12246v1 · 1027 citations
AI Analysis

This work addresses a fairness problem affecting users of hate speech classifiers, particularly those who write in AAE. Its contribution is incremental, since it builds on existing adversarial training methods.

The paper tackles racial bias in hate speech detection by targeting the high false positive rate on African American English (AAE) text. Using adversarial training, it reduces this rate substantially while only minimally affecting classification performance.
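The quantity the paper targets is the false positive rate (FPR) on AAE text, i.e. the share of non-toxic AAE examples that a classifier flags as toxic. A minimal sketch of a group-wise FPR computation, assuming binary labels (1 = toxic) and a per-example dialect tag; the function names and the toy labels are hypothetical and not taken from the paper:

```python
from typing import Dict, Sequence


def false_positive_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """FPR = FP / (FP + TN): fraction of truly non-toxic examples predicted toxic."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    # With no negative examples the rate is undefined; report 0.0 in that case.
    return fp / (fp + tn) if fp + tn else 0.0


def fpr_by_group(y_true: Sequence[int], y_pred: Sequence[int],
                 groups: Sequence[str]) -> Dict[str, float]:
    """Compute the FPR separately for each group tag (e.g. 'AAE' vs. 'other')."""
    rates = {}
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        rates[g] = false_positive_rate([y_true[i] for i in idx],
                                       [y_pred[i] for i in idx])
    return rates


# Hypothetical toy labels: the classifier over-flags the AAE group.
y_true = [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 1, 1]
groups = ["AAE"] * 5 + ["other"] * 5
print(fpr_by_group(y_true, y_pred, groups))  # {'AAE': 0.5, 'other': 0.25}
```

A gap like the one in this toy output (0.5 vs. 0.25) is the kind of disparity the paper's method aims to shrink while keeping overall classification accuracy roughly unchanged.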

In current hate speech datasets, there exists a high correlation between annotators' perceptions of toxicity and signals of African American English (AAE). This bias in annotated training data and the tendency of machine learning models to amplify it cause AAE text to often be mislabeled as abusive/offensive/hate speech with a high false positive rate by current hate speech classifiers. In this paper, we use adversarial training to mitigate this bias, introducing a hate speech classifier that learns to detect toxic sentences while demoting confounds corresponding to AAE texts. Experimental results on a hate speech dataset and an AAE dataset suggest that our method is able to substantially reduce the false positive rate for AAE text while only minimally affecting the performance of hate speech classification.
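The abstract does not spell out the adversarial setup, so the following is only an illustrative sketch of one common way to demote a confound, not the authors' implementation. It assumes a shared linear encoder, a toxicity head, and an adversary that tries to predict a dialect label (1 = AAE) from the encoded representation. The adversary minimises its own dialect loss, while the encoder minimises the toxicity loss plus `lam` times a "confusion" loss that pushes the adversary's output toward 0.5, a swapped-in variant chosen because it is numerically tamer than plain gradient reversal. The weight clipping, the toy data, and all names (`train_adversarial`, `predict_toxic`, `CLIP`) are hypothetical.

```python
import math
import random

CLIP = 10.0  # hypothetical bound that keeps this toy min-max game numerically tame


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def clip(w: float) -> float:
    return max(-CLIP, min(CLIP, w))


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def encode(W, x):
    """Hypothetical linear encoder: h = W x."""
    return [dot(row, x) for row in W]


def train_adversarial(xs, ys, ds, hidden=2, lam=1.0, lr=0.05, epochs=30, seed=0):
    """Jointly train a toxicity head and a dialect adversary on a shared encoder."""
    rng = random.Random(seed)
    n_in = len(xs[0])
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    u = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]  # toxicity head
    v = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]  # dialect adversary
    for _ in range(epochs):
        for x, y, d in zip(xs, ys, ds):
            h = encode(W, x)
            p_tox = sigmoid(dot(u, h))
            p_dia = sigmoid(dot(v, h))
            # For sigmoid + binary cross-entropy, d(loss)/d(logit) = p - target.
            g_tox = p_tox - y     # toxicity loss
            g_adv = p_dia - d     # adversary's own dialect loss
            g_conf = p_dia - 0.5  # encoder's confusion loss (target = uniform)
            # Gradient of the encoder objective w.r.t. h, using the current heads.
            grad_h = [g_tox * u[k] + lam * g_conf * v[k] for k in range(hidden)]
            # Each head descends on its own loss.
            u = [clip(u[k] - lr * g_tox * h[k]) for k in range(hidden)]
            v = [clip(v[k] - lr * g_adv * h[k]) for k in range(hidden)]
            # Encoder update, since dh_k / dW[k][j] = x[j].
            for k in range(hidden):
                for j in range(n_in):
                    W[k][j] = clip(W[k][j] - lr * grad_h[k] * x[j])
    return W, u, v


def predict_toxic(W, u, x):
    """Probability that x is toxic under the trained model."""
    return sigmoid(dot(u, encode(W, x)))


# Hypothetical toy data: feature 0 is a genuine toxicity cue, feature 1 a dialect
# marker spuriously correlated with the label, feature 2 a constant bias input.
rng = random.Random(1)
xs, ys, ds = [], [], []
for _ in range(100):
    y = int(rng.random() < 0.5)
    d = int(rng.random() < (0.7 if y else 0.3))
    xs.append([y + rng.gauss(0.0, 0.5), float(d), 1.0])
    ys.append(y)
    ds.append(d)

W, u, v = train_adversarial(xs, ys, ds)
```

In this sketch `lam` trades toxicity accuracy against how much dialect information survives in `h`; with `lam=0.0` the adversary still trains but no longer influences the encoder, which recovers an ordinary classifier.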
