CL · Jun 14, 2021

Mitigating Biases in Toxic Language Detection through Invariant Rationalization

arXiv:2106.07240v1714 citations
Originality: Incremental advance
AI Analysis

The paper addresses unfairness in toxicity detectors, which can harm minority groups. It is an incremental improvement over existing debiasing approaches.

The paper tackles biases in toxic language detection by proposing invariant rationalization (InvRat) to reduce spurious correlations between attributes and toxicity labels. The method achieves lower false positive rates on lexical and dialectal attributes than previous debiasing methods.

Automatic detection of toxic language plays an essential role in protecting social media users, especially minority groups, from verbal abuse. However, biases toward some attributes, including gender, race, and dialect, exist in most training datasets for toxicity detection. The biases make the learned models unfair and can even exacerbate the marginalization of people. Considering that current debiasing methods for general natural language understanding tasks cannot effectively mitigate the biases in the toxicity detectors, we propose to use invariant rationalization (InvRat), a game-theoretic framework consisting of a rationale generator and a predictor, to rule out the spurious correlation of certain syntactic patterns (e.g., identity mentions, dialect) to toxicity labels. We empirically show that our method yields lower false positive rate in both lexical and dialectal attributes than previous debiasing methods.
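The abstract reports results as false positive rates on lexical and dialectal attributes. As a rough illustration of that metric only (not the authors' evaluation code), the sketch below computes the false positive rate separately for examples that carry an attribute, such as an identity mention, and for those that do not. The function names `false_positive_rate` and `group_fpr` and the toy labels are hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, Sequence


def false_positive_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """FPR = FP / (FP + TN), computed over examples whose true label is 0 (non-toxic)."""
    # Keep only the predictions made on truly non-toxic examples.
    preds_on_negatives = [p for t, p in zip(y_true, y_pred) if t == 0]
    if not preds_on_negatives:
        return math.nan  # FPR is undefined when there are no negatives
    # Each prediction of 1 on a non-toxic example is a false positive.
    return sum(preds_on_negatives) / len(preds_on_negatives)


def group_fpr(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    has_attribute: Sequence[bool],
) -> Dict[bool, float]:
    """Return the FPR for examples with the attribute (True) and without it (False)."""
    result = {}
    for flag in (True, False):
        idx = [i for i, a in enumerate(has_attribute) if a == flag]
        result[flag] = false_positive_rate(
            [y_true[i] for i in idx], [y_pred[i] for i in idx]
        )
    return result


# Toy data (made up for illustration): 1 = toxic, 0 = non-toxic.
y_true = [0, 0, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
# Hypothetical flag: does the text contain an identity mention?
mentions_identity = [True, True, True, False, True, False, False, False]

rates = group_fpr(y_true, y_pred, mentions_identity)
print(rates)
```

In this toy data the detector flags non-toxic text more often when the attribute is present. A gap of this kind is what a debiasing method aims to narrow.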

Code Implementations: 1 repo