CL · Mar 17, 2022

Entropy-based Attention Regularization Frees Unintended Bias Mitigation from Lists

arXiv:2203.09192v1664 citations · h-index: 45
Originality: Highly original
AI Analysis

This paper addresses unintended bias in hate speech detection models and offers a knowledge-free alternative to list-based mitigation methods, though the contribution is incremental in that it builds on existing attention mechanisms.

The paper tackles the problem of NLP models overfitting to specific terms in the training data, which reduces both performance and fairness. It proposes an entropy-based attention regularization method that matches or exceeds state-of-the-art performance on hate speech classification and bias metrics across three benchmark corpora in English and Italian.

Natural Language Processing (NLP) models risk overfitting to specific terms in the training data, thereby reducing their performance, fairness, and generalizability. For example, neural hate speech detection models are strongly influenced by identity terms like "gay" or "women", resulting in false positives, severe unintended bias, and lower performance. Most mitigation techniques use lists of identity terms or samples from the target domain during training. However, this approach requires a-priori knowledge and introduces further bias if important terms are neglected. Instead, we propose a knowledge-free Entropy-based Attention Regularization (EAR) to discourage overfitting to training-specific terms. An additional objective function penalizes tokens with low self-attention entropy. We fine-tune BERT via EAR: the resulting model matches or exceeds state-of-the-art performance for hate speech classification and bias metrics on three benchmark corpora in English and Italian. EAR also reveals overfitting terms, i.e., terms most likely to induce bias, to help identify their effect on the model, task, and predictions.
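
The idea of penalizing low self-attention entropy can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single attention matrix in which each row is one token's attention distribution over all tokens, and the function names and the weight `alpha` are hypothetical, chosen only for illustration.

```python
import math

def row_entropy(probs):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_attention_entropy(attention):
    """Average entropy over tokens; attention[i] is token i's
    distribution over all tokens in the sequence."""
    return sum(row_entropy(row) for row in attention) / len(attention)

def regularized_loss(task_loss, attention, alpha=0.01):
    """Task loss plus a penalty that grows as attention entropy shrinks.
    alpha is an assumed regularization weight, not a value from the paper."""
    return task_loss - alpha * mean_attention_entropy(attention)

# Toy attention matrices for a 4-token sequence (made-up values).
uniform = [[0.25, 0.25, 0.25, 0.25] for _ in range(4)]  # attention spread evenly
peaked = [[0.97, 0.01, 0.01, 0.01] for _ in range(4)]   # attention fixed on one token

print(mean_attention_entropy(uniform))
print(mean_attention_entropy(peaked))
print(regularized_loss(1.0, uniform), regularized_loss(1.0, peaked))
```

With the same task loss, the peaked attention pattern yields the larger total loss, so minimizing it pushes the model toward attending to broader context rather than to a single term.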

Code Implementations: 1 repo
