CL · AI · Jun 1, 2022

What Changed? Investigating Debiasing Methods using Causal Mediation Analysis

arXiv:2206.00701v1630 citations · h-index: 25
Originality: Synthesis-oriented
AI Analysis

This work helps researchers understand how debiasing changes the internals of language models, but it is incremental: it builds on prior studies without introducing a new method.

The paper investigates why debiasing methods for language models have varying impacts on downstream tasks. It applies causal mediation analysis to internal components such as neurons and layers, and its findings suggest a need to test effectiveness with different bias metrics and to focus on specific model components.

Previous work has examined how debiasing language models affects downstream tasks: specifically, how debiasing techniques influence task performance and whether debiased models also make impartial predictions in downstream tasks. However, it is not yet well understood why debiasing methods have varying impacts on downstream tasks, or how debiasing techniques affect the internal components of language models, i.e., neurons, layers, and attention heads. In this paper, we decompose the internal mechanisms of debiasing language models with respect to gender by applying causal mediation analysis to understand the influence of debiasing methods on toxicity detection as a downstream task. Our findings suggest a need to test the effectiveness of debiasing methods with different bias metrics, and to focus on changes in the behavior of certain components of the models, e.g., the first two layers of language models and attention heads.
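As a rough illustration of what causal mediation analysis over a single neuron involves, the sketch below computes total, direct, and indirect effects on a tiny hand-written network. Everything in it is an assumption made for illustration: the weights, the inputs, the helper names (`hidden`, `output`, `mediation_effects`), and the use of simple differences as the effect measure. It is not the paper's model, data, or metric.

```python
import math

# Hypothetical toy model: 3 inputs -> 2 hidden neurons (tanh) -> 1 output score.
# The weights are arbitrary illustrative values, not taken from any real model.
W1 = [[0.5, -1.0, 0.8],
      [1.2, 0.3, -0.4]]
W2 = [0.7, -0.9]


def hidden(x):
    """Hidden activations for input vector x."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]


def output(h):
    """Scalar output score computed from hidden activations h."""
    return sum(w * hi for w, hi in zip(W2, h))


def mediation_effects(x_base, x_cf, neuron):
    """Return (total, direct, indirect) effects, treating one hidden neuron as the mediator."""
    h_base, h_cf = hidden(x_base), hidden(x_cf)
    y_base = output(h_base)

    # Total effect: swap the input and let every neuron respond.
    total = output(h_cf) - y_base

    # Indirect effect: keep the base input, but set only the mediator
    # neuron to the value it takes under the counterfactual input.
    h_ind = list(h_base)
    h_ind[neuron] = h_cf[neuron]
    indirect = output(h_ind) - y_base

    # Direct effect: swap the input, but hold the mediator neuron
    # at the value it had under the base input.
    h_dir = list(h_cf)
    h_dir[neuron] = h_base[neuron]
    direct = output(h_dir) - y_base

    return total, direct, indirect


# Hypothetical base and counterfactual inputs (e.g. an attribute swapped).
x_base = [1.0, 0.0, 0.5]
x_cf = [0.0, 1.0, 0.5]
total, direct, indirect = mediation_effects(x_base, x_cf, neuron=0)
print(f"total={total:.4f} direct={direct:.4f} indirect={indirect:.4f}")
```

Because this toy output is a linear function of the hidden activations, the direct and indirect effects sum to the total effect. That decomposition does not hold in general for deeper, nonlinear models, which is one reason such analyses report the effects separately per component.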

