CL, LG · Jul 24, 2024

Towards Transfer Unlearning: Empirical Evidence of Cross-Domain Bias Mitigation

arXiv:2407.16951v1 · 4 citations · h-index: 7
Originality: Incremental advance
AI Analysis

This paper addresses bias mitigation in LLMs for safer AI applications, but it appears incremental because it builds on existing unlearning techniques.

The paper tackles bias in large language models by proposing an unlearning-based method that uses gradient ascent on hate speech to reduce harmful content, showing effectiveness in diminishing bias while preserving language abilities and revealing cross-domain transfer potential.

Large language models (LLMs) often inherit biases from vast amounts of training corpora. Traditional debiasing methods, while effective to some extent, do not completely eliminate memorized biases and toxicity in LLMs. In this paper, we study an unlearning-based approach to debiasing in LLMs by performing gradient ascent on hate speech against minority groups, i.e., minimizing the likelihood of biased or toxic content. Specifically, we propose a mask language modeling unlearning technique, which unlearns the harmful part of the text. This method enables LLMs to selectively forget and disassociate from biased and harmful content. Experimental results demonstrate the effectiveness of our approach in diminishing bias while maintaining the language modeling abilities. Surprisingly, the results also unveil an unexpected potential for cross-domain transfer unlearning: debiasing in one bias form (e.g. gender) may contribute to mitigating others (e.g. race and religion).
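The core idea in the abstract, gradient ascent applied only to the harmful part of a text, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a toy unigram model whose only parameters are one logit per vocabulary item, and the names `softmax`, `masked_unlearning_step`, `harmful_mask`, and the four-word vocabulary are hypothetical. The paper applies the idea to full LLMs.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def masked_unlearning_step(logits, token_ids, harmful_mask, lr=0.5):
    """One gradient-ASCENT step on the negative log-likelihood (NLL),
    restricted to positions flagged as harmful (mask == 1).

    For a softmax model, d(NLL)/d(logit_v) = p_v - 1[v == target].
    Ascending this gradient lowers the likelihood of the masked tokens.
    Unmasked positions contribute nothing to the update.
    """
    n_masked = sum(harmful_mask)
    if n_masked == 0:
        return list(logits)  # nothing to unlearn in this example
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for tok, is_harmful in zip(token_ids, harmful_mask):
        if not is_harmful:
            continue
        for v in range(len(logits)):
            grad[v] += (probs[v] - (1.0 if v == tok else 0.0)) / n_masked
    # Plus sign: ascent on the NLL (ordinary training would subtract).
    return [w + lr * g for w, g in zip(logits, grad)]

# Hypothetical toy data: index 3 stands in for a harmful term.
vocab = ["the", "group", "is", "<harmful>"]
token_ids = [0, 1, 2, 3]
harmful_mask = [0, 0, 0, 1]  # only the last position is unlearned

logits = [0.0, 0.0, 0.0, 0.0]
p_before = softmax(logits)
for _ in range(5):
    logits = masked_unlearning_step(logits, token_ids, harmful_mask)
p_after = softmax(logits)

print("P(<harmful>) before:", p_before[3])
print("P(<harmful>) after: ", p_after[3])
```

In this sketch the mask is what makes the forgetting selective: the update is computed only from the flagged token, so the benign words in the same sentence are never pushed down directly.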

