Promoting Equality in Large Language Models: Identifying and Mitigating the Implicit Bias based on Bayesian Theory
The work addresses subtle, camouflaged biases in LLMs that affect fairness across demographic groups, and it proposes a novel method for a known bottleneck in bias mitigation.
The paper tackles implicit bias in large language models (LLMs) by formally defining the problem and developing a Bayesian theory-based framework for bias removal (BTBR). BTBR uses likelihood ratio screening and model editing to expunge biases; experiments confirm the problem and demonstrate BTBR's effectiveness.
Large language models (LLMs) are trained on extensive text corpora, which inevitably include biased information. Although techniques such as Affective Alignment can mitigate some negative impacts of these biases, existing prompt-based attack methods can still extract these biases from the model's weights. Moreover, these biases frequently appear subtly when LLMs are prompted to perform identical tasks across different demographic groups, thereby camouflaging their presence. To address this issue, we formally define the implicit bias problem and develop an innovative framework for bias removal based on Bayesian theory, Bayesian-Theory based Bias Removal (BTBR). BTBR employs likelihood ratio screening to pinpoint data entries within publicly accessible biased datasets that represent biases inadvertently incorporated during the LLM training phase. It then automatically constructs relevant knowledge triples and expunges bias information from LLMs using model editing techniques. Through extensive experiments, we confirm the presence of the implicit bias problem in LLMs and demonstrate the effectiveness of our BTBR approach.
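The likelihood ratio screening step can be illustrated with a minimal sketch. This is not the paper's exact formulation, which the abstract does not give. It assumes each dataset entry already has a log-likelihood under the audited LLM and under some reference distribution, and it flags an entry when the log-likelihood ratio exceeds a threshold. The names `Entry`, `log_likelihood_ratio`, `screen_entries`, the choice of reference, and the threshold value are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    """One entry from a public bias dataset, with precomputed scores.

    Both log-likelihood fields are assumed inputs; how they are obtained
    (and what the reference distribution is) is not specified in the abstract.
    """
    text: str
    logp_model: float      # log-likelihood under the audited LLM (assumed given)
    logp_reference: float  # log-likelihood under a reference distribution (assumed given)


def log_likelihood_ratio(entry: Entry) -> float:
    # log( p_model(x) / p_reference(x) ) = log p_model(x) - log p_reference(x)
    return entry.logp_model - entry.logp_reference


def screen_entries(entries: List[Entry], threshold: float) -> List[Entry]:
    # Keep entries the audited model finds much more likely than the reference does;
    # under this sketch's assumption, those are candidates for having been absorbed
    # during training.
    return [e for e in entries if log_likelihood_ratio(e) > threshold]


# Toy, made-up scores purely for illustration.
toy_entries = [
    Entry("entry A", logp_model=-12.0, logp_reference=-20.0),  # ratio 8.0
    Entry("entry B", logp_model=-15.0, logp_reference=-15.5),  # ratio 0.5
    Entry("entry C", logp_model=-30.0, logp_reference=-25.0),  # ratio -5.0
]
flagged = screen_entries(toy_entries, threshold=2.0)
```

With these toy scores only "entry A" passes the threshold. In the pipeline the abstract describes, the flagged entries would then be turned into knowledge triples and passed to a model editing method; that stage is not sketched here.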