Understanding and Mitigating Gender Bias in LLMs via Interpretable Neuron Editing
This paper addresses gender bias in LLMs, a barrier to their safe deployment, and offers a practical solution with incremental improvements over prior methods.
The paper tackles gender bias in large language models (LLMs) by proposing an interpretable neuron editing method that reduces bias while preserving model capabilities and outperforms existing approaches in experiments on five LLMs.
Large language models (LLMs) often exhibit gender bias, posing challenges for their safe deployment. Existing methods to mitigate bias lack a comprehensive understanding of its mechanisms or compromise the model's core capabilities. To address these issues, we propose the CommonWords dataset to systematically evaluate gender bias in LLMs. Our analysis reveals pervasive bias across models and identifies specific neuron circuits, including gender neurons and general neurons, responsible for this behavior. Notably, editing even a small number of general neurons can disrupt the model's overall capabilities due to hierarchical neuron interactions. Based on these insights, we propose an interpretable neuron editing method that combines logit-based and causal-based strategies to selectively target biased neurons. Experiments on five LLMs demonstrate that our method effectively reduces gender bias while preserving the model's original capabilities, outperforming existing fine-tuning and editing approaches. Our findings contribute a novel dataset, a detailed analysis of bias mechanisms, and a practical solution for mitigating gender bias in LLMs.
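The abstract does not spell out how the logit-based and causal-based strategies are computed or combined, so the following is only a toy sketch of the general idea, not the paper's implementation. It assumes a hand-built one-hidden-layer model, a logit-based score taken as the gap between a neuron's output weights for a "he" and a "she" token, a causal score taken as the drop in the he/she logit gap when a neuron is switched off, and a selection rule that keeps neurons ranked in the top k by both scores. All function names, weights, and the intersection rule are hypothetical.

```python
def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * a for w, a in zip(row, vec)) for row in matrix]

def forward(x, w_in, w_out, mask):
    """Toy model: ReLU hidden layer, then one output row per vocabulary token.
    mask[i] = 0.0 switches hidden neuron i off (the 'edit')."""
    hidden = [max(0.0, h) * m for h, m in zip(matvec(w_in, x), mask)]
    return matvec(w_out, hidden)

def bias_gap(x, w_in, w_out, mask, he=0, she=1):
    """Difference between the 'he' and 'she' logits for one prompt vector."""
    logits = forward(x, w_in, w_out, mask)
    return logits[he] - logits[she]

def logit_scores(w_out, he=0, she=1):
    """Assumed logit-based score: how differently a neuron writes to 'he' vs 'she'."""
    return [abs(a - b) for a, b in zip(w_out[he], w_out[she])]

def causal_scores(prompts, w_in, w_out):
    """Assumed causal score: mean reduction in |bias gap| when one neuron is ablated."""
    n = len(w_in)
    full = [1.0] * n
    scores = []
    for i in range(n):
        mask = full[:]
        mask[i] = 0.0
        drop = sum(
            abs(bias_gap(x, w_in, w_out, full)) - abs(bias_gap(x, w_in, w_out, mask))
            for x in prompts
        )
        scores.append(drop / len(prompts))
    return scores

def select_neurons(logit, causal, k):
    """Keep only neurons that rank in the top k under BOTH scores (an assumption)."""
    top_logit = set(sorted(range(len(logit)), key=lambda i: -logit[i])[:k])
    top_causal = set(sorted(range(len(causal)), key=lambda i: -causal[i])[:k])
    return sorted(top_logit & top_causal)

# Hand-built weights: neuron 0 pushes only 'he'; neuron 2 feeds every token
# equally, standing in for a 'general' neuron that should be left alone.
W_IN = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
W_OUT = [
    [2.0, 0.25, 0.5, 0.0],  # token 0: 'he'
    [0.0, 0.25, 0.5, 0.0],  # token 1: 'she'
    [0.0, 1.0, 0.5, 1.0],   # token 2: an unrelated word
]
PROMPTS = [[1.0, 0.0], [1.0, 1.0]]

selected = select_neurons(logit_scores(W_OUT), causal_scores(PROMPTS, W_IN, W_OUT), k=1)
full_mask = [1.0] * len(W_IN)
edit_mask = [0.0 if i in selected else 1.0 for i in range(len(W_IN))]

gaps_before = [bias_gap(x, W_IN, W_OUT, full_mask) for x in PROMPTS]
gaps_after = [bias_gap(x, W_IN, W_OUT, edit_mask) for x in PROMPTS]
other_before = [forward(x, W_IN, W_OUT, full_mask)[2] for x in PROMPTS]
other_after = [forward(x, W_IN, W_OUT, edit_mask)[2] for x in PROMPTS]
print(selected, gaps_before, gaps_after)
```

In this toy setting only neuron 0 is selected, the he/she gap vanishes after the edit, and the logit of the unrelated token is unchanged. That mirrors the stated goal of reducing bias without disturbing other behavior, though a real LLM would need activations from actual prompts and a far more careful selection rule.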