Information Gain-Guided Causal Intervention for Autonomous Debiasing Large Language Models
This work addresses the poor generalizability of LLMs caused by dataset biases, but it is incremental: it builds on prior debiasing methods by combining causal mechanisms with information theory.
The authors tackle dataset biases in large language models (LLMs), which reduce generalizability, by proposing an information gain-guided causal intervention debiasing (ICD) framework. The framework automatically rewrites instruction-tuning data to balance its distribution and reduce the information gain that biases provide about the answers; the authors report improved generalizability across tasks.
Despite significant progress, recent studies indicate that current large language models (LLMs) may still capture dataset biases and utilize them during inference, leading to poor generalizability. However, because dataset biases are diverse and bias suppression based on in-context learning is insufficient, the effectiveness of previous debiasing methods based on prior knowledge, and of automatic debiasing methods based on in-context learning, is limited. To address these challenges, we explore the combination of causal mechanisms with information theory and propose an information gain-guided causal intervention debiasing (ICD) framework. To eliminate biases within the instruction-tuning dataset, it is essential to ensure that these biases do not provide any additional information for predicting the answers, i.e., the information gain of these biases for predicting the answers needs to be 0. Under this guidance, the framework utilizes a causal intervention-based data rewriting method to autonomously balance the distribution of the instruction-tuning dataset and thereby reduce the information gain. Subsequently, it employs a standard supervised fine-tuning process to train LLMs on the debiased dataset. Experimental results show that ICD can effectively debias LLMs and improve their generalizability across different tasks.
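The zero-information-gain criterion in the abstract can be made concrete with a small sketch. It assumes the bias is a known, discrete feature of each example (here a hypothetical flag for whether a negation word appears) and measures information gain as H(answer) - H(answer | bias). The function names and the toy data are illustrative assumptions, not taken from the paper, and the sketch only measures the quantity; it does not reproduce the authors' causal intervention-based rewriting step.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(bias_values, answers):
    """IG(answer; bias) = H(answer) - H(answer | bias)."""
    n = len(answers)
    # Group the answers by the value of the bias feature.
    groups = {}
    for b, y in zip(bias_values, answers):
        groups.setdefault(b, []).append(y)
    # H(answer | bias) is the size-weighted average entropy of the groups.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(answers) - conditional


# Toy data (made up for illustration). bias = 1 if a negation word is present.
# In the biased set, negation co-occurs with "no" three times out of four.
biased_bias = [1, 1, 1, 1, 0, 0, 0, 0]
biased_answers = ["no", "no", "no", "yes", "yes", "yes", "yes", "no"]

# In the balanced set, each bias value appears equally often with each answer,
# so knowing the bias feature tells us nothing about the answer.
balanced_bias = [1, 1, 0, 0]
balanced_answers = ["no", "yes", "no", "yes"]

if __name__ == "__main__":
    print("biased   IG:", information_gain(biased_bias, biased_answers))
    print("balanced IG:", information_gain(balanced_bias, balanced_answers))
```

In the biased toy set the information gain is strictly positive, while in the balanced set it is zero. That zero is the target the abstract describes for the rewritten instruction-tuning data.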