Evaluating and Mitigating Social Bias for Large Language Models in Open-ended Settings
This work addresses the need for more realistic bias evaluation of LLMs in AI safety and fairness applications, though it is incremental because it builds on existing datasets and methods.
The paper tackles the evaluation of social bias in Large Language Models (LLMs) in open-ended settings, where existing benchmarks are limited to predefined formats. It extends the BBQ dataset to Open-BBQ with fill-in-the-blank and short-answer questions, and it proposes Composite Prompting, an in-context learning method that reduces bias by 15-20% for GPT models while maintaining high accuracy.
Current social bias benchmarks for Large Language Models (LLMs) rely primarily on predefined question formats such as multiple-choice, which limits their ability to reflect the complexity and open-ended nature of real-world interactions. To close this gap, we extend the existing BBQ dataset (Parrish et al., 2022) to Open-BBQ, a comprehensive framework for evaluating the social bias of LLMs in open-ended settings that adds two question categories: fill-in-the-blank and short-answer. Because Open-BBQ elicits many open-ended responses in the form of sentences and paragraphs, we develop an evaluation process that detects bias in open-ended content by labeling those sentences and paragraphs. We also find that existing debiasing methods, such as self-debiasing (Gallegos et al., 2024), suffer from over-correction, turning originally correct answers into incorrect ones. To address this issue, we propose Composite Prompting, an In-context Learning (ICL) method that combines structured examples with explicit chain-of-thought reasoning into a unified instruction template, so that LLMs explicitly identify the content that needs debiasing. Experimental results show that the proposed method significantly reduces bias for both GPT-3.5 and GPT-4o while maintaining high accuracy.
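The abstract describes Composite Prompting only at a high level, so the following is a minimal Python sketch of how structured examples and explicit chain-of-thought reasoning might be combined into one instruction template. The `Demo` class, the `INSTRUCTION` wording, the `build_composite_prompt` helper, and the demonstration text are all hypothetical illustrations and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Demo:
    """One hypothetical in-context demonstration."""
    context: str
    question: str
    reasoning: str  # explicit chain-of-thought text
    answer: str


# Hypothetical instruction wording; the paper's actual template may differ.
INSTRUCTION = (
    "Answer the question using only the context. "
    "First reason step by step about whether the context supports a "
    "specific answer. If it does not, reply 'Cannot be determined'."
)

# Hypothetical demonstrations: one ambiguous context and one that is
# disambiguated, so the model sees both when to abstain and when to answer.
EXAMPLE_DEMOS = [
    Demo(
        context="A teenager and a retiree were waiting at the bus stop.",
        question="Who was forgetful?",
        reasoning="The context says nothing about memory, so any "
                  "specific answer would rest on a stereotype.",
        answer="Cannot be determined",
    ),
    Demo(
        context="A teenager and a retiree were waiting at the bus stop. "
                "The teenager had left their wallet at home again.",
        question="Who was forgetful?",
        reasoning="The context states that the teenager forgot the wallet, "
                  "so the answer is supported by the text.",
        answer="The teenager",
    ),
]


def build_composite_prompt(demos, context, question):
    """Join the instruction, worked demos, and the new query into one prompt."""
    parts = [INSTRUCTION]
    for d in demos:
        parts.append(
            f"Context: {d.context}\n"
            f"Question: {d.question}\n"
            f"Reasoning: {d.reasoning}\n"
            f"Answer: {d.answer}"
        )
    # The prompt ends at "Reasoning:" so the model writes its reasoning first.
    parts.append(f"Context: {context}\nQuestion: {question}\nReasoning:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(build_composite_prompt(
        EXAMPLE_DEMOS,
        context="A nurse and an engineer met for lunch.",
        question="Who is bad at math?",
    ))
```

Pairing an ambiguous demonstration with a disambiguated one is a design choice made here for illustration: it is one plausible way to discourage the over-correction mentioned above, where a debiasing prompt pushes the model to abstain even when the context supports an answer.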