Measuring and Mitigating Bias in Code Generated by Large Language Models
For developers and users of LLM-based code generation tools, this work highlights the persistent challenge of bias and the inadequacy of lightweight mitigation strategies.
The paper evaluates bias in code generated by GPT-4o and Gemini using two metrics (CBS and ACR), and tests four mitigation strategies (Few-Shot, Chain-of-Thought, Few-Shot Chain-of-Thought, and Multi-agent). Results show that bias persists across attributes and datasets even after mitigation, indicating that current methods are insufficient.
Large language models (LLMs) are widely recognised for their applications in natural language generation and are increasingly used for code generation tasks. However, concerns about bias in their generated outputs remain significant. This paper focuses on GPT-4o and Gemini, two mainstream tools for code generation, and proposes a framework for evaluating bias in LLM-generated code, specifically examining the influence of protected attributes, prompts, and web-search capability. We use two metrics, the code bias score (CBS) and the attribute change ratio (ACR), to quantify the prevalence of bias and the degree of influence of different attributes, respectively. In addition, we investigate four lightweight strategies for mitigating bias in generated code: Few-Shot, Chain-of-Thought, Few-Shot Chain-of-Thought, and Multi-agent. Our findings reveal that bias remains prevalent across different protected attributes and datasets even after applying these strategies, highlighting the need for more effective approaches to reducing bias in AI-driven code generation systems.
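The abstract names the two metrics but does not give their formulas, so the following is only a minimal sketch under stated assumptions. It assumes that CBS is the fraction of generated snippets flagged as biased, and that ACR is the fraction of test inputs whose output changes when only one protected attribute is altered. Both definitions, along with the names `code_bias_score`, `attribute_change_ratio`, and the toy `biased_approve` function, are hypothetical illustrations and not the paper's implementation.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


def code_bias_score(flags: List[bool]) -> float:
    """Assumed CBS: share of generated snippets flagged as biased."""
    if not flags:
        raise ValueError("need at least one generated snippet")
    return sum(flags) / len(flags)


def attribute_change_ratio(
    func: Callable[[Record], Any],
    inputs: List[Record],
    attribute: str,
    alternatives: List[Any],
) -> float:
    """Assumed ACR: share of inputs whose output changes when only
    `attribute` is swapped for one of `alternatives`."""
    if not inputs:
        raise ValueError("need at least one test input")
    changed = 0
    for record in inputs:
        baseline = func(record)
        for alt in alternatives:
            if alt == record.get(attribute):
                continue  # skip the value the record already has
            if func({**record, attribute: alt}) != baseline:
                changed += 1
                break  # count each input at most once
    return changed / len(inputs)


# Hypothetical stand-in for a piece of LLM-generated code that
# (undesirably) conditions its decision on a protected attribute.
def biased_approve(applicant: Record) -> bool:
    return applicant["income"] > 50000 and applicant["gender"] == "male"


applicants = [
    {"income": 60000, "gender": "male"},
    {"income": 60000, "gender": "female"},
    {"income": 30000, "gender": "male"},
    {"income": 30000, "gender": "female"},
]

cbs = code_bias_score([True, False, True, True])
acr_gender = attribute_change_ratio(
    biased_approve, applicants, "gender", ["male", "female"]
)
print(cbs, acr_gender)
```

Under these assumed definitions, a snippet that never consults the protected attribute would yield an ACR of zero for that attribute, while a higher ACR indicates that the attribute has more influence on the generated code's behaviour.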