CL · SE · Apr 23

From If-Statements to ML Pipelines: Revisiting Bias in Code-Generation

arXiv:2604.21716 · 36.0
AI Analysis

For researchers and practitioners evaluating fairness in code-generation models, this work shows that simple conditional benchmarks dramatically underestimate bias, necessitating more realistic evaluation tasks.

Prior work underestimates bias in code generation by testing only simple conditional statements. When evaluating ML pipeline generation, sensitive attributes appear in 87.7% of cases, far exceeding the 59.2% observed with conditionals, revealing that current benchmarks miss substantial bias in practical deployments.

Prior work evaluates code generation bias primarily through simple conditional statements, which represent only a narrow slice of real-world programming and reveal solely overt, explicitly encoded bias. We demonstrate that this approach dramatically underestimates bias in practice by examining a more realistic task: generating machine learning (ML) pipelines. Testing both code-specialized and general-instruction large language models, we find that generated pipelines exhibit significant bias during feature selection. Sensitive attributes appear in 87.7% of cases on average, despite models demonstrably excluding irrelevant features (e.g., including "race" while dropping "favorite color" for credit scoring). This bias is substantially more prevalent than that captured by conditional statements, where sensitive attributes appear in only 59.2% of cases. These findings are robust across prompt mitigation strategies, varying numbers of attributes, and different pipeline difficulty levels. Our results challenge simple conditionals as valid proxies for bias evaluation and suggest current benchmarks underestimate bias risk in practical deployments.
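The prevalence figures above (87.7% vs. 59.2%) are inclusion rates of sensitive attributes in generated code. As a rough illustration only, the sketch below computes such a rate over a set of generated pipelines, assuming each pipeline has already been reduced to the list of feature names it selects. The attribute set, function names, and toy data are hypothetical, and counting per pipeline (rather than per attribute) is an assumption; the abstract does not specify the paper's exact procedure.

```python
from typing import Iterable

# Hypothetical sensitive-attribute names; the paper's actual list is not given in the abstract.
SENSITIVE = frozenset({"race", "gender", "age", "religion"})


def uses_sensitive(selected_features: Iterable[str],
                   sensitive: frozenset = SENSITIVE) -> bool:
    """Return True if any selected feature name matches a sensitive attribute."""
    return any(name.strip().lower() in sensitive for name in selected_features)


def inclusion_rate(pipelines: list, sensitive: frozenset = SENSITIVE) -> float:
    """Fraction of pipelines whose selected features include a sensitive attribute."""
    if not pipelines:
        return 0.0
    hits = sum(uses_sensitive(features, sensitive) for features in pipelines)
    return hits / len(pipelines)


# Toy data: feature lists as they might be extracted from generated
# credit-scoring pipelines (assumed, not taken from the paper).
generated = [
    ["income", "race", "credit_history"],
    ["income", "credit_history"],
    ["Age", "income"],
    ["income", "employment_length", "gender"],
]

print(f"{inclusion_rate(generated):.1%}")  # 3 of 4 pipelines -> 75.0%
```

Exact name matching is the simplest possible detector; a real evaluation would also need to handle renamed columns and proxies, which this sketch does not attempt.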

