CL LG | Oct 14, 2024

Assessing Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks

Fangru Lin, Shaoguang Mao, Emanuele La Malfa, Valentin Hofmann, Adrian de Wynter, Xun Wang, Si-Qing Chen, Michael Wooldridge, Janet B. Pierrehumbert, Furu Wei

Oxford

arXiv:2410.11005v3 | 10.014 citations | h-index: 14 | Has Code | ACL

Originality: Incremental advance

AI Analysis

This work addresses dialect bias in LLMs: speakers of non-standard dialects such as AAVE receive unfair service in reasoning tasks. It lays a foundation for dialect-bias analysis, though its originality is rated as an incremental advance.

The study introduces ReDial, a benchmark of 1.2K+ parallel query pairs in Standardized English and African American Vernacular English (AAVE), and uses it to assess LLMs on reasoning tasks. It finds that almost all widely used models show significant brittleness and unfairness on AAVE queries.

Language is not monolithic. While benchmarks, including those designed for multiple languages, are often used as proxies to evaluate the performance of Large Language Models (LLMs), they tend to overlook the nuances of within-language variation and thus fail to model the experience of speakers of non-standard dialects. Focusing on African American Vernacular English (AAVE), we present the first study aimed at objectively assessing the fairness and robustness of LLMs in handling dialects across canonical reasoning tasks, including algorithm, math, logic, and integrated reasoning. We introduce ReDial (Reasoning with Dialect Queries), a benchmark containing 1.2K+ parallel query pairs in Standardized English and AAVE. We hire AAVE speakers, including experts with computer science backgrounds, to rewrite seven popular benchmarks, such as HumanEval and GSM8K. With ReDial, we evaluate widely used LLMs, including GPT, Claude, Llama, Mistral, and the Phi model families. Our findings reveal that almost all of these widely used models show significant brittleness and unfairness to queries in AAVE. Our work establishes a systematic and objective framework for analyzing LLM bias in dialectal queries. Moreover, it highlights how mainstream LLMs provide unfair service to dialect speakers in reasoning tasks, laying a critical foundation for future research.
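The parallel-pair design described above lends itself to a simple paired comparison. The following is a minimal sketch, not the authors' evaluation code: the `PairResult` record, the `dialect_gap` helper, and the "flip rate" statistic are hypothetical names chosen for illustration, and the toy outcomes are made up rather than taken from the paper. It assumes each pair has already been scored as correct or incorrect in both dialects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairResult:
    """Outcome for one parallel query pair (hypothetical record layout)."""
    task: str            # e.g. "math" or "algorithm"
    se_correct: bool     # Standardized English query answered correctly
    aave_correct: bool   # AAVE rewrite of the same query answered correctly


def dialect_gap(results):
    """Return (se_accuracy, aave_accuracy, gap, flip_rate) over paired results."""
    if not results:
        raise ValueError("need at least one pair")
    n = len(results)
    se_acc = sum(r.se_correct for r in results) / n
    aave_acc = sum(r.aave_correct for r in results) / n
    # Share of pairs solved in Standardized English but failed in AAVE:
    # an assumed, simple brittleness signal for this sketch.
    flip_rate = sum(r.se_correct and not r.aave_correct for r in results) / n
    return se_acc, aave_acc, se_acc - aave_acc, flip_rate


# Made-up outcomes for illustration only; these are not results from the paper.
toy = [
    PairResult("math", True, True),
    PairResult("math", True, False),
    PairResult("algorithm", True, False),
    PairResult("logic", False, False),
]

se_acc, aave_acc, gap, flip_rate = dialect_gap(toy)
print(f"SE={se_acc:.2f} AAVE={aave_acc:.2f} gap={gap:.2f} flips={flip_rate:.2f}")
```

Because the two queries in a pair differ only in dialect, a positive gap or flip rate in such a comparison points at the dialect rewrite rather than at task difficulty.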
