CL | LG | Oct 14, 2024

Assessing Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks

Oxford
arXiv:2410.11005v3 | 14 citations | h-index: 14 | ACL
Originality: Incremental advance
AI Analysis

This work addresses dialect bias in LLMs: speakers of non-standard dialects such as AAVE receive unfair service on reasoning tasks. It lays a foundation for dialect-bias analysis, though the advance itself is incremental.

The study introduces ReDial, a benchmark of 1.2K+ parallel query pairs in Standardized English and African American Vernacular English (AAVE), to assess LLMs on reasoning tasks. It finds that almost all widely used models show significant brittleness and unfairness to AAVE queries.

Language is not monolithic. While benchmarks, including those designed for multiple languages, are often used as proxies to evaluate the performance of Large Language Models (LLMs), they tend to overlook the nuances of within-language variation and thus fail to model the experience of speakers of non-standard dialects. Focusing on African American Vernacular English (AAVE), we present the first study aimed at objectively assessing the fairness and robustness of LLMs in handling dialects across canonical reasoning tasks, including algorithm, math, logic, and integrated reasoning. We introduce ReDial (Reasoning with Dialect Queries), a benchmark containing 1.2K+ parallel query pairs in Standardized English and AAVE. We hire AAVE speakers, including experts with computer science backgrounds, to rewrite seven popular benchmarks, such as HumanEval and GSM8K. With ReDial, we evaluate widely used LLMs, including GPT, Claude, Llama, Mistral, and the Phi model families. Our findings reveal that almost all of these widely used models show significant brittleness and unfairness to queries in AAVE. Our work establishes a systematic and objective framework for analyzing LLM bias in dialectal queries. Moreover, it highlights how mainstream LLMs provide unfair service to dialect speakers in reasoning tasks, laying a critical foundation for future research.
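
As a rough illustration of what a parallel-query benchmark makes measurable, the sketch below compares a model's accuracy on Standardized English queries with its accuracy on the paired AAVE rewrites. It is not the paper's evaluation code: the `PairedResult` record, the `dialect_gap` function, the metric names, and the toy data are all hypothetical, and the correctness flags are assumed to come from some external grader.

```python
from dataclasses import dataclass


@dataclass
class PairedResult:
    """Outcome for one query pair (hypothetical record layout)."""
    task: str           # e.g. "math", "logic", "algorithm"
    correct_se: bool    # model solved the Standardized English version
    correct_aave: bool  # model solved the AAVE rewrite of the same query


def dialect_gap(results):
    """Summarize accuracy on both dialects over the same set of query pairs."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    acc_se = sum(r.correct_se for r in results) / n
    acc_aave = sum(r.correct_aave for r in results) / n
    # Pairs solved in Standardized English but failed in AAVE.
    flipped = sum(r.correct_se and not r.correct_aave for r in results) / n
    return {
        "acc_se": acc_se,
        "acc_aave": acc_aave,
        "gap": acc_se - acc_aave,
        "flipped": flipped,
    }


# Toy data for illustration only; these are not results from the paper.
toy = [
    PairedResult("math", True, True),
    PairedResult("math", True, False),
    PairedResult("logic", False, False),
    PairedResult("algorithm", True, False),
]
summary = dialect_gap(toy)
print(summary)
```

Because each AAVE query is a rewrite of a specific Standardized English query, the gap is computed over matched pairs, so a difference in accuracy cannot be attributed to a difference in task content.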

Code Implementations: 1 repo