Evaluation of African American Language Bias in Natural Language Generation
This work addresses bias in AI systems affecting African American communities, though its contribution is incremental: it evaluates the bias rather than mitigating it.
The paper evaluates how well large language models (LLMs) understand African American Language (AAL) compared to White Mainstream English (WME), finding performance gaps that suggest bias and a lack of understanding of AAL features.
We evaluate how well LLMs understand African American Language (AAL) in comparison to their performance on White Mainstream English (WME), the encouraged "standard" form of English taught in American classrooms. We measure LLM performance using automatic metrics and human judgments for two tasks: a counterpart generation task, where a model generates AAL (or WME) given WME (or AAL), and a masked span prediction (MSP) task, where models predict a phrase that was removed from their input. Our contributions include: (1) evaluation of six pre-trained, large language models on the two language generation tasks; (2) a novel dataset of AAL text from multiple contexts (social media, hip-hop lyrics, focus groups, and linguistic interviews) with human-annotated counterparts in WME; and (3) documentation of model performance gaps that suggest bias and identification of trends in lack of understanding of AAL features.
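To make the masked span prediction (MSP) setup concrete, the following is a minimal sketch of how an MSP example could be built from a sentence: a contiguous span of words is removed and replaced by a placeholder, and the removed phrase becomes the prediction target. The names `mask_span`, `exact_match`, and `MASK_TOKEN`, the whitespace tokenization, and the exact-match scoring are all assumptions made for illustration; they are not taken from the paper, whose own masking procedure and automatic metrics may differ.

```python
from typing import Tuple

# Hypothetical placeholder string; real models each define their own mask token.
MASK_TOKEN = "<mask>"


def mask_span(text: str, start: int, length: int,
              mask_token: str = MASK_TOKEN) -> Tuple[str, str]:
    """Replace `length` whitespace-separated words, beginning at word index
    `start`, with a single mask token. Returns (masked_text, removed_phrase)."""
    words = text.split()
    if length < 1 or start < 0 or start + length > len(words):
        raise ValueError("span must lie inside the text")
    target = " ".join(words[start:start + length])
    masked = " ".join(words[:start] + [mask_token] + words[start + length:])
    return masked, target


def exact_match(prediction: str, target: str) -> bool:
    """Illustrative scoring only (an assumption, not the paper's metric):
    case-insensitive comparison after trimming surrounding whitespace."""
    return prediction.strip().lower() == target.strip().lower()


if __name__ == "__main__":
    masked, target = mask_span("the quick brown fox jumps", start=1, length=2)
    print(masked)   # the <mask> fox jumps
    print(target)   # quick brown
```

In this framing, a model would receive the masked text and be asked to produce the removed phrase. Comparing scores on AAL inputs with scores on their WME counterparts is one way a performance gap of the kind described above could be measured.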