The Voice Behind the Words: Quantifying Intersectional Bias in SpeechLLMs
This work addresses fairness in speech AI systems for users with diverse accents and gender presentations. Its contribution is incremental in that it quantifies bias rather than mitigating it.
The researchers quantified intersectional accent and gender bias in SpeechLLMs across 2,880 controlled interactions. Eastern European-accented speech received lower helpfulness scores, especially for female-presenting voices, and human evaluators detected sharper disparities than LLM judges.
Speech Large Language Models (SpeechLLMs) process spoken input directly, retaining cues such as accent and perceived gender that were previously removed in cascaded pipelines. This introduces speaker-identity-dependent variation in responses. We present a large-scale intersectional evaluation of accent and gender bias in three SpeechLLMs using 2,880 controlled interactions across six English accents and two gender presentations, keeping linguistic content constant through voice cloning. Using pointwise LLM-judge ratings, pairwise comparisons, and Best-Worst Scaling with human validation, we detect consistent disparities. Eastern European-accented speech receives lower helpfulness scores, particularly for female-presenting voices. The bias is implicit: responses remain polite but differ in helpfulness. While LLM judges capture the directional trend of these biases, human evaluators exhibit significantly higher sensitivity, uncovering sharper intersectional disparities.
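As an illustration of how Best-Worst Scaling judgments can be turned into per-condition scores, the following is a minimal sketch using the standard counting estimator: (times chosen best minus times chosen worst) divided by times shown. The abstract does not specify the scoring procedure actually used, so the estimator, the function name `bws_scores`, the trial format, and the condition labels are all assumptions made for illustration, and the toy judgments are invented rather than taken from the study.

```python
from collections import Counter


def bws_scores(trials):
    """Counting-based Best-Worst Scaling scores (hypothetical helper).

    trials: iterable of (items, best, worst), where `items` is the tuple of
    options shown in one trial and `best`/`worst` are the annotator's picks.
    Returns a dict mapping each item to (best - worst) / appearances,
    a value in the range [-1, 1].
    """
    best, worst, seen = Counter(), Counter(), Counter()
    for items, b, w in trials:
        if b not in items or w not in items:
            raise ValueError("best/worst picks must be among the shown items")
        seen.update(items)  # every shown item counts as one appearance
        best[b] += 1
        worst[w] += 1
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}


# Invented toy judgments over four placeholder conditions (not study data).
trials = [
    (("cond_A", "cond_B", "cond_C", "cond_D"), "cond_A", "cond_D"),
    (("cond_A", "cond_B", "cond_C", "cond_D"), "cond_B", "cond_D"),
    (("cond_A", "cond_B", "cond_C", "cond_D"), "cond_A", "cond_C"),
]
scores = bws_scores(trials)
for condition, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{condition}: {score:+.2f}")
```

In a setup like the one described, each placeholder condition would correspond to one accent and gender pairing, and a lower score would indicate that responses to that pairing were more often judged least helpful.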