VIBE: Voice-Induced open-ended Bias Evaluation for Large Audio-Language Models via Real-World Speech
For researchers and developers of LALMs, this work provides a more realistic bias evaluation framework, highlighting that current models reproduce social stereotypes in open-ended tasks.
VIBE evaluates generative bias in Large Audio-Language Models using open-ended tasks with real-world speech. It finds that gender cues often cause larger distributional shifts than accent cues, indicating that the models systematically reproduce social stereotypes.
Large Audio-Language Models (LALMs) are increasingly integrated into daily applications, yet their generative biases remain underexplored. Existing speech fairness benchmarks rely on synthetic speech and Multiple-Choice Questions (MCQs), both of which offer only a fragmented view of fairness. We propose VIBE, a framework that evaluates generative bias through open-ended tasks such as personalized recommendations, using real-world human recordings. Because it requires no predefined answer options, our method lets stereotypical associations manifest organically and extends easily to new tasks. Evaluating 11 state-of-the-art LALMs reveals systematic biases in realistic scenarios. We find that gender cues often trigger larger distributional shifts than accent cues, indicating that current LALMs reproduce social stereotypes.
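To make the notion of a "distributional shift" concrete, here is a minimal sketch of how one might compare the categories a model produces for two speaker groups. The abstract does not state which metric VIBE uses, so Jensen-Shannon divergence is an assumption here, and the function names and the toy category labels are hypothetical illustrations, not data or code from the paper.

```python
import math
from collections import Counter


def distribution(responses, vocabulary):
    """Turn a list of categorical responses into probabilities over `vocabulary`."""
    if not responses:
        raise ValueError("responses must not be empty")
    counts = Counter(responses)
    total = len(responses)
    return [counts[label] / total for label in vocabulary]


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2): 0 for identical, 1 for disjoint distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def shift_between_groups(responses_a, responses_b):
    """Assumed shift measure: JS divergence between two groups' response categories."""
    vocabulary = sorted(set(responses_a) | set(responses_b))
    p = distribution(responses_a, vocabulary)
    q = distribution(responses_b, vocabulary)
    return js_divergence(p, q)


if __name__ == "__main__":
    # Hypothetical toy labels standing in for categorized open-ended outputs
    # given the same spoken request from two speaker groups.
    group_a = ["engineering", "engineering", "finance", "nursing"]
    group_b = ["nursing", "nursing", "teaching", "engineering"]
    print(f"shift = {shift_between_groups(group_a, group_b):.3f}")
```

Under this assumed setup, one would compute the shift once for a pair of gender groups and once for a pair of accent groups on the same prompts, then compare the two values. Mapping free-form model outputs to categories is a separate step that this sketch does not cover.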