CL · AI · Sep 27, 2025

Evaluating Bias in Spoken Dialogue LLMs for Real-World Decisions and Recommendations

arXiv:2510.02352v1 · 4 citations · h-index: 4 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses fairness issues in audio-based interactive systems for real-world decision-making and recommendation tasks, providing the first systematic study of biases in end-to-end spoken dialogue models.

The paper systematically evaluates biases in spoken dialogue LLMs, finding that closed-source models generally have lower bias while open-source models are more sensitive to age and gender, with recommendation tasks amplifying cross-group disparities and biases persisting in multi-turn conversations.

While biases in large language models (LLMs), such as stereotypes and cultural tendencies in outputs, have been examined and identified, their presence and characteristics in spoken dialogue models (SDMs) with audio input and output remain largely unexplored. Paralinguistic features, such as age, gender, and accent, can affect model outputs; when compounded by multi-turn conversations, these effects may exacerbate biases, with potential implications for fairness in decision-making and recommendation tasks. In this paper, we systematically evaluate biases in speech LLMs and study the impact of multi-turn dialogues with repeated negative feedback. Bias is measured using the Group Unfairness Score (GUS) for decisions and the similarity-based normalized statistics rate (SNSR) for recommendations, across open-source models such as Qwen2.5-Omni and GLM-4-Voice and closed-source APIs such as GPT-4o Audio and Gemini-2.5-Flash. Our analysis reveals that closed-source models generally exhibit lower bias, while open-source models are more sensitive to age and gender, and that recommendation tasks tend to amplify cross-group disparities. We also find that biased decisions may persist in multi-turn conversations. This work provides the first systematic study of biases in end-to-end spoken dialogue models, offering insights towards fair and reliable audio-based interactive systems. To facilitate further research, we release the FairDialogue dataset and evaluation code.
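The abstract names GUS as the decision-bias metric but does not define it here. As a rough illustration of the kind of cross-group disparity such a score is meant to capture, the sketch below computes a simple stand-in: the gap between the highest and lowest positive-decision rate across speaker groups. This is an assumption for illustration only, not the paper's GUS formula; the function names (`positive_rates`, `rate_gap`) and the toy records are hypothetical.

```python
from collections import defaultdict


def positive_rates(records):
    """Map each group label to the fraction of positive decisions it received.

    `records` is an iterable of (group, decision) pairs, where decision is
    1 (positive, e.g. "approve") or 0 (negative).
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for group, decision in records:
        counts[group][0] += int(decision)
        counts[group][1] += 1
    return {group: pos / total for group, (pos, total) in counts.items()}


def rate_gap(records):
    """Stand-in disparity measure (NOT the paper's GUS): max minus min rate."""
    rates = positive_rates(records)
    return max(rates.values()) - min(rates.values())


# Toy data: the same request voiced by speakers from two hypothetical age groups.
toy_records = [
    ("young", 1), ("young", 1),
    ("old", 1), ("old", 0),
]
gap = rate_gap(toy_records)
```

A gap of 0 would mean every group received positive decisions at the same rate; larger values indicate larger disparity between the best- and worst-treated groups.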
