CLSep 25, 2025

VoiceBBQ: Investigating Effect of Content and Acoustics in Social Bias of Spoken Language Model

Junhyuk Choi, Ro-hoon Oh, Jihwan Seol, Bugeun Kim

arXiv:2509.21108v14 citationsh-index: 2EMNLP

Originality Synthesis-oriented

AI Analysis

This provides a compact testbed for diagnosing bias in spoken language models, addressing a domain-specific problem for researchers and developers in speech AI.

The authors tackled the problem of measuring social bias in spoken language models by introducing VoiceBBQ, a dataset that extends a text benchmark to speech, enabling evaluation of bias from content and acoustic aspects. They found that LLaMA-Omni resists acoustic bias but amplifies gender and accent bias, while Qwen2-Audio dampens these cues while preserving content fidelity.

We introduce VoiceBBQ, a spoken extension of the BBQ (Bias Benchmark for Question Answering) - a dataset that measures social bias by presenting ambiguous or disambiguated contexts followed by questions that may elicit stereotypical responses. Due to the nature of speech, social bias in Spoken Language Models (SLMs) can emerge from two distinct sources: 1) content aspect and 2) acoustic aspect. The dataset converts every BBQ context into controlled voice conditions, enabling per-axis accuracy, bias, and consistency scores that remain comparable to the original text benchmark. Using VoiceBBQ, we evaluate two SLMs - LLaMA-Omni and Qwen2-Audio - and observe architectural contrasts: LLaMA-Omni resists acoustic bias while amplifying gender and accent bias, whereas Qwen2-Audio substantially dampens these cues while preserving content fidelity. VoiceBBQ thus provides a compact, drop-in testbed for jointly diagnosing content and acoustic bias across spoken language models.

View on arXiv PDF

Similar