When Audio-LLMs Don't Listen: A Cross-Linguistic Study of Modality Arbitration

arXiv:2602.11488v2
1.12 citations
h-index: 12

Originality: Incremental advance

AI Analysis

This work identifies modality arbitration as a distinct reliability issue in audio-LLMs, one that varies across languages and models. The advance is incremental because it analyzes an existing problem rather than introducing a new one.

The study finds that speech-enabled language models follow text about 10 times more often under audio-text conflict than under text-text conflict: Gemini 2.0 Flash shows 16.6% text dominance in audio-text conflicts versus 1.6% in text-text conflicts. Audio quality does not explain the gap, since audio-only accuracy (97.2%) exceeds cascade accuracy (93.9%), which indicates that audio embeddings preserve more information than text transcripts.

When audio and text conflict, speech-enabled language models follow the text 10 times more often than when arbitrating between two text sources, even when explicitly instructed to trust the audio. Using ALME, a benchmark of 57,602 controlled audio-text conflict stimuli across 8 languages, we find that Gemini 2.0 Flash exhibits 16.6% text dominance under audio-text conflict versus 1.6% under text-text conflict with identical reliability cues. This gap is not explained by audio quality: audio-only accuracy (97.2%) exceeds cascade accuracy (93.9%), indicating audio embeddings preserve more information than text transcripts. We propose that text dominance reflects an asymmetry not in information content but in arbitration accessibility: how easily the model can reason over competing representations. This framework explains otherwise puzzling findings. Forcing transcription before answering increases text dominance (19% to 33%), sacrificing audio's information advantage without improving accessibility. Framing text as "deliberately corrupted" reduces text dominance by 80%. A fine-tuning ablation provides interventional evidence: training only the audio projection layer increases text dominance (+26.5%), while LoRA on the language model halves it (-23.9%), localizing text dominance to the LLM's reasoning rather than the audio encoder. Experiments across four state-of-the-art audio-LLMs and 8 languages show consistent trends with substantial cross-linguistic and cross-model variation, establishing modality arbitration as a distinct reliability dimension not captured by standard speech benchmarks.
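
To make the headline "10 times" figure concrete, the following is a minimal sketch of how a text dominance rate and the audio-text versus text-text ratio could be computed. It assumes text dominance is the fraction of conflict trials in which the model's answer follows the text source; the abstract does not spell out the exact metric, so the `Trial` type, the `followed` labels, and the function name are hypothetical. Only the 16.6% and 1.6% rates come from the abstract.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Trial:
    # Hypothetical label for which source the model's answer agreed with:
    # "audio", "text", or "other" (agreed with neither).
    followed: str


def text_dominance_rate(trials: Sequence[Trial]) -> float:
    """Fraction of conflict trials whose answer followed the text source.

    This definition is an assumption made for illustration, not the
    paper's confirmed metric.
    """
    if not trials:
        raise ValueError("need at least one trial")
    return sum(t.followed == "text" for t in trials) / len(trials)


# Toy data (made up): 2 of 8 answers follow the text source.
toy_trials = [Trial("text"), Trial("audio"), Trial("audio"), Trial("text"),
              Trial("audio"), Trial("other"), Trial("audio"), Trial("audio")]
toy_rate = text_dominance_rate(toy_trials)

# Rates reported in the abstract for Gemini 2.0 Flash.
audio_text_rate = 0.166  # audio-text conflict
text_text_rate = 0.016   # text-text conflict, identical reliability cues

# Ratio behind the "10 times more often" claim (roughly 10x).
dominance_ratio = audio_text_rate / text_text_rate
```

Comparing against the text-text baseline is what separates a modality effect from a general tendency to follow one of two conflicting sources, since both conditions use the same reliability cues.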
