AS CL SD | May 19, 2025

SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information

Chih-Kai Yang, Neo Ho, Yen-Ting Piao, Hung-yi Lee

arXiv:2505.13237v3 | 16.739 citations | h-index: 10 | Has Code | INTERSPEECH

Originality: Synthesis-oriented

AI Analysis

The paper addresses a critical limitation in multimodal AI, though its contribution is incremental: it evaluates the problem rather than solving it.

The paper tackles the underexplored multi-hop reasoning abilities of large audio-language models (LALMs) by introducing the SAKURA benchmark. It finds that LALMs struggle to integrate speech/audio representations for multi-hop reasoning, which the authors describe as a fundamental challenge.

Large audio-language models (LALMs) extend large language models with multimodal understanding of speech, audio, and other modalities. While their performance on speech and audio-processing tasks has been extensively studied, their reasoning abilities remain underexplored. In particular, their multi-hop reasoning, the ability to recall and integrate multiple facts, lacks systematic evaluation. Existing benchmarks focus on general speech and audio-processing tasks, conversational abilities, and fairness, but overlook this aspect. To bridge this gap, we introduce SAKURA, a benchmark assessing LALMs' multi-hop reasoning based on speech and audio information. Results show that LALMs struggle to integrate speech/audio representations for multi-hop reasoning, even when they extract the relevant information correctly, highlighting a fundamental challenge in multimodal reasoning. Our findings expose a critical limitation in LALMs, offering insights and resources for future research.
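The gap the abstract describes, between extracting information from audio and reasoning over it, can be made concrete with a small scoring sketch. The code below is only an illustration under stated assumptions: the `PairedResult` record, the `summarize` function, and the toy outcomes are hypothetical and are not taken from the SAKURA code or its reported results. It assumes each audio clip is paired with a single-hop question (identify an attribute from the audio) and a multi-hop question (answer something that depends on that attribute).

```python
from dataclasses import dataclass


@dataclass
class PairedResult:
    """Outcome for one audio clip (hypothetical record layout)."""
    single_hop_correct: bool  # model identified the attribute in the audio
    multi_hop_correct: bool   # model answered the follow-up that needs that attribute


def summarize(results):
    """Return single-hop accuracy, multi-hop accuracy, and multi-hop accuracy
    restricted to clips whose single-hop answer was correct."""
    if not results:
        raise ValueError("results must be non-empty")
    n = len(results)
    single = sum(r.single_hop_correct for r in results) / n
    multi = sum(r.multi_hop_correct for r in results) / n
    # Keep only clips where the information was extracted correctly.
    extracted = [r for r in results if r.single_hop_correct]
    conditional = (
        sum(r.multi_hop_correct for r in extracted) / len(extracted)
        if extracted
        else float("nan")
    )
    return {
        "single_hop": single,
        "multi_hop": multi,
        "multi_hop_given_single": conditional,
    }


# Made-up outcomes for illustration only; these are not results from the paper.
toy = [
    PairedResult(True, True),
    PairedResult(True, False),
    PairedResult(True, False),
    PairedResult(False, False),
]
summary = summarize(toy)
print(summary)
```

A low `multi_hop_given_single` value alongside a high `single_hop` value would correspond to the pattern the abstract reports: the model perceives the relevant information but fails to use it in a further reasoning step.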

