AS, CL, SD · May 19, 2025

SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information

arXiv:2505.13237v3 · 33 citations · h-index: 10 · INTERSPEECH
Originality: Synthesis-oriented
AI Analysis

This work addresses a critical limitation in multimodal AI that matters to researchers, though its contribution is incremental: it evaluates the problem rather than solving it.

The paper tackles the underexplored multi-hop reasoning abilities of large audio-language models (LALMs) by introducing the SAKURA benchmark. It finds that LALMs struggle to integrate speech/audio representations for multi-hop reasoning, which points to a fundamental challenge in multimodal reasoning.

Large audio-language models (LALMs) extend the large language models with multimodal understanding in speech, audio, etc. While their performances on speech and audio-processing tasks are extensively studied, their reasoning abilities remain underexplored. Particularly, their multi-hop reasoning, the ability to recall and integrate multiple facts, lacks systematic evaluation. Existing benchmarks focus on general speech and audio-processing tasks, conversational abilities, and fairness but overlook this aspect. To bridge this gap, we introduce SAKURA, a benchmark assessing LALMs' multi-hop reasoning based on speech and audio information. Results show that LALMs struggle to integrate speech/audio representations for multi-hop reasoning, even when they extract the relevant information correctly, highlighting a fundamental challenge in multimodal reasoning. Our findings expose a critical limitation in LALMs, offering insights and resources for future research.
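
The central finding, that models fail at multi-hop questions even when they extract the relevant information correctly, can be made concrete with a small scoring sketch. The code below is not the SAKURA evaluation code; it is a minimal illustration that assumes each audio clip comes with one single-hop question (identify an attribute from the audio) and one paired multi-hop question (answer something that depends on that attribute). The record layout `PairedResult`, the function `summarize`, and the toy outcomes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PairedResult:
    """Outcome for one audio clip (hypothetical record layout)."""
    single_hop_correct: bool  # model identified the attribute from the audio
    multi_hop_correct: bool   # model answered the question that builds on it


def summarize(results):
    """Return single-hop accuracy, multi-hop accuracy, and multi-hop accuracy
    restricted to clips whose single-hop answer was correct."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    single = sum(r.single_hop_correct for r in results) / n
    multi = sum(r.multi_hop_correct for r in results) / n
    # Keep only clips where extraction succeeded, so that any remaining
    # multi-hop errors cannot be blamed on perception.
    extracted = [r for r in results if r.single_hop_correct]
    conditional = (
        sum(r.multi_hop_correct for r in extracted) / len(extracted)
        if extracted
        else float("nan")
    )
    return {
        "single_hop": single,
        "multi_hop": multi,
        "multi_hop_given_single": conditional,
    }


# Made-up outcomes for illustration only; these are not results from the paper.
toy = [
    PairedResult(True, True),
    PairedResult(True, False),
    PairedResult(True, False),
    PairedResult(False, False),
]
summary = summarize(toy)
print(summary)
```

Under these assumptions, a large gap between `single_hop` and `multi_hop_given_single` would correspond to the pattern the abstract describes: the information is extracted, but not integrated into the subsequent reasoning step.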

Code Implementations: 2 repos