SD LG Feb 12

Echo: Towards Advanced Audio Comprehension via Audio-Interleaved Reasoning

arXiv:2602.11909v2, 1 citation, h-index: 23, Has Code
AI Analysis

This paper addresses limited audio comprehension in AI systems and offers a new approach for applications that require advanced audio analysis, though its improvement over existing LALM methods is incremental.

The paper tackles the information bottleneck in Large Audio Language Models (LALMs) by proposing audio-interleaved reasoning, which lets the model dynamically re-listen to audio during reasoning. The resulting model, Echo, achieves overall superiority on audio comprehension benchmarks.

The maturation of Large Audio Language Models (LALMs) has raised growing expectations for them to comprehend complex audio much like humans. Current efforts primarily replicate text-based reasoning by contextualizing audio content through a one-time encoding, which introduces a critical information bottleneck. Drawing inspiration from human cognition, we propose audio-interleaved reasoning to break through this bottleneck. It treats audio as an active reasoning component, enabling sustained audio engagement and perception-grounded analysis. To instantiate it, we introduce a two-stage training framework, first teaching LALMs to localize salient audio segments through supervised fine-tuning, and then incentivizing proficient re-listening via reinforcement learning. In parallel, a structured data generation pipeline is developed to produce high-quality training data. Consequently, we present Echo, a LALM capable of dynamically re-listening to audio on demand during reasoning. On audio comprehension benchmarks, Echo achieves overall superiority in both challenging expert-level and general-purpose tasks. Comprehensive analysis further confirms the efficiency and generalizability of audio-interleaved reasoning, establishing it as a promising direction for advancing audio comprehension. Project page: https://github.com/wdqqdw/Echo.
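
The re-listening idea described in the abstract can be pictured as a loop in which the model's own output requests an audio segment, and that segment is cropped and appended to the context before reasoning continues. The following is only a minimal sketch under stated assumptions, not Echo's implementation: the `<relisten start=... end=... />` tag, the `model_step` callable, and the list-of-tuples context are all hypothetical names invented for illustration, since the abstract does not specify the interface.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical tag format (an assumption): the abstract does not say how
# the model signals which segment it wants to hear again.
SEGMENT_TAG = re.compile(r"<relisten\s+start=([\d.]+)\s+end=([\d.]+)\s*/>")

Context = List[Tuple[str, object]]  # ("text", str) or ("audio", samples)


def crop(samples: List[float], sample_rate: int,
         start_s: float, end_s: float) -> List[float]:
    """Return the samples between start_s and end_s, clamped to the clip."""
    lo = max(0, int(start_s * sample_rate))
    hi = min(len(samples), int(end_s * sample_rate))
    return samples[lo:hi]


def interleaved_reasoning(
    model_step: Callable[[Context], str],  # hypothetical model interface
    samples: List[float],
    sample_rate: int,
    question: str,
    max_rounds: int = 4,
) -> Tuple[str, List[Tuple[float, float]]]:
    """Run reasoning rounds, feeding back any segment the model asks for."""
    # One-time encoding: the full clip enters the context once.
    context: Context = [("text", question), ("audio", samples)]
    relistened: List[Tuple[float, float]] = []
    output = ""
    for _ in range(max_rounds):
        output = model_step(context)
        context.append(("text", output))
        match = SEGMENT_TAG.search(output)
        if match is None:
            # No re-listen request: treat this output as the final answer.
            break
        start_s, end_s = float(match.group(1)), float(match.group(2))
        relistened.append((start_s, end_s))
        # Re-listening: the requested segment is added as new audio input.
        context.append(("audio", crop(samples, sample_rate, start_s, end_s)))
    return output, relistened
```

In this sketch the loop stops either when the model emits no further request or when `max_rounds` is reached; how Echo actually bounds or rewards re-listening is determined by its training procedure and is not modeled here.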

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
