AS · SD · May 26

Why Can't They Remember? Uncovering Representation and Retrieval Bottlenecks in Multi-Turn Acoustic Memory

Yang Xiao, Siyi Wang, Han Yin, Hong Jia, Vidhyasaharan Sethu, Eun-Jung Holden, Ting Dang

arXiv:2605.27039 · 58.0

Predicted impact: top 59% in AS · last 90 days
Originality: Incremental advance

AI Analysis

For researchers working on long-context audio language models, this work pinpoints a specific failure mode (representational drift) in acoustic memory, offering a diagnostic framework for future improvements.

The paper identifies representational trajectory drift as the primary cause of poor non-speech acoustic memory in large audio language models, while attention allocation plays a minor role. This finding is based on a new benchmark, EnvMem, and post-hoc intervention analysis.

Large audio language models (LALMs) process both speech and environmental acoustic cues, yet struggle to retain non-speech information across multi-turn interactions. The performance gap between semantic (speech) and acoustic (non-speech) understanding remains poorly understood, and the underlying mechanisms of representation and retrieval are still unclear. This work introduces EnvMem, a controlled multi-turn benchmark designed to study this gap and identify the root causes of failures at the representation level (i.e., latent embeddings) and the retrieval level (i.e., attention allocation). We further conduct post-hoc interventions to probe representational structure and attention dynamics. Our results reveal representational trajectory drift as the key failure mode, while showing that attention allocation plays a limited role in explaining the observed degradation. Overall, we provide a systematic framework for analyzing and improving non-linguistic memory in long-context LALMs, informing future data and training design for robust acoustic memory modeling.
