AS · SD · May 26

Why Can't They Remember? Uncovering Representation and Retrieval Bottlenecks in Multi-Turn Acoustic Memory

arXiv:2605.27039 · 58.0
Predicted impact: top 59% in AS · last 90 days
Originality: Incremental advance
AI Analysis

For researchers working on long-context audio language models, this work pinpoints a specific failure mode (representational drift) in acoustic memory, offering a diagnostic framework for future improvements.

The paper identifies representational trajectory drift as the primary cause of poor non-speech acoustic memory in large audio language models, while attention allocation plays a minor role. This finding is based on a new benchmark, EnvMem, and post-hoc intervention analysis.

Large audio language models (LALMs) process both speech and environmental acoustic cues, yet struggle to retain non-speech information across multi-turn interactions. The performance gap between semantic (speech) and acoustic (non-speech) understanding remains poorly understood, and the underlying mechanisms of representation and retrieval are still unclear. This work introduces EnvMem, a controlled multi-turn benchmark designed to study this gap and identify the root causes of failures at the representation (i.e., latent embeddings) and retrieval levels (i.e., attention allocation). We further conduct post-hoc interventions to probe representational structure and attention dynamics. Our results reveal representational trajectory drift as the key failure mode, while showing that attention allocation plays a limited role in explaining the observed degradation. Overall, we provide a systematic framework for analyzing and improving non-linguistic memory in long-context LALMs, shedding light on future data and training design for robust acoustic memory modeling.

