CL, AI, LG, AS | Mar 27

Distilling Conversations: Abstract Compression of Conversational Audio Context for LLM-based ASR

arXiv:2603.26246 | 64.3 | h-index: 17
AI Analysis

This work addresses the computational cost of conditioning LLM-based ASR on conversational context, though the contribution is incremental because it builds on existing multi-turn training methods.

The paper tackles the problem of leveraging conversational context in LLM-based ASR by proposing Abstract Compression, which replaces raw audio from prior turns with learned latent tokens to reduce computational cost. The compressed model recovers part of the performance gains of raw-context conditioning on in-domain and out-of-domain test sets.

Standard LLM-based speech recognition systems typically process utterances in isolation, limiting their ability to leverage conversational context. In this work, we study whether multimodal context from prior turns improves LLM-based ASR and how to represent that context efficiently. We find that, after supervised multi-turn training, conversational context mainly helps with the recognition of contextual entities. However, conditioning on raw context is expensive because the prior-turn audio token sequence grows rapidly with conversation length. To address this, we propose Abstract Compression, which replaces the audio portion of prior turns with a fixed number of learned latent tokens while retaining corresponding transcripts explicitly. On both in-domain and out-of-domain test sets, the compressed model recovers part of the gains of raw-context conditioning with a smaller prior-turn audio footprint. We also provide targeted analyses of the compression setup and its trade-offs.
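To make the footprint argument concrete, the following is a minimal sketch of how the prior-turn audio token count might compare between raw-context conditioning and a fixed latent budget. The turn durations, the audio token rate, and the number of latents are all illustrative assumptions and are not taken from the paper; the sketch also assumes the fixed budget applies per prior turn, which the abstract does not specify. Transcripts are retained in both settings, so they are left out of the count.

```python
def raw_audio_tokens(turn_durations_s, tokens_per_second):
    """Prior-turn audio tokens when every turn's audio is kept in full."""
    return sum(int(d * tokens_per_second) for d in turn_durations_s)


def compressed_audio_tokens(turn_durations_s, latents_per_turn):
    """Prior-turn audio tokens when each turn is replaced by a fixed latent budget."""
    return len(turn_durations_s) * latents_per_turn


# Hypothetical conversation: durations (in seconds) of four prior turns.
prior_turns = [4.0, 7.0, 3.0, 10.0]
TOKENS_PER_SECOND = 25  # assumed audio encoder output rate, not from the paper
LATENTS_PER_TURN = 16   # assumed compression budget, not from the paper

raw = raw_audio_tokens(prior_turns, TOKENS_PER_SECOND)
compressed = compressed_audio_tokens(prior_turns, LATENTS_PER_TURN)
print(f"raw: {raw} tokens, compressed: {compressed} tokens")
```

Under these assumptions the raw count scales with total audio duration, while the compressed count scales only with the number of prior turns.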

