FastSLM: Hierarchical Frame Q-Former for Effective Speech Modality Adaptation

Junseok Lee, Sangyong Lee, Chang-Jae Chun

arXiv:2601.06199v2, 1 citation, h-index: 1

Originality: Incremental advance

AI Analysis

FastSLM addresses the computational inefficiency of long-form speech processing, offering AI researchers and developers an incremental improvement in token compression.

The paper tackles the input-token bottleneck in scaling Multimodal Large Language Models (MLLMs) to long-form speech. It proposes FastSLM, whose Hierarchical Frame Querying Transformer (HFQ-Former) compresses speech to 1.67 tokens per second, a 93% reduction relative to standard frame-level adapters, while remaining competitive on long-form benchmarks at lower computational cost.

Although Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in vision, language, and video understanding tasks, scaling them to long-form speech remains a critical bottleneck due to the explosive growth of input tokens. Existing speech-language models typically project high-frame-rate acoustic features directly into the LLM input space, rendering long-context processing computationally prohibitive as audio duration increases. In this paper, we present FastSLM, a token-efficient architecture designed to overcome this scalability limit through extreme temporal compression. At its core is the Hierarchical Frame Querying Transformer (HFQ-Former), which progressively distills local acoustic details into compact, semantically rich representations across multiple temporal scales. This hierarchical abstraction reduces the speech representation rate to just 1.67 tokens per second, achieving a 93 percent reduction in tokens compared to standard frame-level adapters, while preserving the critical context required for complex reasoning. Experimental results demonstrate that FastSLM achieves competitive performance with state-of-the-art models on long-form benchmarks, despite operating with significantly lower FLOPs and parameter counts. Our findings establish that extreme token compression is a viable pathway to making real-time, long-context speech understanding feasible for LLMs, even under strict computational constraints. The source code and model checkpoints are available at https://anonymous.4open.science/r/FastSLM-8BD3
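To make the compression figures concrete, the arithmetic can be sketched as below. This is an illustrative calculation, not code from the FastSLM repository. The abstract gives only the 1.67 tokens-per-second rate and the 93 percent reduction; the 25 tokens-per-second baseline for a frame-level adapter is an assumption, chosen because it is roughly consistent with those two numbers. The function names are hypothetical.

```python
# Back-of-the-envelope token budget for long-form speech.
# FASTSLM_RATE comes from the abstract; ASSUMED_BASELINE_RATE is an
# assumption (the abstract does not state the baseline frame rate).

FASTSLM_RATE = 1.67            # tokens per second, from the abstract
ASSUMED_BASELINE_RATE = 25.0   # tokens per second, assumed frame-level adapter


def tokens_for_audio(duration_s: float, tokens_per_second: float) -> int:
    """Approximate number of speech tokens passed to the LLM for one clip."""
    return round(duration_s * tokens_per_second)


def reduction(baseline_rate: float, compressed_rate: float) -> float:
    """Fraction of tokens removed relative to the baseline rate."""
    return 1.0 - compressed_rate / baseline_rate


def implied_baseline(compressed_rate: float, claimed_reduction: float) -> float:
    """Baseline rate implied by a compressed rate and a claimed reduction."""
    return compressed_rate / (1.0 - claimed_reduction)


if __name__ == "__main__":
    for minutes in (0.5, 10, 60):
        seconds = minutes * 60
        base = tokens_for_audio(seconds, ASSUMED_BASELINE_RATE)
        fast = tokens_for_audio(seconds, FASTSLM_RATE)
        print(f"{minutes:>5} min: baseline {base:>6} tokens, compressed {fast:>5} tokens")

    print(f"reduction at assumed baseline: {reduction(ASSUMED_BASELINE_RATE, FASTSLM_RATE):.1%}")
    print(f"baseline implied by a 93% reduction: {implied_baseline(FASTSLM_RATE, 0.93):.1f} tokens/s")
```

Under the assumed baseline, a ten-minute recording drops from about 15,000 speech tokens to about 1,000. Because self-attention cost grows with the square of sequence length, the saving on the attention term alone would be larger than the token ratio suggests, although total FLOPs also depend on components this sketch ignores.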
