ASCL · Jun 4, 2025

Towards Efficient Speech-Text Jointly Decoding within One Speech Language Model

arXiv:2506.04518v2 · 1 citation · h-index: 8
Originality: Incremental advance
AI Analysis

This work addresses efficiency and alignment issues in speech-text joint decoding for spoken dialogue systems, representing an incremental improvement.

The paper tackles inefficient speech-text joint decoding in speech language models by comparing decoding paradigms and proposing an early-stop interleaved (ESI) pattern, which accelerates decoding and slightly improves performance.

Speech language models (Speech LMs) enable end-to-end speech-text modelling within a single model, offering a promising direction for spoken dialogue systems. The choice of joint speech-text decoding paradigm plays a critical role in performance, efficiency, and alignment quality. In this work, we systematically compare representative joint speech-text decoding strategies, including the interleaved and parallel generation paradigms, under a controlled experimental setup using the same base language model, speech tokenizer, and training data. Our results show that the interleaved approach achieves the best alignment. However, it suffers from slow inference because of its long token sequences. To address this, we propose a novel early-stop interleaved (ESI) pattern that not only significantly accelerates decoding but also yields slightly better performance. Additionally, we curate high-quality question answering (QA) datasets to further improve speech QA performance.
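To make the sequence-length trade-off concrete, here is a toy sketch of how the two compared paradigms could lay out tokens. It is an illustration under stated assumptions, not the paper's implementation: the function names, the chunk sizes (1 text token per 3 speech tokens), and the `<pad>` symbol are all hypothetical, and the ESI pattern is not sketched because the abstract does not describe its mechanics.

```python
from typing import List, Tuple


def interleave(text: List[str], speech: List[str],
               text_chunk: int = 1, speech_chunk: int = 3) -> List[str]:
    """Interleaved layout: one stream alternating text and speech chunks.

    Chunk sizes are illustrative assumptions, not values from the paper.
    """
    out: List[str] = []
    t = s = 0
    while t < len(text) or s < len(speech):
        out.extend(text[t:t + text_chunk])
        t += text_chunk
        out.extend(speech[s:s + speech_chunk])
        s += speech_chunk
    return out


def parallel(text: List[str], speech: List[str],
             pad: str = "<pad>") -> List[Tuple[str, str]]:
    """Parallel layout: each step emits a (text, speech) pair.

    The shorter stream is padded with a hypothetical pad symbol.
    """
    steps = max(len(text), len(speech))
    return [
        (text[i] if i < len(text) else pad,
         speech[i] if i < len(speech) else pad)
        for i in range(steps)
    ]


text_tokens = ["t0", "t1"]
speech_tokens = ["s0", "s1", "s2", "s3", "s4", "s5"]

interleaved_seq = interleave(text_tokens, speech_tokens)
parallel_seq = parallel(text_tokens, speech_tokens)

# The interleaved stream needs one decoding step per token of either kind,
# while the parallel layout needs only as many steps as the longer stream.
print(len(interleaved_seq), len(parallel_seq))
```

In this toy setting the interleaved stream is as long as the text and speech streams combined, whereas the parallel layout is only as long as the longer of the two, which is consistent with the slow-inference observation above.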

