CL | SD | Aug 25, 2025

Speech Discrete Tokens or Continuous Features? A Comparative Analysis for Spoken Language Understanding in SpeechLLMs

arXiv:2508.17863v18 citations | h-index: 12 | EMNLP
Originality: Synthesis-oriented
AI Analysis

This work examines the under-explored performance gap between discrete tokens and continuous features as speech inputs for SpeechLLMs. It offers insights for researchers in spoken language understanding, though the contribution is incremental because it focuses on comparative analysis.

The paper compares discrete tokens and continuous features for spoken language understanding in SpeechLLMs, finding that continuous features generally outperform discrete tokens across six tasks with both a small and a large LLM (Qwen1.5-0.5B and Llama3.1-8B).

With the rise of Speech Large Language Models (SpeechLLMs), two dominant approaches have emerged for speech processing: discrete tokens and continuous features. Each approach has demonstrated strong capabilities in audio-related processing tasks. However, the performance gap between these two paradigms has not been thoroughly explored. To address this gap, we present a fair comparison of self-supervised learning (SSL)-based discrete and continuous features under the same experimental settings. We evaluate their performance across six spoken language understanding-related tasks using both small and large-scale LLMs (Qwen1.5-0.5B and Llama3.1-8B). We further conduct in-depth analyses, including efficiency comparison, SSL layer analysis, LLM layer analysis, and robustness comparison. Our findings reveal that continuous features generally outperform discrete tokens in various tasks. Each speech processing method exhibits distinct characteristics and patterns in how it learns and processes speech information. We hope our results will provide valuable insights to advance spoken language understanding in SpeechLLMs.

