CL | SD | AS | Apr 9, 2025

TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling

arXiv:2504.07053v2 | 20 citations | h-index: 18
Originality: Incremental advance
AI Analysis

This paper addresses the problem of making human-LLM interaction more natural through improved spoken language modeling. It represents an incremental advance in joint speech-text methods.

The paper tackles the modality gap in joint speech-text modeling for spoken language models by introducing TASTE, a method that aligns speech tokens with the text transcription during tokenization. TASTE-based models perform comparably to previous work on SALMON and StoryCloze and significantly outperform other pre-trained SLMs on speech continuation.

Recent efforts target spoken language models (SLMs) that not only listen but also speak for more natural human-LLM interaction. Joint speech-text modeling is a promising direction to achieve this. However, the effectiveness of recent speech tokens for joint modeling remains underexplored. To address this, we introduce Text-Aligned Speech Tokenization and Embedding (TASTE), a method that directly addresses the modality gap by aligning speech tokens with the corresponding text transcription during the tokenization stage. We propose a method that achieves this through an attention-based aggregation mechanism, with speech reconstruction as the training objective. We conduct extensive experiments and show that TASTE can preserve essential paralinguistic information while dramatically reducing the token sequence length. With TASTE, we perform straightforward joint spoken language modeling by using Low-Rank Adaptation on the pre-trained text LLM. Experimental results show that TASTE-based SLMs perform comparably to previous work on SALMON and StoryCloze, while significantly outperforming other pre-trained SLMs on speech continuation across subjective and objective evaluations. To our knowledge, TASTE is the first end-to-end approach that utilizes a reconstruction objective to automatically learn a text-aligned speech tokenization and embedding suitable for spoken language modeling. Our demo, code, and model are available at https://mtkresearch.github.io/TASTE-SpokenLM.github.io.
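
The abstract does not spell out the aggregation mechanism, so the following is only a minimal sketch of one common way to realize attention-based aggregation: generic scaled dot-product cross-attention in which each text token acts as a query over the speech frames. The function name `aggregate_speech_to_text`, the choice of text-side queries with speech-side keys and values, and the toy vectors are all assumptions made for illustration, not the authors' implementation. The point it shows is how T speech frames can be pooled into exactly N vectors, one per text token, which is what shortens the speech sequence to the text length.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_speech_to_text(text_queries, speech_keys, speech_values):
    """Pool speech frames into one vector per text token (hypothetical helper).

    text_queries:  N query vectors of dimension d (one per text token)
    speech_keys:   T key vectors of dimension d (one per speech frame)
    speech_values: T value vectors (one per speech frame)
    Assumption: plain single-head scaled dot-product cross-attention.
    """
    scale = 1.0 / math.sqrt(len(text_queries[0]))
    value_dim = len(speech_values[0])
    aggregated = []
    for query in text_queries:
        # Similarity between this text token and every speech frame.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in speech_keys]
        weights = softmax(scores)
        # Weighted average of the frame values for this text token.
        pooled = [sum(w * value[j] for w, value in zip(weights, speech_values))
                  for j in range(value_dim)]
        aggregated.append(pooled)
    return aggregated


# Toy data: 2 text tokens and 5 speech frames (values are made up).
text_queries = [[1.0, 0.0], [0.0, 1.0]]
speech_keys = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [0.0, 4.0], [0.0, 4.0]]
speech_values = [[1.0], [1.0], [-1.0], [-1.0], [-1.0]]

aligned = aggregate_speech_to_text(text_queries, speech_keys, speech_values)
print(len(speech_keys), "frames ->", len(aligned), "text-aligned vectors")
```

In this toy setup the first text token attends mostly to the first two frames and the second to the last three, so the output has the same length as the text sequence regardless of how many speech frames there are.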

Code Implementations: 2 repos