AS CL LG SD | Oct 31, 2024

DC-Spin: A Speaker-invariant Speech Tokenizer for Spoken Language Models

Heng-Jui Chang, Hongyu Gong, Changhan Wang, James Glass, Yu-An Chung

MIT

arXiv:2410.24177v1 | 9.27 citations | h-index: 66 | INTERSPEECH

Originality: Incremental advance

AI Analysis

This work addresses speaker invariance in speech processing for spoken language models, offering incremental improvements in tokenization efficiency and robustness.

The paper tackles speaker variability in speech tokenization for spoken language models by introducing DC-Spin, a method that extracts speaker-invariant tokens rich in phonetic information, which improves performance on zero-shot SLM tasks and speech resynthesis.

Spoken language models (SLMs) have gained increasing attention with advancements in text-based, decoder-only language models. SLMs process text and speech, enabling simultaneous speech understanding and generation. This paper presents Double-Codebook Speaker-invariant Clustering (DC-Spin), which aims to improve speech tokenization by bridging audio signals and SLM tokens. DC-Spin extracts speaker-invariant tokens rich in phonetic information and resilient to input variations, enhancing zero-shot SLM tasks and speech resynthesis. We propose a chunk-wise approach to enable streamable DC-Spin without retraining or degradation. Comparisons of tokenization methods (self-supervised and neural audio codecs), model scalability, and downstream task proxies show that tokens easily modeled by an n-gram LM or aligned with phonemes offer strong performance, providing insights for designing speech tokenizers for SLMs.
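
To make the phrase "tokens easily modeled by an n-gram LM" concrete, here is a minimal sketch of one way such a proxy could be computed: the perplexity of an add-one-smoothed bigram model over discrete token sequences. This is an illustration only, not the paper's evaluation protocol. The function names, the toy token streams, the bigram order, and the smoothing choice are all assumptions made for this example.

```python
import math
import random
from collections import Counter


def train_bigram(sequences):
    """Count contexts and bigrams over discrete token sequences."""
    context_counts = Counter()
    bigram_counts = Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts


def bigram_perplexity(sequences, context_counts, bigram_counts, vocab_size):
    """Perplexity under an add-one-smoothed bigram model (lower = easier to model)."""
    log_prob = 0.0
    n = 0
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            p = (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size)
            log_prob += math.log(p)
            n += 1
    if n == 0:
        raise ValueError("need at least one bigram to score")
    return math.exp(-log_prob / n)


# Hypothetical toy data: a 4-unit vocabulary standing in for tokenizer output.
VOCAB_SIZE = 4
rng = random.Random(0)

# A highly regular token stream versus a uniformly random one.
regular = [[0, 1, 2, 3] * 5 for _ in range(4)]
noisy = [[rng.randrange(VOCAB_SIZE) for _ in range(20)] for _ in range(4)]

# For brevity the model is scored on its own training data; a real
# comparison would score held-out sequences.
ppl_regular = bigram_perplexity(regular, *train_bigram(regular), VOCAB_SIZE)
ppl_noisy = bigram_perplexity(noisy, *train_bigram(noisy), VOCAB_SIZE)

print(f"regular stream perplexity: {ppl_regular:.2f}")
print(f"noisy stream perplexity:   {ppl_noisy:.2f}")
```

In this toy setup the regular stream scores a much lower perplexity than the random one, which is the sense in which a token sequence can be "easy" for an n-gram LM. Comparing real tokenizers this way would require tokenizing the same held-out speech with each one and accounting for differences in vocabulary size and token rate.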
