AS | CL | SD | Jun 18, 2025

Factorized RVQ-GAN For Disentangled Speech Tokenization

arXiv:2506.15456v1 | 1 citation | h-index: 21 | INTERSPEECH
Originality: Incremental advance
AI Analysis

This work addresses the need for interpretable and high-quality discrete speech representations for downstream speech generation and understanding tasks, though it is incremental as it builds on existing methods like HuBERT and LaBSE.

The paper builds a unified neural speech codec whose factorized bottleneck separates linguistic levels. The resulting tokens align with phonemes and word-level semantics, and the model outperforms single-level baselines in both disentanglement and reconstruction quality.

We propose Hierarchical Audio Codec (HAC), a unified neural speech codec that factorizes its bottleneck into three linguistic levels (acoustic, phonetic, and lexical) within a single model. HAC leverages two knowledge distillation objectives: one from a pre-trained speech encoder (HuBERT) for phoneme-level structure, and another from a text-based encoder (LaBSE) for lexical cues. Experiments on English and multilingual data show that HAC's factorized bottleneck yields disentangled token sets: one aligns with phonemes, while another captures word-level semantics. Quantitative evaluations confirm that HAC tokens preserve naturalness and provide interpretable linguistic information, outperforming single-level baselines in both disentanglement and reconstruction quality. These findings underscore HAC's potential as a unified discrete speech representation, bridging acoustic detail and lexical meaning for downstream speech generation and understanding tasks.
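To make the idea of a factorized bottleneck with distillation more concrete, here is a minimal pure-Python sketch. It is not the paper's implementation. It assumes the three levels are quantized as a residual chain (lexical, then phonetic, then acoustic), which the abstract does not specify, and it uses a cosine-distance loss as a stand-in for the two distillation objectives. The names `factorized_rvq` and `cosine_distill_loss`, the level order, the dimensions, and the random codebooks are all hypothetical. In a real system the codebooks would be learned and the teacher features from HuBERT or LaBSE would first be projected to the bottleneck dimension.

```python
import math
import random


def nearest(codebook, x):
    """Return (index, codeword) of the entry closest to x in squared L2."""
    best = min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], x)),
    )
    return best, codebook[best]


def factorized_rvq(x, codebooks):
    """Quantize x with one codebook per level (assumed residual chain).

    Each level codes the residual left by the levels before it.
    Returns per-level token ids, per-level codewords, and the
    reconstruction, which is the sum of the chosen codewords.
    """
    residual = list(x)
    tokens, parts = {}, {}
    for level, codebook in codebooks.items():
        idx, code = nearest(codebook, residual)
        tokens[level] = idx
        parts[level] = code
        residual = [r - c for r, c in zip(residual, code)]
    recon = [sum(vals) for vals in zip(*parts.values())]
    return tokens, parts, recon


def cosine_distill_loss(student, teacher):
    """1 - cosine similarity: 0 when the vectors point the same way."""
    dot = sum(s * t for s, t in zip(student, teacher))
    norm = math.sqrt(sum(s * s for s in student)) * math.sqrt(
        sum(t * t for t in teacher)
    )
    return 1.0 - dot / norm if norm > 0 else 1.0


# Toy setup: random codebooks stand in for learned ones (assumption).
rng = random.Random(0)
DIM, SIZE = 4, 8
codebooks = {
    level: [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(SIZE)]
    for level in ("lexical", "phonetic", "acoustic")  # assumed order
}
frame = [rng.uniform(-1, 1) for _ in range(DIM)]  # one encoder frame
tokens, parts, recon = factorized_rvq(frame, codebooks)

# Hypothetical teacher vectors, already projected to DIM.
phone_teacher = [rng.uniform(-1, 1) for _ in range(DIM)]
word_teacher = [rng.uniform(-1, 1) for _ in range(DIM)]
distill = (
    cosine_distill_loss(parts["phonetic"], phone_teacher)
    + cosine_distill_loss(parts["lexical"], word_teacher)
)
print(tokens, round(distill, 3))
```

Each frame yields one token id per level, so the phonetic and lexical token streams can be inspected or used separately while the sum of all three codewords feeds the decoder. Under these assumptions, the distillation term only pulls the phonetic and lexical codewords toward their teachers, leaving the acoustic level to absorb the remaining detail.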

