Factorized RVQ-GAN For Disentangled Speech Tokenization
This work addresses the need for interpretable, high-quality discrete speech representations for downstream speech generation and understanding tasks. The contribution is incremental in that it builds on existing pre-trained models such as HuBERT and LaBSE.
The paper tackles the problem of building a unified neural speech codec whose bottleneck is factorized into separate linguistic levels. The resulting tokens align with phonemes and word-level semantics, and the codec outperforms single-level baselines in both disentanglement and reconstruction quality.
We propose Hierarchical Audio Codec (HAC), a unified neural speech codec that factorizes its bottleneck into three linguistic levels (acoustic, phonetic, and lexical) within a single model. HAC leverages two knowledge distillation objectives: one from a pre-trained speech encoder (HuBERT) for phoneme-level structure, and another from a text-based encoder (LaBSE) for lexical cues. Experiments on English and multilingual data show that HAC's factorized bottleneck yields disentangled token sets: one aligns with phonemes, while another captures word-level semantics. Quantitative evaluations confirm that HAC tokens preserve naturalness and provide interpretable linguistic information, outperforming single-level baselines in both disentanglement and reconstruction quality. These findings underscore HAC's potential as a unified discrete speech representation, bridging acoustic detail and lexical meaning for downstream speech generation and understanding tasks.
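To make the two distillation objectives concrete, the following is a minimal sketch of how a reconstruction term could be combined with one loss per teacher. The abstract does not state the loss form, so everything here is an assumption for illustration: the cosine-distance loss, the weights `w_phon` and `w_lex`, and all function names are hypothetical, and the student and teacher features are assumed to be already projected to a common dimension and aligned frame by frame.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as carrying no direction
    return dot / (norm_a * norm_b)


def distillation_loss(student: Sequence[Vector], teacher: Sequence[Vector]) -> float:
    """Mean (1 - cosine similarity) over aligned frames (assumed loss form)."""
    if len(student) != len(teacher) or len(student) == 0:
        raise ValueError("student and teacher need the same non-zero frame count")
    per_frame = [1.0 - cosine_similarity(s, t) for s, t in zip(student, teacher)]
    return sum(per_frame) / len(per_frame)


def total_loss(
    recon_loss: float,
    phon_student: Sequence[Vector],
    hubert_teacher: Sequence[Vector],
    lex_student: Sequence[Vector],
    labse_teacher: Sequence[Vector],
    w_phon: float = 1.0,  # hypothetical weight, not taken from the paper
    w_lex: float = 1.0,   # hypothetical weight, not taken from the paper
) -> float:
    """Reconstruction loss plus one distillation term per linguistic level."""
    phon_term = distillation_loss(phon_student, hubert_teacher)
    lex_term = distillation_loss(lex_student, labse_teacher)
    return recon_loss + w_phon * phon_term + w_lex * lex_term


# Toy features: the phonetic level matches its teacher in direction,
# while the lexical level is orthogonal to its teacher.
phon_student = [[1.0, 0.0], [0.0, 2.0]]
hubert_teacher = [[2.0, 0.0], [0.0, 1.0]]
lex_student = [[1.0, 0.0]]
labse_teacher = [[0.0, 1.0]]

loss = total_loss(0.5, phon_student, hubert_teacher, lex_student, labse_teacher, w_lex=0.5)
```

In this toy case the phonetic term is zero because the vectors differ only in scale, and the lexical term takes the value 1 because the vectors are orthogonal. In a real codec each term would be computed on a different part of the factorized bottleneck, so the HuBERT signal would shape the phonetic tokens and the LaBSE signal would shape the lexical tokens.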