SD AI AS, Jul 25, 2025

HH-Codec: High Compression High-fidelity Discrete Neural Codec for Spoken Language Modeling

Rongkun Xue, Yazhe Niu, Shuai Hu, Zixin Yin, Yongqiang Yao, Jing Yang

arXiv:2507.18897v19.32 citations, h-index: 3, Has Code

Originality: Incremental advance

AI Analysis

This work addresses efficiency and fidelity issues in speech processing for applications such as generative models, though it appears incremental because it builds on existing neural codec methods.

The paper tackles two obstacles to discrete speech tokenization in large-scale speech-to-speech systems: the complexity of parallel token streams from multiple quantizers and the computational cost of codecs with long token sequences. It introduces HH-Codec, a single-quantizer codec that achieves state-of-the-art speech reconstruction at 0.3 kbps with 24 tokens per second for 24 kHz audio.

Discrete speech tokenization is a fundamental component in speech codecs. However, in large-scale speech-to-speech systems, the complexity of parallel streams from multiple quantizers and the computational cost of high-time-dimensional codecs pose significant challenges. In this paper, we introduce HH-Codec, a neural codec that achieves extreme compression at 24 tokens per second for 24 kHz audio while relying on single-quantizer inference. Our approach involves a carefully designed Vector Quantization space for Spoken Language Modeling, optimizing compression efficiency while minimizing information loss. Building on this, we propose an asymmetric encoder-decoder architecture (Audio-VQ-Mel-Audio) that leverages dual supervision and progressive training to enhance reconstruction stability and fidelity. HH-Codec achieves state-of-the-art performance in speech reconstruction with an ultra-low bandwidth of 0.3 kbps. We further evaluate its effectiveness in codebook utilization and generative model adaptation, with extensive ablations validating the necessity of each module. HH-Codec is available at https://github.com/opendilab/HH-Codec.
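The relationship between token rate and bandwidth quoted in the abstract can be checked with a short calculation. The sketch below is illustrative only: the codebook size of 8192, the 16-bit PCM reference, and the multi-quantizer comparison configuration are assumptions chosen for the arithmetic and are not stated in this summary.

```python
import math


def bitrate_bps(tokens_per_second: float, codebook_size: int, num_quantizers: int = 1) -> float:
    """Bits per second for a codec that emits `num_quantizers` parallel
    token streams, each drawing from a codebook of `codebook_size` entries."""
    return tokens_per_second * num_quantizers * math.log2(codebook_size)


# Figures taken from the abstract: 0.3 kbps at 24 tokens per second.
implied_bits_per_token = 300.0 / 24.0  # 12.5 bits per token

# ASSUMPTION: a single codebook of 8192 entries (13 bits), picked because it
# is the nearest power of two to the implied 12.5 bits per token.
ASSUMED_CODEBOOK_SIZE = 8192
single_stream_bps = bitrate_bps(24, ASSUMED_CODEBOOK_SIZE)

# ASSUMPTION: a hypothetical multi-quantizer codec for comparison
# (75 frames per second, 8 quantizers, 1024 entries each).
multi_stream_bps = bitrate_bps(75, 1024, num_quantizers=8)

# ASSUMPTION: uncompressed reference is 24 kHz mono 16-bit PCM.
pcm_bps = 24_000 * 16
compression_ratio = pcm_bps / single_stream_bps

print(f"implied bits per token: {implied_bits_per_token}")
print(f"single-quantizer bitrate: {single_stream_bps} bps")
print(f"multi-quantizer bitrate: {multi_stream_bps} bps")
print(f"compression vs. 16-bit PCM: {compression_ratio:.1f}x")
```

Under these assumptions the single-quantizer stream comes to 312 bps, which rounds to the 0.3 kbps quoted above, and a language model consuming it sees 24 tokens per second in one stream rather than several hundred spread across parallel streams.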
