LG | AI | SD | AS | Feb 6, 2025

FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks

arXiv:2502.04465v2 | 21 citations | h-index: 31
AI Analysis

FocalCodec addresses the need for efficient low-bitrate speech coding in communication and generative modeling. It is a novel method rather than an incremental improvement.

The paper tackles the problem of high bitrates and information loss in neural speech codecs by introducing FocalCodec, which uses focal modulation and a single binary codebook to compress speech at 0.16-0.65 kbps, achieving competitive performance in tasks like resynthesis and voice conversion.
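
For a single-codebook codec, bitrate is simply token rate times bits per token. The sketch below checks that arithmetic against the quoted 0.16-0.65 kbps range. The 13-bit code size (8192 entries) and the token rates of 12.5, 25 and 50 Hz are not stated on this page; they are assumptions chosen because they reproduce the quoted range.

```python
def bitrate_kbps(token_rate_hz: float, bits_per_token: int) -> float:
    """Bitrate of a single-codebook codec in kbps (1 kbps = 1000 bit/s)."""
    return token_rate_hz * bits_per_token / 1000.0


# Assumed values (not given in the text above): 13 bits per token,
# i.e. a codebook of 2**13 = 8192 binary codes.
BITS_PER_TOKEN = 13

for rate_hz in (12.5, 25.0, 50.0):  # assumed token rates in Hz
    kbps = bitrate_kbps(rate_hz, BITS_PER_TOKEN)
    print(f"{rate_hz:5.1f} Hz -> {kbps:.4f} kbps")
```

Under these assumptions the lowest rate gives 0.1625 kbps and the highest 0.65 kbps, which matches the range quoted above after rounding.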

Large language models have revolutionized natural language processing through self-supervised pretraining on massive datasets. Inspired by this success, researchers have explored adapting these methods to speech by discretizing continuous audio into tokens using neural audio codecs. However, existing approaches face limitations, including high bitrates, the loss of either semantic or acoustic information, and the reliance on multi-codebook designs when trying to capture both, which increases architectural complexity for downstream tasks. To address these challenges, we introduce FocalCodec, an efficient low-bitrate codec based on focal modulation that utilizes a single binary codebook to compress speech between 0.16 and 0.65 kbps. FocalCodec delivers competitive performance in speech resynthesis and voice conversion at lower bitrates than the current state-of-the-art, while effectively handling multilingual speech and noisy environments. Evaluation on downstream tasks shows that FocalCodec successfully preserves sufficient semantic and acoustic information, while also being well-suited for generative modeling. Demo samples and code are available at https://lucadellalib.github.io/focalcodec-web/.
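
One way to picture a single binary codebook is sign-based quantization: each latent dimension contributes one bit, and the token index is those bits read as an integer, so no explicit codebook table has to be stored. The following is a minimal sketch of that general idea, not the authors' implementation; the function names and the 4-dimensional example are hypothetical.

```python
import math


def binary_quantize(latent):
    """Quantize a latent vector by sign.

    Returns (bits, index, code), where code has unit L2 norm.
    Illustrative only; not the FocalCodec implementation.
    """
    bits = [1 if x > 0 else 0 for x in latent]
    index = 0
    for b in bits:  # most significant bit first
        index = (index << 1) | b
    scale = 1.0 / math.sqrt(len(latent))
    code = [scale if b else -scale for b in bits]
    return bits, index, code


def index_to_code(index, num_bits):
    """Rebuild the unit-norm code vector from a token index."""
    bits = [(index >> (num_bits - 1 - i)) & 1 for i in range(num_bits)]
    scale = 1.0 / math.sqrt(num_bits)
    return [scale if b else -scale for b in bits]


# Hypothetical 4-dimensional latent; a real codec would use more dimensions.
bits, index, code = binary_quantize([0.3, -1.2, 0.8, -0.1])
print(bits, index)  # [1, 0, 1, 0] 10
```

Because the index alone determines the code vector, a decoder only needs the integer token stream, which is part of what makes a single binary codebook convenient for downstream generative models.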
