Unlocking Temporal Flexibility: Neural Speech Codec with Variable Frame Rate
This work addresses bitrate and token-sequence inefficiency in real-time speech applications and represents an incremental improvement.
The paper tackles the inefficiency of constant frame rate (CFR) coding in neural speech codecs by proposing Temporally Flexible Coding (TFC), a technique that introduces variable frame rate (VFR). It reports optimal reconstruction quality and competitive performance at lower frame rates.
Most neural speech codecs achieve bitrate adjustment through intra-frame mechanisms, such as codebook dropout, at a Constant Frame Rate (CFR). However, speech segments inherently have time-varying information density (e.g., silent intervals versus voiced regions). This property makes CFR suboptimal in terms of bitrate and token sequence length, hindering efficiency in real-time applications. In this work, we propose a Temporally Flexible Coding (TFC) technique, introducing variable frame rate (VFR) into neural speech codecs for the first time. TFC enables seamlessly tunable average frame rates and dynamically allocates frame rates based on temporal entropy. Experimental results show that a codec with TFC achieves optimal reconstruction quality with high flexibility and maintains competitive performance even at lower frame rates. Our approach is promising for integration with other efforts to develop low-frame-rate neural speech codecs for more efficient downstream tasks.
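
To make the idea of entropy-driven frame rate allocation concrete, the following is a minimal sketch and not the paper's method. The abstract does not say how temporal entropy is estimated or how rates are assigned, so everything here is an assumption: the entropy proxy (Shannon entropy of a quantized amplitude histogram), the candidate rates (12.5, 25, and 50 Hz), and the function names `segment_entropy`, `allocate_frame_rates`, and `average_frame_rate` are all hypothetical.

```python
import math
from collections import Counter


def segment_entropy(samples, n_bins=16):
    """Shannon entropy (in bits) of the amplitude histogram of one segment.

    Assumption: samples are floats roughly in [-1, 1]. This histogram entropy
    is only a stand-in for the "temporal entropy" named in the abstract.
    """
    if not samples:
        return 0.0
    # Quantize each amplitude into one of n_bins buckets, clamping outliers.
    bins = [
        max(0, min(n_bins - 1, int((s + 1.0) / 2.0 * n_bins)))
        for s in samples
    ]
    counts = Counter(bins)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def allocate_frame_rates(segments, rates=(12.5, 25.0, 50.0), n_bins=16):
    """Pick one candidate frame rate (Hz) per segment from its entropy.

    Low-entropy segments (e.g. silence) get the lowest rate; high-entropy
    segments get the highest. The candidate rates are hypothetical.
    """
    max_entropy = math.log2(n_bins)
    chosen = []
    for seg in segments:
        h = segment_entropy(seg, n_bins) / max_entropy  # normalized to [0, 1]
        idx = min(len(rates) - 1, int(h * len(rates)))
        chosen.append(rates[idx])
    return chosen


def average_frame_rate(chosen):
    """Mean frame rate over equal-length segments."""
    return sum(chosen) / len(chosen)


if __name__ == "__main__":
    silence = [0.0] * 160
    # A synthetic "busy" segment whose amplitudes cover all 16 buckets evenly.
    busy = [(i % 16) / 8.0 - 1.0 + 0.01 for i in range(160)]
    rates = allocate_frame_rates([silence, busy])
    print(rates, average_frame_rate(rates))
```

In this toy setup, the silent segment is assigned the lowest candidate rate and the busy segment the highest, so the average frame rate falls between the two. A CFR codec would instead spend the same rate on both segments.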