CodecFlow: Efficient Bandwidth Extension via Conditional Flow Matching in Neural Codec Latent Space
The paper addresses speech bandwidth extension, which improves the clarity and intelligibility of low-bandwidth speech in audio applications, and targets gains in both computational efficiency and high-frequency fidelity.
The paper restores high-frequency content missing from low-bandwidth speech and reports strong spectral fidelity and enhanced perceptual quality on 8 kHz to 16 kHz and 44.1 kHz bandwidth extension tasks.
Speech bandwidth extension (BWE) improves clarity and intelligibility by restoring or inferring the high-frequency content missing from low-bandwidth speech. Existing methods often rely on spectrogram or waveform modeling, which can incur high computational cost and offer limited high-frequency fidelity. Neural audio codecs provide compact latent representations that better preserve acoustic detail, yet accurately recovering high-resolution latent information remains challenging because of representation mismatch. We present CodecFlow, a neural codec-based BWE framework that performs efficient speech reconstruction in a compact latent space. CodecFlow employs a voicing-aware conditional flow converter on continuous codec embeddings and a structure-constrained residual vector quantizer to improve latent alignment stability. Optimized end-to-end, CodecFlow achieves strong spectral fidelity and enhanced perceptual quality on 8 kHz to 16 kHz and 44.1 kHz speech BWE tasks.
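The abstract names two building blocks without giving their details: conditional flow matching on continuous codec embeddings, and residual vector quantization. The following is a minimal NumPy sketch of the generic, textbook forms of both, not the authors' implementation. It assumes a linear interpolation path between a noise sample and the target latent, a fixed-step Euler sampler, and plain nearest-neighbour residual quantization. All function names, the `cond` argument (a placeholder for whatever low-band and voicing information the converter receives), and the array shapes are hypothetical; the voicing-aware conditioning and the structure constraint on the quantizer are not modelled here.

```python
import numpy as np


def cfm_training_pair(x0, x1, t):
    """Build one flow-matching regression pair on a linear path (assumed).

    x0: (N, D) noise samples, x1: (N, D) target latents, t: (N,) times in [0, 1].
    Returns the path point x_t and the velocity the model should predict.
    """
    t = t[:, None]                      # broadcast time over the latent dimension
    x_t = (1.0 - t) * x0 + t * x1       # point on the straight line from x0 to x1
    v_target = x1 - x0                  # constant velocity along that line
    return x_t, v_target


def cfm_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((v_pred - v_target) ** 2))


def euler_sample(velocity_fn, x0, cond, n_steps=8):
    """Integrate dx/dt = velocity_fn(x, t, cond) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t, cond)
    return x


def rvq_encode(x, codebooks):
    """Plain residual vector quantization (no structure constraint).

    x: (N, D) continuous latents; codebooks: list of (K, D) arrays, one per stage.
    Each stage quantizes the residual left by the previous stages.
    Returns (indices of shape (N, n_stages), quantized latents of shape (N, D)).
    """
    residual = x.copy()
    quantized = np.zeros_like(x)
    indices = []
    for cb in codebooks:
        # Squared distance from every residual vector to every codeword: (N, K)
        dist = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = dist.argmin(axis=1)
        chosen = cb[idx]
        quantized += chosen
        residual -= chosen
        indices.append(idx)
    return np.stack(indices, axis=1), quantized
```

In a trained system, `velocity_fn` would be a neural network taking the conditioning as input; here it is left as a callable so the sampler can be exercised with any stand-in. Working in the codec latent space means `D` is the embedding width rather than a waveform or spectrogram dimension, which is the source of the efficiency argument made in the abstract.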