FeatherWave: An efficient high-fidelity neural vocoder with multi-band linear prediction
This work addresses slow speech synthesis in real-time applications; the contribution is incremental, since it builds on the existing WaveRNN and LPCNet methods.
The paper tackles the inefficiency of LPCNet for online speech generation by proposing FeatherWave, a neural vocoder that combines multi-band linear predictive coding with WaveRNN. It generates 24 kHz audio 9x faster than real time on a single CPU, with better quality than LPCNet.
In this paper, we propose FeatherWave, yet another variant of the WaveRNN vocoder, which combines multi-band signal processing with linear predictive coding. LPCNet, a recently proposed neural vocoder that exploits the linear predictive characteristics of the speech signal within the WaveRNN architecture, can generate high-quality speech faster than real time on a single CPU core. However, LPCNet is still not efficient enough for online speech generation tasks. To address this issue, we adopt multi-band linear predictive coding for the WaveRNN vocoder. The multi-band method enables the model to generate several speech samples in parallel in one step, which significantly improves the efficiency of speech synthesis. The proposed model with 4 sub-bands needs less than 1.6 GFLOPS for speech generation. In our experiments, it generates 24 kHz high-fidelity audio 9x faster than real time on a single CPU, which is much faster than the LPCNet vocoder. Furthermore, our subjective listening test shows that FeatherWave generates speech with better quality than LPCNet.
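The efficiency argument can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes, as in LPCNet, that each output sample is a linear prediction from past samples plus an excitation term supplied by the network, and that each sub-band keeps its own history and coefficients. The function names, the coefficient values, and the excitation inputs are hypothetical; the sub-band analysis and synthesis filters and the neural network itself are omitted.

```python
import numpy as np


def autoregressive_steps_per_second(sample_rate_hz: int, num_bands: int) -> int:
    """Sequential steps per second of audio when one step emits one sample per sub-band."""
    if sample_rate_hz % num_bands != 0:
        raise ValueError("sample rate must be divisible by the number of sub-bands")
    return sample_rate_hz // num_bands


def linear_prediction(history: np.ndarray, lpc_coeffs: np.ndarray) -> float:
    """Compute p_t = sum_k a_k * s_{t-k}, where history[-1] holds s_{t-1}."""
    order = len(lpc_coeffs)
    if len(history) < order:
        raise ValueError("history is shorter than the prediction order")
    recent = history[-order:][::-1]  # s_{t-1}, s_{t-2}, ..., s_{t-order}
    return float(np.dot(lpc_coeffs, recent))


def multiband_step(histories, coeffs, excitations) -> np.ndarray:
    """One generation step: every band adds its excitation to its own prediction.

    Assumption: the excitations stand in for what the network would output.
    """
    return np.array([
        linear_prediction(h, a) + e
        for h, a, e in zip(histories, coeffs, excitations)
    ])


if __name__ == "__main__":
    # 24 kHz audio with 4 sub-bands needs a quarter of the sequential steps.
    print(autoregressive_steps_per_second(24000, 4))

    # Toy values only; real coefficients would come from per-band LPC analysis.
    histories = [np.array([1.0, 2.0, 3.0]) for _ in range(4)]
    coeffs = [np.array([0.5, 0.25]) for _ in range(4)]
    excitations = np.array([0.0, 0.1, -0.1, 1.0])
    print(multiband_step(histories, coeffs, excitations))
```

With 4 sub-bands, one second of 24 kHz audio takes 6000 sequential steps rather than 24000, which is where the speed-up described in the abstract comes from.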