Towards Universal Neural Vocoding with a Multi-band Excited WaveNet
This work addresses the need for efficient and versatile voice synthesis models for speech and singing generation, although the contribution is incremental in that it builds on existing excitation vocoder frameworks.
The paper tackles the problem of creating a universal neural vocoder that generates high-quality voice signals from arbitrary mel spectrograms, achieving perceptual quality comparable to state-of-the-art models while using significantly smaller training datasets and fewer parameters.
This paper introduces the Multi-Band Excited WaveNet, a neural vocoder for speaking and singing voices. It aims to advance the state of the art towards a universal neural vocoder, that is, a model that can generate voice signals from arbitrary mel spectrograms extracted from voice signals. Following the success of the DDSP model and the development of the recently proposed excitation vocoders, we propose a vocoder structure consisting of multiple specialized DNNs that are combined with dedicated signal processing components. All components are implemented as differentiable operators and therefore allow joint optimization of the model parameters. To demonstrate the capacity of the model to reproduce high-quality voice signals, we evaluate the model on single and multi speaker/singer datasets. We conduct a subjective evaluation demonstrating that the models support a wide range of domain variations (unseen voices, languages, expressivity), achieving perceptual quality that compares with a state-of-the-art universal neural vocoder while using significantly smaller training datasets and significantly fewer parameters. We also demonstrate remaining limits of the universality of neural vocoders, e.g. the creation of saturated singing voices.
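
For readers unfamiliar with excitation vocoders, the following is a minimal sketch of the general source-filter idea they build on: a pulse train driven by a fundamental frequency contour is shaped by a filter. It is not the architecture proposed in this paper. The function names, the 125 Hz contour, and the toy decaying impulse response are all assumptions chosen for illustration; in a neural vocoder the excitation and the filter would be predicted by DNNs from the mel spectrogram.

```python
import numpy as np

def pulse_excitation(f0, sample_rate):
    """Hypothetical helper: unit pulse train following a per-sample F0 contour in Hz."""
    # Accumulated phase in cycles; one pulse per completed cycle.
    phase = np.cumsum(f0 / sample_rate)
    # A pulse is emitted on every sample where the integer part of the phase increases.
    crossings = np.diff(np.floor(phase), prepend=0.0) > 0
    return crossings.astype(np.float64)

def apply_filter(excitation, impulse_response):
    """Hypothetical helper: shape the excitation with an FIR filter, keeping the input length."""
    return np.convolve(excitation, impulse_response)[: len(excitation)]

sample_rate = 16000
f0 = np.full(sample_rate, 125.0)                 # one second at a constant 125 Hz (assumed)
excitation = pulse_excitation(f0, sample_rate)

# Toy stand-in for a vocal tract filter: an exponentially decaying impulse response.
impulse_response = np.exp(-np.arange(64) / 8.0)
signal = apply_filter(excitation, impulse_response)
```

Both steps are built from elementary array operations, which is what makes it possible in principle to express such signal processing components as differentiable operators and train them jointly with the networks that drive them.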