ConSinger: Efficient High-Fidelity Singing Voice Generation with Minimal Steps
This work addresses efficiency issues in singing voice synthesis for applications requiring fast generation, though it appears incremental as it builds on existing consistency model techniques.
The authors tackled the problem of slow inference in diffusion-based singing voice synthesis by proposing ConSinger, a method based on consistency models, which achieved high-fidelity synthesis with minimal steps while maintaining competitive quality and speed compared to baselines.
A singing voice synthesis (SVS) system is expected to generate high-fidelity singing voice from a given music score (lyrics, duration, and pitch). Recently, diffusion models have performed well in this field. However, they obtain high-quality samples at the cost of inference speed, which limits their application scenarios. To synthesize high-quality singing voice more efficiently, we propose ConSinger, a singing voice synthesis method based on the consistency model that achieves high-fidelity synthesis with minimal steps. The model is trained by applying a consistency constraint, and generation quality improves greatly at the expense of a small amount of inference speed. Our experiments show that ConSinger is highly competitive with the baseline model in terms of generation speed and quality. Audio samples are available at https://keylxiao.github.io/consinger.
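
To illustrate what a consistency constraint means in general, the following is a minimal sketch of the standard consistency-model formulation. It is not ConSinger's actual implementation, which the abstract does not describe. The constants `SIGMA_DATA` and `EPS`, the stand-in `toy_network`, and all function names are assumptions made for illustration. The sketch shows two things: a skip-connection parameterization that forces the model to return its input unchanged at the smallest noise level, and a loss that penalizes disagreement between outputs at two noise levels on the same trajectory.

```python
import math

SIGMA_DATA = 0.5  # assumed data standard deviation (hypothetical value)
EPS = 0.002       # assumed smallest noise level (hypothetical value)


def c_skip(t):
    # Weight on the noisy input; equals 1 at t == EPS.
    return SIGMA_DATA ** 2 / ((t - EPS) ** 2 + SIGMA_DATA ** 2)


def c_out(t):
    # Weight on the network output; equals 0 at t == EPS.
    return SIGMA_DATA * (t - EPS) / math.sqrt(SIGMA_DATA ** 2 + t ** 2)


def toy_network(x_t, t, condition):
    # Hypothetical stand-in for a trained network: it just echoes the
    # conditioning features (in SVS these would derive from the music score).
    return list(condition)


def consistency_fn(x_t, t, condition):
    # Maps a noisy sample at noise level t to an estimate of the clean sample.
    pred = toy_network(x_t, t, condition)
    return [c_skip(t) * x + c_out(t) * p for x, p in zip(x_t, pred)]


def consistency_loss(x_near, t_near, x_far, t_far, condition):
    # Mean squared disagreement between outputs at two noise levels.
    # Real training would treat the t_near branch as a fixed target
    # (for example an EMA copy of the network); this toy omits that.
    far = consistency_fn(x_far, t_far, condition)
    near = consistency_fn(x_near, t_near, condition)
    return sum((a - b) ** 2 for a, b in zip(far, near)) / len(far)
```

Under this parameterization, a single call such as `consistency_fn(noise, t_max, condition)` already yields a sample estimate, which is why consistency models can generate in very few steps. Additional steps trade some speed for quality.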