AS AI HC SD | Sep 10, 2023

VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching

Yiwei Guo, Chenpeng Du, Ziyang Ma, Xie Chen, Kai Yu

arXiv:2309.05027v3 | 18.872 citations | h-index: 68 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses efficiency issues in text-to-speech synthesis for applications requiring faster generation, though it appears incremental as it builds on existing flow matching techniques.

The paper tackles the inefficiency of diffusion models in text-to-speech synthesis by proposing VoiceFlow, an acoustic model that uses a rectified flow matching algorithm to achieve high-quality synthesis with fewer sampling steps, as validated by subjective and objective evaluations.

Although diffusion models in text-to-speech have become a popular choice due to their strong generative ability, the intrinsic complexity of sampling from diffusion models harms their efficiency. Alternatively, we propose VoiceFlow, an acoustic model that utilizes a rectified flow matching algorithm to achieve high synthesis quality with a limited number of sampling steps. VoiceFlow formulates the process of generating mel-spectrograms into an ordinary differential equation conditional on text inputs, whose vector field is then estimated. The rectified flow technique then effectively straightens its sampling trajectory for efficient synthesis. Subjective and objective evaluations on both single and multi-speaker corpora showed the superior synthesis quality of VoiceFlow compared to the diffusion counterpart. Ablation studies further verified the validity of the rectified flow technique in VoiceFlow.
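
The abstract's two central ideas, regressing a vector field and then sampling along a straightened trajectory, can be illustrated with a toy sketch. This is not the authors' implementation: the linear noise-to-data path, the function names (`cfm_training_pair`, `flow_matching_loss`, `euler_sample`), the array shapes, and the closed-form stand-in for a trained network are all assumptions made for illustration. It shows only why a straight trajectory permits very few Euler steps.

```python
import numpy as np

def cfm_training_pair(x0, x1, t):
    """Return a point on the straight path from noise x0 to data x1, plus its velocity.

    Assumption: a plain linear interpolation path; the paper's exact path may differ.
    """
    x_t = (1.0 - t) * x0 + t * x1   # position at time t in [0, 1]
    target_v = x1 - x0              # constant velocity along a straight line
    return x_t, target_v

def flow_matching_loss(pred_v, target_v):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((pred_v - target_v) ** 2))

def euler_sample(vector_field, x0, cond, n_steps):
    """Integrate dx/dt = vector_field(x, t, cond) from t=0 to t=1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + dt * vector_field(x, t, cond)
    return x

# Toy stand-ins: 4 "frames" of an 80-bin mel-spectrogram (hypothetical shapes).
rng = np.random.default_rng(0)
noise = rng.standard_normal((4, 80))
mel = rng.standard_normal((4, 80))

# A perfectly straight field, standing in for a trained, rectified model.
# Here `cond` plays the role of the text-conditioned target.
straight_field = lambda x, t, cond: cond - noise

one_step = euler_sample(straight_field, noise, mel, n_steps=1)
ten_steps = euler_sample(straight_field, noise, mel, n_steps=10)
```

Because the velocity is constant along a straight path, `one_step` and `ten_steps` both land on `mel` up to floating-point error. A curved field would instead need many steps to limit discretization error, which is the motivation for straightening the trajectory.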
