FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching
This work targets synthesis quality in text-to-speech (TTS). It is an incremental advance that combines two existing techniques: flow matching and autoregressive language modeling.
The paper addresses continuous-valued token modeling and temporal coherence in autoregressive speech synthesis. It proposes FELLE, which integrates language modeling with token-wise flow matching and a coarse-to-fine mechanism. The reported experiments show significant improvements in TTS generation quality.
To advance continuous-valued token modeling and temporal-coherence enforcement, we propose FELLE, an autoregressive model that integrates language modeling with token-wise flow matching. By leveraging the autoregressive nature of language models and the generative efficacy of flow matching, FELLE effectively predicts continuous-valued tokens (mel-spectrograms). For each continuous-valued token, FELLE modifies the general prior distribution in flow matching by incorporating information from the previous step, improving coherence and stability. Furthermore, to enhance synthesis quality, FELLE introduces a coarse-to-fine flow-matching mechanism, generating continuous-valued tokens hierarchically, conditioned on the language model's output. Experimental results demonstrate the potential of incorporating flow-matching techniques in autoregressive mel-spectrogram modeling, leading to significant improvements in TTS generation quality, as shown in https://aka.ms/felle.
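The ideas in the abstract can be made concrete with a toy sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the Gaussian prior centred on the previous frame, the straight-line (linear interpolation) flow-matching path, average pooling as the coarse representation, the residual formulation of the fine stage, and all function names. An oracle velocity field stands in for the trained networks, which in FELLE would be conditioned on the language model's output.

```python
import numpy as np

def sample_prior(prev_frame, sigma, rng):
    """Draw x0 from a Gaussian centred on the previous frame (assumed prior form)."""
    return prev_frame + sigma * rng.standard_normal(prev_frame.shape)

def flow_matching_pair(x0, x1, t):
    """Point x_t on the straight path from x0 to x1, and its target velocity."""
    x_t = (1.0 - t) * x0 + t * x1
    velocity = x1 - x0
    return x_t, velocity

def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x

def downsample(frame, factor=2):
    """Average-pool adjacent mel bins to get a coarse frame (assumed coarse form)."""
    return frame.reshape(-1, factor).mean(axis=1)

def upsample(coarse, factor=2):
    """Repeat each coarse bin to return to the fine resolution."""
    return np.repeat(coarse, factor)

def oracle_velocity(x1):
    """Ideal velocity for a known target x1; a stand-in for a trained network."""
    return lambda x, t: (x1 - x) / (1.0 - t)

rng = np.random.default_rng(0)
prev_frame = rng.standard_normal(8)                  # previous mel frame (toy, 8 bins)
target = prev_frame + 0.1 * rng.standard_normal(8)   # next frame to be generated

# Coarse stage: generate the pooled frame, starting near the pooled previous frame.
coarse_x0 = sample_prior(downsample(prev_frame), 0.5, rng)
coarse = euler_sample(oracle_velocity(downsample(target)), coarse_x0, 4)

# Fine stage: generate the residual on top of the upsampled coarse frame.
prev_residual = prev_frame - upsample(downsample(prev_frame))
residual_x0 = sample_prior(prev_residual, 0.5, rng)
residual = euler_sample(oracle_velocity(target - upsample(coarse)), residual_x0, 4)
fine = upsample(coarse) + residual
```

During training, `flow_matching_pair` would supply the regression target for a velocity network at a randomly drawn `t`. At inference time the oracle would be replaced by that network, so the exact recovery of `target` seen here is an artifact of the toy setup.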