ASAI
Jun 20, 2025

RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching

arXiv:2506.16741v1
1 citation
h-index: 8
INTERSPEECH
Originality: Incremental advance
AI Analysis

The paper addresses slow inference in TTS for applications that require real-time synthesis. It is an incremental improvement over existing flow matching techniques.

The paper tackles the trade-off between quality and inference speed in text-to-speech (TTS) generation by introducing RapFlow-TTS, which applies velocity consistency constraints during flow matching training to reduce the number of synthesis steps. It achieves high-fidelity speech with 5-fold and 10-fold fewer synthesis steps than conventional flow matching-based and score-based approaches, respectively.

We introduce RapFlow-TTS, a rapid and high-fidelity TTS acoustic model that leverages velocity consistency constraints in flow matching (FM) training. Although ordinary differential equation (ODE)-based TTS generation achieves natural-quality speech, it typically requires a large number of generation steps, resulting in a trade-off between quality and inference speed. To address this challenge, RapFlow-TTS enforces consistency in the velocity field along the FM-straightened ODE trajectory, enabling consistent synthetic quality with fewer generation steps. Additionally, we introduce techniques such as time interval scheduling and adversarial learning to further enhance the quality of the few-step synthesis. Experimental results show that RapFlow-TTS achieves high-fidelity speech synthesis with 5- and 10-fold fewer synthesis steps than the conventional FM- and score-based approaches, respectively.
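The link between a consistent velocity field and few-step synthesis can be illustrated with a toy numerical sketch. This is not the RapFlow-TTS implementation: the function names (`euler_sample`, `straight_velocity`, `curved_velocity`, `endpoint_estimate`) and both toy fields are assumptions made for illustration, and the learned acoustic model is replaced by closed-form velocity functions. The sketch shows that when the velocity is the same everywhere along a trajectory, a coarse Euler solver lands on the same endpoint as a fine one, whereas a field whose velocity changes along the path does not.

```python
import numpy as np


def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(t, x) from t=0 to t=1 with fixed-step Euler."""
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity(t, x)
    return x


def straight_velocity(x0, x1):
    """Toy field (assumption): constant velocity of the straight path
    x_t = (1 - t) * x0 + t * x1, i.e. v = x1 - x0 at every point."""
    v = np.asarray(x1, dtype=float) - np.asarray(x0, dtype=float)
    return lambda t, x: v


def curved_velocity(x1):
    """Toy field (assumption): velocity depends on the current state,
    so it changes along the trajectory."""
    x1 = np.asarray(x1, dtype=float)
    return lambda t, x: 3.0 * (x1 - x)


def endpoint_estimate(velocity, t, x):
    """One-jump estimate of the t=1 endpoint from the point (t, x).
    If the velocity is consistent along a trajectory, this estimate is
    the same from every point on that trajectory."""
    return x + (1.0 - t) * velocity(t, x)


noise = np.array([0.0, 0.0])    # stand-in for the starting noise sample
target = np.array([1.0, -2.0])  # stand-in for the target acoustic feature

straight = straight_velocity(noise, target)
curved = curved_velocity(target)

# Gap between a 2-step and a 100-step solve for each field.
gap_straight = np.abs(euler_sample(straight, noise, 2)
                      - euler_sample(straight, noise, 100)).max()
gap_curved = np.abs(euler_sample(curved, noise, 2)
                    - euler_sample(curved, noise, 100)).max()
print("straight field gap:", gap_straight)
print("curved field gap:  ", gap_curved)
```

In this toy setting the straight field gives an essentially zero gap between the 2-step and 100-step solves, while the curved field gives a clearly nonzero gap. A consistency constraint in training can be read as pushing a learned field toward the first behaviour, so that `endpoint_estimate` agrees across time points on the same trajectory.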

Code Implementations: 1 repo
