SD · CL · AS · Mar 20, 2025

WaveFM: A High-Fidelity and Efficient Vocoder Based on Flow Matching

arXiv:2503.16689v1 · 16 citations · h-index: 14 · Has Code · NAACL
Originality: Incremental advance
AI Analysis

This work addresses audio quality and speed issues in speech synthesis for applications like text-to-speech systems, though it is incremental as it builds on existing flow matching and diffusion vocoder methods.

The paper tackles the problem of subpar audio quality in flow matching-based neural vocoders by introducing WaveFM, which uses a mel-conditioned prior distribution and auxiliary losses to improve sample quality and a tailored consistency distillation method to speed up inference. The model achieves superior performance in quality and efficiency compared to previous diffusion vocoders, enabling waveform generation in a single step.

Flow matching offers a robust and stable approach to training diffusion models. However, directly applying flow matching to neural vocoders can result in subpar audio quality. In this work, we present WaveFM, a reparameterized flow matching model for mel-spectrogram conditioned speech synthesis, designed to enhance both sample quality and generation speed for diffusion vocoders. Since mel-spectrograms represent the energy distribution of waveforms, WaveFM adopts a mel-conditioned prior distribution instead of a standard Gaussian prior to minimize unnecessary transportation costs during synthesis. Moreover, while most diffusion vocoders rely on a single loss function, we argue that incorporating auxiliary losses, including a refined multi-resolution STFT loss, can further improve audio quality. To speed up inference without significantly degrading sample quality, we introduce a tailored consistency distillation method for WaveFM. Experimental results demonstrate that our model achieves superior performance in both quality and efficiency compared to previous diffusion vocoders, while enabling waveform generation in a single inference step.
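To make the idea of a mel-conditioned prior concrete, here is a minimal sketch in plain Python. It is not the WaveFM implementation: the abstract does not give the exact prior, parameterization, or losses. The sketch assumes a zero-mean Gaussian prior whose per-sample scale follows the mean mel energy of each frame, a linear interpolation path between prior sample and waveform, and a constant-velocity regression target, as in generic flow matching. All function names (`mel_conditioned_prior`, `flow_matching_pair`, `euler_step`) are hypothetical.

```python
import math
import random


def mel_conditioned_prior(log_mel, hop_length, rng):
    """Sample x0 from a zero-mean Gaussian whose scale tracks mel energy.

    log_mel: list of frames, each a list of log-mel bin values.
    Assumption (not from the paper): sigma per frame is the square root of
    the mean linear mel energy, held constant over hop_length samples.
    """
    x0, sigma = [], []
    for frame in log_mel:
        energy = sum(math.exp(v) for v in frame) / len(frame)
        scale = math.sqrt(energy)
        for _ in range(hop_length):
            sigma.append(scale)
            x0.append(scale * rng.gauss(0.0, 1.0))
    return x0, sigma


def flow_matching_pair(x0, x1, t):
    """Linear path x_t = (1 - t) * x0 + t * x1 with target velocity x1 - x0."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def euler_step(x, velocity, dt):
    """One explicit Euler step along a velocity field."""
    return [a + dt * v for a, v in zip(x, velocity)]


rng = random.Random(0)
hop_length = 4
# Toy stand-ins: 3 frames of 8 log-mel bins, and a 12-sample "waveform".
log_mel = [[rng.uniform(-4.0, 0.0) for _ in range(8)] for _ in range(3)]
x1 = [math.sin(0.5 * n) for n in range(3 * hop_length)]

x0, sigma = mel_conditioned_prior(log_mel, hop_length, rng)
x_t, velocity = flow_matching_pair(x0, x1, t=0.3)

# With the exact target velocity, a single Euler step of size 1 maps x0 to x1.
x1_hat = euler_step(x0, velocity, 1.0)
```

In training, a network would regress `velocity` from `x_t`, `t`, and the mel-spectrogram. The last line only shows why a well-learned straight path allows few-step sampling. It does not model the consistency distillation that the abstract describes.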

Code Implementations: 1 repo