AS · LG · SD · SP · Sep 18, 2025

Real-Time Streaming Mel Vocoding with Generative Flow Matching

arXiv:2509.15085v1 · 1 citation · h-index: 14
Originality: Incremental advance
AI Analysis

This paper addresses low-latency speech synthesis for real-time applications. The advance is incremental, since it builds on prior work such as DiffPhase.

The paper tackles real-time Mel vocoding for speech synthesis with MelFlow, a streaming-capable generative vocoder with a total latency of 48 ms. It achieves better PESQ and SI-SDR scores than non-streaming baselines such as HiFi-GAN.

The task of Mel vocoding, i.e., the inversion of a Mel magnitude spectrogram to an audio waveform, is still a key component in many text-to-speech (TTS) systems today. Based on generative flow matching, our prior work on generative STFT phase retrieval (DiffPhase), and the pseudoinverse operator of the Mel filterbank, we develop MelFlow, a streaming-capable generative Mel vocoder for speech sampled at 16 kHz with an algorithmic latency of only 32 ms and a total latency of 48 ms. We show real-time streaming capability at this latency not only in theory, but in practice on a consumer laptop GPU. Furthermore, we show that our model achieves substantially better PESQ and SI-SDR values than well-established baselines for Mel vocoding that are not streaming-capable, including HiFi-GAN.
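
The abstract names the pseudoinverse operator of the Mel filterbank as one ingredient. The following is a minimal sketch of that single step only, not the authors' implementation: it maps a Mel magnitude spectrogram back to an approximate linear-frequency magnitude spectrogram. The FFT size (512), the number of Mel bands (80), the HTK-style Mel formula, and all function names are assumptions made for illustration; only the 16 kHz sampling rate comes from the abstract. Phase recovery and the flow-matching model are not covered.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style Mel scale (an assumption; the paper may use another variant).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr=16000, n_fft=512, n_mels=80):
    """Triangular Mel filterbank of shape (n_mels, n_fft // 2 + 1)."""
    freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    lower, center, upper = edges[:-2, None], edges[1:-1, None], edges[2:, None]
    rising = (freqs[None, :] - lower) / (center - lower)
    falling = (upper - freqs[None, :]) / (upper - center)
    return np.maximum(0.0, np.minimum(rising, falling))

def mel_to_linear_magnitude(mel_mag, fb):
    """Approximate linear-frequency magnitudes via the Moore-Penrose pseudoinverse."""
    fb_pinv = np.linalg.pinv(fb)               # shape (n_bins, n_mels)
    # The pseudoinverse can produce negative values; magnitudes must be >= 0.
    return np.maximum(0.0, fb_pinv @ mel_mag)

fb = mel_filterbank()
rng = np.random.default_rng(0)
linear_mag = np.abs(rng.standard_normal((257, 10)))  # synthetic stand-in for |STFT|
mel_mag = fb @ linear_mag                            # forward Mel projection
approx_mag = mel_to_linear_magnitude(mel_mag, fb)    # coarse magnitude estimate
```

Because the filterbank maps 257 frequency bins onto 80 bands, the projection discards information and the pseudoinverse returns only a coarse magnitude estimate, which is presumably why a generative model is used to refine the result and recover phase.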

