AS · AI · SD · SP · Oct 1, 2025

UniverSR: Unified and Versatile Audio Super-Resolution via Vocoder-Free Flow Matching

arXiv:2510.00771v1 · 2 citations · h-index: 4
Originality: Highly original
AI Analysis

This paper addresses the vocoder-dependence bottleneck in audio super-resolution for speech and general audio, and it presents a new method rather than an incremental improvement.

The paper tackles audio super-resolution by introducing a vocoder-free framework that directly reconstructs waveforms using flow matching, eliminating the need for a separate neural vocoder. It achieves state-of-the-art performance on speech and general audio datasets, producing high-fidelity 48 kHz audio across diverse upsampling factors.

In this paper, we present a vocoder-free framework for audio super-resolution that employs a flow matching generative model to capture the conditional distribution of complex-valued spectral coefficients. Unlike conventional two-stage diffusion-based approaches that predict a mel-spectrogram and then rely on a pre-trained neural vocoder to synthesize waveforms, our method directly reconstructs waveforms via the inverse Short-Time Fourier Transform (iSTFT), thereby eliminating the dependence on a separate vocoder. This design not only simplifies end-to-end optimization but also overcomes a critical bottleneck of two-stage pipelines, where the final audio quality is fundamentally constrained by vocoder performance. Experiments show that our model consistently produces high-fidelity 48 kHz audio across diverse upsampling factors, achieving state-of-the-art performance on both speech and general audio datasets.
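The pipeline described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a linear (rectified-flow style) probability path, an Euler ODE sampler, a real/imaginary two-channel layout for the complex coefficients, and illustrative STFT sizes, none of which are stated in the text above. The names `cfm_pair`, `cfm_loss`, `euler_sample`, and `velocity_fn` are hypothetical; `velocity_fn` stands in for the trained network, and `cond` for the low-resolution conditioning input.

```python
import numpy as np

N_FFT, HOP = 256, 128  # illustrative sizes (assumption, not from the paper)


def _window():
    # Periodic Hann window: copies shifted by HOP = N_FFT / 2 sum to exactly 1.
    return np.hanning(N_FFT + 1)[:-1]


def stft(x):
    """Complex STFT of a 1-D signal, using reflect-padded (centred) frames.

    Assumes len(x) is a multiple of HOP so the frames cover the padded signal.
    """
    w = _window()
    xp = np.pad(x, N_FFT // 2, mode="reflect")
    n_frames = 1 + (len(xp) - N_FFT) // HOP
    frames = np.stack([xp[i * HOP:i * HOP + N_FFT] * w for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)  # shape: (frames, N_FFT // 2 + 1)


def istft(Z, length):
    """Inverse STFT by overlap-add, normalised by the summed analysis window."""
    w = _window()
    frames = np.fft.irfft(Z, n=N_FFT, axis=1)
    out = np.zeros(HOP * (len(frames) - 1) + N_FFT)
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i * HOP:i * HOP + N_FFT] += frame
        norm[i * HOP:i * HOP + N_FFT] += w
    out /= np.maximum(norm, 1e-8)
    return out[N_FFT // 2:N_FFT // 2 + length]  # drop the reflect padding


def to_real(Z):
    """Stack real and imaginary parts as two channels (assumed layout)."""
    return np.stack([Z.real, Z.imag])


def to_complex(X):
    return X[0] + 1j * X[1]


def cfm_pair(x0, x1, t):
    """Point on the linear path from noise x0 to data x1, plus its velocity."""
    x_t = (1.0 - t) * x0 + t * x1
    return x_t, x1 - x0


def cfm_loss(velocity_fn, x0, x1, t, cond):
    """Mean squared error between predicted and target velocity."""
    x_t, target = cfm_pair(x0, x1, t)
    return np.mean((velocity_fn(x_t, t, cond) - target) ** 2)


def euler_sample(velocity_fn, x0, cond, n_steps=8):
    """Integrate dx/dt = v(x, t, cond) from t = 0 to t = 1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * velocity_fn(x, k * dt, cond)
    return x
```

In use, a sampled tensor would be converted back with `to_complex` and passed to `istft` to obtain the waveform directly, so no separate vocoder appears anywhere in the chain. With a trained network in place of `velocity_fn`, output quality would depend on the learned velocity field and the chosen number of solver steps.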

