A High-Fidelity Speech Super Resolution Network using a Complex Global Attention Module with Spectro-Temporal Loss
This work addresses speech quality enhancement for audio processing applications, representing an incremental improvement by integrating complex domain modeling with attention mechanisms.
The paper tackles speech super-resolution by reconstructing both magnitude and phase in the complex domain, achieving state-of-the-art performance on the VCTK dataset, especially for extreme upsampling from 2 kHz to 48 kHz with reduced noisy artifacts.
Speech super-resolution (SSR) enhances low-resolution speech by increasing the sampling rate. While most SSR methods focus on magnitude reconstruction, recent research highlights the importance of phase reconstruction for improved perceptual quality. Therefore, we introduce CTFT-Net, a Complex Time-Frequency Transformation Network that reconstructs both magnitude and phase in complex domains for improved SSR tasks. It incorporates a complex global attention block to model inter-phoneme and inter-frequency dependencies and a complex conformer to capture long-range and local features, improving frequency reconstruction and noise robustness. CTFT-Net employs time-domain and multi-resolution frequency-domain loss functions for better generalization. Experiments show CTFT-Net outperforms state-of-the-art models (NU-Wave, WSRGlow, NVSR, AERO) on the VCTK dataset, particularly for extreme upsampling (2 kHz to 48 kHz), reconstructing high frequencies effectively without noisy artifacts.