GAP-URGENet: A Generative-Predictive Fusion Framework for Universal Speech Enhancement
This work addresses universal speech enhancement for audio processing applications; it is an incremental improvement obtained by fusing existing generative and predictive methods.
The authors tackle universal speech enhancement by proposing GAP-URGENet, a generative-predictive fusion framework that integrates self-supervised restoration with spectrogram-domain enhancement. The system achieved top performance in the blind-test phase and ranked 1st in the objective evaluation of Track 1 of the ICASSP 2026 URGENT Challenge.
We introduce GAP-URGENet, a generative-predictive fusion framework developed for Track 1 of the ICASSP 2026 URGENT Challenge. The system combines two branches: a generative branch, which performs full-stack speech restoration in a self-supervised representation domain and reconstructs the waveform with a neural vocoder, and a predictive branch, which performs spectrogram-domain enhancement and provides complementary cues. A post-processing module fuses the outputs of both branches and also performs bandwidth extension, generating the enhanced waveform at 48 kHz, which is then downsampled to the original sampling rate. This generative-predictive fusion improves robustness and perceptual quality, achieving top performance in the blind-test phase and ranking 1st in the objective evaluation. Audio examples are available at https://xiaobin-rong.github.io/gap-urgenet_demo.
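The data flow described in the abstract (two parallel branches, fusion with bandwidth extension at 48 kHz, then downsampling to the input rate) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names `generative_branch`, `predictive_branch`, `fuse_and_extend`, and `enhance` are hypothetical, and the branch bodies are identity placeholders standing in for the actual models. Fusion is replaced by a plain average, and bandwidth extension by linear-interpolation upsampling, which changes the sampling rate but does not synthesize new high-frequency content as a real bandwidth-extension module would.

```python
from typing import Callable, List

Waveform = List[float]
Branch = Callable[[Waveform, int], Waveform]

TARGET_RATE = 48_000  # fused output rate stated in the abstract


def resample_linear(x: Waveform, src_rate: int, dst_rate: int) -> Waveform:
    """Crude linear-interpolation resampler (placeholder, no anti-aliasing)."""
    if src_rate == dst_rate or not x:
        return list(x)
    n_out = max(1, round(len(x) * dst_rate / src_rate))
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate      # fractional index into x
        lo = min(int(pos), len(x) - 1)
        hi = min(lo + 1, len(x) - 1)
        frac = pos - lo
        out.append((1.0 - frac) * x[lo] + frac * x[hi])
    return out


def generative_branch(noisy: Waveform, rate: int) -> Waveform:
    """Placeholder for SSL-domain restoration followed by a neural vocoder."""
    return list(noisy)  # assumption: identity stand-in for the real model


def predictive_branch(noisy: Waveform, rate: int) -> Waveform:
    """Placeholder for spectrogram-domain enhancement."""
    return list(noisy)  # assumption: identity stand-in for the real model


def fuse_and_extend(gen: Waveform, pred: Waveform, rate: int) -> Waveform:
    """Placeholder post-processing: average both branches, then go to 48 kHz."""
    fused = [0.5 * (g + p) for g, p in zip(gen, pred)]
    return resample_linear(fused, rate, TARGET_RATE)


def enhance(noisy: Waveform, rate: int,
            gen: Branch = generative_branch,
            pred: Branch = predictive_branch) -> Waveform:
    """Run both branches, fuse at 48 kHz, return audio at the input rate."""
    wideband = fuse_and_extend(gen(noisy, rate), pred(noisy, rate), rate)
    return resample_linear(wideband, TARGET_RATE, rate)


if __name__ == "__main__":
    noisy = [0.0] * 160                    # 10 ms of silence at 16 kHz
    print(len(enhance(noisy, 16_000)))     # same length as the input
```

Passing the branches as arguments keeps the skeleton independent of any particular model: either placeholder can be swapped for a real enhancer with the same `(waveform, rate)` signature.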