SD AI AS | Oct 2, 2025

Exploring Resolution-Wise Shared Attention in Hybrid Mamba-U-Nets for Improved Cross-Corpus Speech Enhancement

Nikolai Lund Kühne, Jesper Jensen, Jan Østergaard, Zheng-Hua Tan

arXiv:2510.01958v1 | 7.01 citations | h-index: 5

Originality: Highly original

AI Analysis

This work addresses cross-corpus generalization in speech enhancement. It is an incremental improvement relevant to applications such as hearing aids and communication systems.

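The authors propose RWSA-MambaUNet, a hybrid model that combines Mamba and multi-head attention in a U-Net structure and shares attention across layers at corresponding time and frequency resolutions. The best-performing model achieves state-of-the-art generalization on two out-of-domain test sets. The smallest model surpasses all baselines on DNS 2020 in PESQ, SSNR, and ESTOI, and on EARS-WHAM_v2 in SSNR, ESTOI, and SI-SDR, while using less than half the parameters and a fraction of the FLOPs.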

Recent advances in speech enhancement have shown that models combining Mamba and attention mechanisms yield superior cross-corpus generalization performance. At the same time, integrating Mamba in a U-Net structure has yielded state-of-the-art enhancement performance, while reducing both model size and computational complexity. Inspired by these insights, we propose RWSA-MambaUNet, a novel and efficient hybrid model combining Mamba and multi-head attention in a U-Net structure for improved cross-corpus performance. Resolution-wise shared attention (RWSA) refers to layerwise attention-sharing across corresponding time- and frequency resolutions. Our best-performing RWSA-MambaUNet model achieves state-of-the-art generalization performance on two out-of-domain test sets. Notably, our smallest model surpasses all baselines on the out-of-domain DNS 2020 test set in terms of PESQ, SSNR, and ESTOI, and on the out-of-domain EARS-WHAM_v2 test set in terms of SSNR, ESTOI, and SI-SDR, while using less than half the model parameters and a fraction of the FLOPs.
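The idea of sharing attention across corresponding resolutions can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a hypothetical U-Net layout with three resolution levels, a toy single-head attention written in pure Python, and made-up feature dimensions (`RESOLUTIONS`, `DIMS`, `TinySelfAttention`, and `build_layers` are all illustrative names). It only shows how layers at the same resolution can reuse one attention module, and how that reduces the parameter count.

```python
import math
import random


class TinySelfAttention:
    """Toy single-head self-attention over a list of vectors (illustrative only)."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # Query, key, and value projection matrices, each dim x dim.
        self.wq, self.wk, self.wv = (
            [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]
            for _ in range(3)
        )

    def num_params(self):
        return 3 * self.dim * self.dim

    @staticmethod
    def _project(x, w):
        # Matrix product x @ w for x of shape (T, dim) and w of shape (dim, dim).
        return [
            [sum(xi * w[i][j] for i, xi in enumerate(row)) for j in range(len(w[0]))]
            for row in x
        ]

    def __call__(self, x):
        q, k, v = (self._project(x, w) for w in (self.wq, self.wk, self.wv))
        out = []
        for qi in q:
            # Scaled dot-product scores followed by a numerically stable softmax.
            scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(self.dim) for kj in k]
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]
            total = sum(exps)
            out.append(
                [sum(e / total * vj[d] for e, vj in zip(exps, v)) for d in range(self.dim)]
            )
        return out


def build_layers(resolutions, dims, shared):
    """Create one attention module per layer.

    With shared=True, layers at the same resolution reuse a single module.
    """
    cache = {}
    layers = []
    for res in resolutions:
        if shared:
            if res not in cache:
                cache[res] = TinySelfAttention(dims[res])
            layers.append(cache[res])
        else:
            layers.append(TinySelfAttention(dims[res]))
    return layers


def count_unique_params(layers):
    # Count each distinct module once, even if several layers reference it.
    unique = {id(m): m for m in layers}
    return sum(m.num_params() for m in unique.values())


# Assumed layout: encoder levels 0 and 1, bottleneck 2, decoder levels 1 and 0.
RESOLUTIONS = [0, 1, 2, 1, 0]
# Assumed feature dimension per resolution level (hypothetical values).
DIMS = {0: 4, 1: 8, 2: 16}

shared_layers = build_layers(RESOLUTIONS, DIMS, shared=True)
independent_layers = build_layers(RESOLUTIONS, DIMS, shared=False)
```

In this toy layout, the encoder and decoder layers at the same level point to the same module, so `count_unique_params(shared_layers)` is smaller than `count_unique_params(independent_layers)`. The actual model may share attention differently; the abstract only states that sharing is layerwise across corresponding time and frequency resolutions.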
