Monaural Speech Enhancement with Complex Convolutional Block Attention Module and Joint Time Frequency Losses
This work offers an incremental improvement to existing speech enhancement models by strengthening their feature representations and training them with a mixed loss function.
This paper proposes a complex convolutional block attention module (CCBAM) and a joint time-frequency loss function to improve monaural speech enhancement. By integrating these components into existing deep complex U-Net and CRN architectures, the authors achieve superior performance in objective evaluations.
The deep complex U-Net and the convolutional recurrent network (CRN) achieve state-of-the-art performance for monaural speech enhancement. Both are encoder-decoder structures with skip connections, and both rely heavily on the representation power of their complex-valued convolutional layers. In this paper, we propose a complex convolutional block attention module (CCBAM) to boost the representation power of the complex-valued convolutional layers by constructing more informative features. The CCBAM is a lightweight and general module that can be easily integrated into any complex-valued convolutional layer. We integrate CCBAM with the deep complex U-Net and CRN to enhance their performance for speech enhancement. We further propose a mixed loss function to jointly optimize the complex models in both the time-frequency (TF) domain and the time domain. By integrating CCBAM and the mixed loss, we form a new end-to-end (E2E) complex speech enhancement framework. Ablation experiments and objective evaluations show the superior performance of the proposed approaches (https://github.com/modelscope/ClearerVoice-Studio).
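As a rough illustration of the mixed loss described in the abstract, the sketch below combines a TF-domain term with a time-domain term. Several choices here are assumptions rather than details taken from the abstract: the squared error on the complex STFT as the TF term, the negative SI-SNR as the time-domain term, the weight `alpha`, the frame sizes, and all function names. It uses only the Python standard library so that it stays self-contained; an actual training setup would presumably use a differentiable STFT inside a deep learning framework.

```python
import cmath
import math


def stft(signal, frame_len=8, hop=4):
    """Naive one-sided STFT with a Hann window (illustrative, not optimized)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        # Direct DFT over the non-negative frequency bins only.
        bins = [
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(frame_len // 2 + 1)
        ]
        frames.append(bins)
    return frames


def complex_mse(est_spec, tgt_spec):
    """Mean squared error over real and imaginary parts of every TF bin."""
    total, count = 0.0, 0
    for est_frame, tgt_frame in zip(est_spec, tgt_spec):
        for a, b in zip(est_frame, tgt_frame):
            total += abs(a - b) ** 2  # (real diff)^2 + (imag diff)^2
            count += 1
    return total / count


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB, computed on waveforms."""
    mean_e = sum(estimate) / len(estimate)
    mean_t = sum(target) / len(target)
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    dot = sum(a * b for a, b in zip(e, t))
    t_energy = sum(b * b for b in t) + eps
    proj = [dot / t_energy * b for b in t]       # part of e aligned with t
    noise = [a - p for a, p in zip(e, proj)]     # residual orthogonal to t
    p_energy = sum(p * p for p in proj)
    n_energy = sum(n * n for n in noise)
    return 10 * math.log10((p_energy + eps) / (n_energy + eps))


def mixed_loss(estimate, target, alpha=0.5):
    """Weighted sum of a TF-domain term and a time-domain term (assumed form)."""
    tf_term = complex_mse(stft(estimate), stft(target))
    time_term = -si_snr(estimate, target)  # higher SI-SNR means lower loss
    return alpha * tf_term + (1 - alpha) * time_term


if __name__ == "__main__":
    clean = [math.sin(2 * math.pi * 3 * n / 32) for n in range(32)]
    noisy = [c + 0.3 * math.cos(2 * math.pi * 7 * n / 32)
             for n, c in enumerate(clean)]
    print(mixed_loss(noisy, clean))
```

The two terms are complementary in this sketch: the TF term penalizes errors in both the real and imaginary parts of each bin, while the time-domain term scores the reconstructed waveform directly and is insensitive to overall gain.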