AS Feb 3, 2022
A deep complex multi-frame filtering network for stereophonic acoustic echo cancellation
Linjuan Cheng, Chengshi Zheng, Andong Li et al.
In hands-free communication systems, the coupling between loudspeaker and microphone generates an echo signal that can severely degrade the quality of communication. Meanwhile, various types of noise in communication environments further reduce speech quality and intelligibility. It is difficult to extract the near-end signal from the microphone signal in a single step, especially in low signal-to-noise ratio scenarios. In this paper, we propose a deep complex network approach to address this issue. Specifically, we decompose stereophonic acoustic echo cancellation into two stages, a linear stereophonic acoustic echo cancellation module and a residual echo suppression module, both of which are based on deep learning architectures. A multi-frame filtering strategy is introduced to improve the estimation of the linear echo by capturing more inter-frame information. Moreover, we decouple the complex spectral mapping into magnitude estimation and complex spectrum refinement. Experimental results demonstrate that our proposed approach achieves state-of-the-art performance, outperforming previous advanced algorithms under various conditions.
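The multi-frame filtering idea can be illustrated with a minimal sketch. It assumes the linear echo in one time-frequency bin is modeled as a weighted sum of the current and previous far-end frames, with one complex tap per frame and per loudspeaker channel. The function names and the hand-set taps are hypothetical; in the described system the coefficients would come from the network, and the abstract does not give the exact formulation.

```python
def multi_frame_echo(ref_frames, taps):
    """Echo estimate for one time-frequency bin.

    ref_frames: complex far-end STFT values at frames t, t-1, ..., t-L+1
    taps:       complex filter coefficients, one per frame (hypothetical
                network output; hand-set in this sketch)
    """
    if len(ref_frames) != len(taps):
        raise ValueError("need one tap per reference frame")
    return sum(w * x for w, x in zip(taps, ref_frames))


def cancel_stereo_echo(mic_bin, left, right):
    """Subtract the echo contributed by both loudspeaker channels.

    left / right: (ref_frames, taps) pairs, one per far-end channel.
    """
    echo = multi_frame_echo(*left) + multi_frame_echo(*right)
    return mic_bin - echo


# Toy usage: build a microphone bin from a known near-end value plus echo,
# then check that the same taps remove the echo again.
near_end = 0.5 + 0.2j
left = ([1 + 1j, 0.5 - 0.5j, 0.2j], [0.3 + 0.1j, 0.1j, 0.05 + 0j])
right = ([0.4 - 0.2j, -0.3 + 0.1j, 0.1 + 0.1j], [0.2 - 0.1j, 0.05 + 0j, 0.02j])
mic = near_end + multi_frame_echo(*left) + multi_frame_echo(*right)
recovered = cancel_stereo_echo(mic, left, right)
```

With perfect taps the output equals the near-end value up to rounding. In practice the taps are imperfect, which is why a residual echo suppression stage follows the linear stage.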
SD Nov 3, 2020
Two Heads Are Better Than One: A Two-Stage Approach for Monaural Noise Reduction in the Complex Domain
Andong Li, Chengshi Zheng, Renhua Peng et al.
In low signal-to-noise ratio conditions, it is difficult to effectively recover the magnitude and phase information simultaneously. To address this problem, this paper proposes a two-stage algorithm that decouples the joint optimization of magnitude and phase into two sub-tasks. In the first stage, only the magnitude is optimized, and it is combined with the noisy phase to obtain a coarse estimate of the complex clean speech spectrum. In the second stage, both the magnitude and phase components are refined. The experiments are conducted on the WSJ0-SI84 corpus, and the results show that the proposed approach significantly outperforms previous baselines in terms of PESQ, ESTOI, and SDR.
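The two-stage decoupling can be sketched for a single time-frequency bin. This is an illustration under assumptions, not the paper's implementation: the function names are hypothetical, and the second stage is modeled here as an additive complex correction, which the abstract does not specify.

```python
import cmath


def coarse_estimate(est_magnitude, noisy_bin):
    """Stage 1: pair the estimated magnitude with the noisy phase."""
    return cmath.rect(est_magnitude, cmath.phase(noisy_bin))


def refine(coarse_bin, correction):
    """Stage 2 (assumed form): add a complex correction so that both
    magnitude and phase can change."""
    return coarse_bin + correction


# Toy usage with an oracle magnitude and an oracle correction.
clean = cmath.rect(1.0, 0.3)
noisy = cmath.rect(1.4, 0.8)
coarse = coarse_estimate(abs(clean), noisy)   # right magnitude, noisy phase
refined = refine(coarse, clean - coarse)      # correction repairs the phase
```

After the first stage the magnitude is correct but the phase is still that of the noisy input; only the second stage can move the estimate onto the clean value.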
SD Jun 13, 2020
Dynamic Attention Based Generative Adversarial Network with Phase Post-Processing for Speech Enhancement
Andong Li, Chengshi Zheng, Renhua Peng et al.
Generative adversarial networks (GANs) have recently facilitated the development of speech enhancement. Nevertheless, their performance advantage is still limited compared with state-of-the-art models. In this paper, we propose a powerful Dynamic Attention Recursive GAN called DARGAN for noise reduction in the time-frequency domain. Unlike previous work, we introduce several innovations. First, recursive learning, an iterative training protocol consisting of multiple steps, is used in the generator. By reusing the network in each step, the noise components are progressively reduced in a step-wise manner. Second, a dynamic attention mechanism is deployed, which helps to re-adjust the feature distribution in the noise reduction module. Third, we exploit the deep Griffin-Lim algorithm as the module for phase post-processing, which further improves speech quality. Experimental results on the Voice Bank corpus show that the proposed GAN outperforms previous GAN- and non-GAN-based models.
SD May 12, 2020
The IOA System for Deep Noise Suppression Challenge using a Framework Combining Dynamic Attention and Recursive Learning
Andong Li, Chengshi Zheng, Renhua Peng et al.
This technical report describes our system submitted to the Deep Noise Suppression Challenge and presents the results for the non-real-time track. To refine the estimation results stage by stage, we utilize recursive learning, a type of training protocol that aggregates information through multiple stages with a memory mechanism. The attention generator network is designed to dynamically control the feature distribution of the noise reduction network. To improve the phase recovery accuracy, we adopt complex spectral mapping, decoding both the real and imaginary spectra. For the final blind test set, the average MOS improvements of the submitted system in the noreverb, reverb, and realrec categories are 0.49, 0.24, and 0.36, respectively.
SD Mar 29, 2020
A Recursive Network with Dynamic Attention for Monaural Speech Enhancement
Andong Li, Chengshi Zheng, Cunhang Fan et al.
A person tends to generate dynamic attention towards speech in complicated environments. Based on this phenomenon, we propose a framework combining dynamic attention and recursive learning for monaural speech enhancement. Apart from a major noise reduction network, we design a separate sub-network, which adaptively generates the attention distribution to control the information flow throughout the major network. To effectively decrease the number of trainable parameters, recursive learning is introduced, which means that the network is reused for multiple stages, where the intermediate outputs of the stages are correlated through a memory mechanism. As a result, a more flexible and better estimation can be obtained. We conduct experiments on the TIMIT corpus. Experimental results show that the proposed architecture obtains consistently better performance than recent state-of-the-art models in terms of both PESQ and STOI scores.
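Recursive learning with a memory mechanism can be sketched as a loop that calls one shared stage function several times while passing a memory state forward. Everything here is a hypothetical stand-in: `toy_stage` replaces the trained network with simple arithmetic, and the names and update rules are assumptions made only to show the control flow and the parameter reuse.

```python
def recursive_enhance(noisy, stage_fn, num_stages=3):
    """Run the same stage function num_stages times.

    Because stage_fn is shared, adding stages adds no new parameters.
    """
    estimate = list(noisy)
    memory = [0.0] * len(noisy)
    for _ in range(num_stages):
        estimate, memory = stage_fn(noisy, estimate, memory)
    return estimate


def toy_stage(noisy, estimate, memory):
    """Hypothetical stand-in for the network: the memory tracks a running
    average of past estimates, and the new estimate is reduced by a
    quarter of that memory. The noisy input is accepted but unused."""
    new_memory = [0.5 * (m + e) for m, e in zip(memory, estimate)]
    new_estimate = [e - 0.25 * m for e, m in zip(estimate, new_memory)]
    return new_estimate, new_memory


enhanced = recursive_enhance([8.0, 4.0], toy_stage, num_stages=3)
```

Each stage sees the previous estimate together with the memory, so later stages refine earlier ones rather than starting again from the noisy input.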
SD Mar 22, 2020
A Time-domain Monaural Speech Enhancement with Feedback Learning
Andong Li, Chengshi Zheng, Linjuan Cheng et al.
In this paper, we propose a neural network with feedback learning in the time domain, called FTNet, for monaural speech enhancement. The proposed network consists of three principal components. The first is a stage recurrent neural network, which is introduced to effectively aggregate the deep feature dependencies across different stages with a memory mechanism and to remove the interference stage by stage. The second is a convolutional auto-encoder. The third consists of a series of concatenated gated linear units, which facilitate the information flow and gradually increase the receptive field. Feedback learning is adopted to improve parameter efficiency, so the number of trainable parameters is effectively reduced without sacrificing performance. Numerous experiments are conducted on the TIMIT corpus, and the results demonstrate that the proposed network consistently achieves better PESQ and STOI scores than two state-of-the-art time-domain baselines under different conditions.
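The gating performed by a gated linear unit can be shown with a minimal sketch in its standard form, where one branch is multiplied element-wise by the sigmoid of a second branch. The function names are hypothetical, the two input lists stand in for the outputs of two learned projections, and the abstract does not state which GLU variant FTNet uses.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_linear_unit(values, gates):
    """Standard GLU: scale each value by the sigmoid of its gate."""
    if len(values) != len(gates):
        raise ValueError("values and gates must have the same length")
    return [v * sigmoid(g) for v, g in zip(values, gates)]


# Toy usage: a large positive gate passes the value, a zero gate halves
# it, and a large negative gate blocks it almost completely.
gated = gated_linear_unit([2.0, 2.0, 2.0], [30.0, 0.0, -30.0])
```

The gate branch therefore decides, per element, how much of the value branch flows on to the next layer.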