SD AI AS · Apr 17, 2021

MIMO Self-attentive RNN Beamformer for Multi-speaker Speech Separation

Xiyun Li, Yong Xu, Meng Yu, Shi-Xiong Zhang, Jiaming Xu, Bo Xu, Dong Yu

arXiv:2104.08450v2 · 14.216 citations

Originality: Incremental advance

AI Analysis

This is an incremental improvement for speech processing applications, enhancing beamforming techniques in noisy environments.

The paper tackles multi-speaker speech separation by proposing a MIMO self-attentive RNN beamformer, which improves automatic speech recognition accuracy and perceptual speech quality over prior methods.

Recently, our recurrent neural network (RNN) based all deep learning minimum variance distortionless response (ADL-MVDR) beamformer outperformed the conventional MVDR beamformer by replacing the matrix inversion and eigenvalue decomposition with two recurrent neural networks. In this work, we present a self-attentive RNN beamformer that further improves our previous RNN-based beamformer by leveraging the modeling capability of self-attention. A temporal-spatial self-attention module is proposed to better learn the beamforming weights from the speech and noise spatial covariance matrices. The temporal self-attention module helps the RNN learn global statistics of the covariance matrices. The spatial self-attention module is designed to attend to the cross-channel correlation in the covariance matrices. Furthermore, a multi-channel input with multi-speaker directional features and multi-speaker speech separation outputs (MIMO) model is developed to improve inference efficiency. The evaluations demonstrate that the proposed MIMO self-attentive RNN beamformer improves both automatic speech recognition (ASR) accuracy and perceptual evaluation of speech quality (PESQ) scores over prior art.
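For context, the conventional MVDR step that the abstract says is replaced by recurrent networks can be sketched in a few lines of NumPy. This is only an illustration of the textbook formulation for a single frequency bin, not the authors' implementation: the function name `mvdr_weights`, the diagonal-loading parameter `diag_load`, and the synthetic covariance matrices are all assumptions made for this example. The steering vector is taken as the principal eigenvector of the speech covariance (the eigenvalue decomposition), and the noise covariance is inverted via a linear solve (the matrix inversion).

```python
import numpy as np

def mvdr_weights(phi_ss, phi_nn, diag_load=1e-6):
    """Textbook MVDR weights for one frequency bin (illustrative sketch).

    phi_ss, phi_nn: (C, C) Hermitian speech / noise spatial covariance matrices.
    diag_load: hypothetical regularization factor, not taken from the paper.
    """
    n_ch = phi_nn.shape[0]
    # Diagonal loading keeps the noise covariance well conditioned.
    phi_nn = phi_nn + diag_load * np.trace(phi_nn).real / n_ch * np.eye(n_ch)
    # Eigenvalue decomposition: the principal eigenvector of the speech
    # covariance serves as the steering vector estimate.
    _, eigvecs = np.linalg.eigh(phi_ss)
    v = eigvecs[:, -1]
    # Matrix inversion step, done as a linear solve: x = phi_nn^{-1} v.
    x = np.linalg.solve(phi_nn, v)
    # w = phi_nn^{-1} v / (v^H phi_nn^{-1} v), so that w^H v = 1.
    return x / (v.conj() @ x)

# Synthetic example data (assumed, for illustration only).
rng = np.random.default_rng(0)
n_ch = 4
a = rng.standard_normal(n_ch) + 1j * rng.standard_normal(n_ch)  # true steering vector
phi_ss = np.outer(a, a.conj())                                   # rank-1 speech covariance
noise = rng.standard_normal((n_ch, 200)) + 1j * rng.standard_normal((n_ch, 200))
phi_nn = noise @ noise.conj().T / 200                            # sample noise covariance

w = mvdr_weights(phi_ss, phi_nn)
# Distortionless check: gain toward the normalized steering direction.
gain = abs(w.conj() @ a) / np.linalg.norm(a)
print(gain)
```

Both the eigendecomposition and the solve are performed independently per frequency bin and can be numerically fragile when the covariance estimates are poor, which is the kind of step the abstract describes handing over to learned recurrent networks.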

