Atss-Net: Target Speaker Separation via Attention-based Neural Network
This work addresses target speaker separation, the task of isolating one speaker's voice from mixed audio, and reports an incremental improvement over existing CNN-LSTM methods.
The paper tackles target speaker separation by proposing Atss-Net, an attention-based neural network in the spectrogram domain, which outperforms VoiceFilter with half the parameters and shows promise in speech enhancement.
Recently, models based on convolutional neural networks (CNNs) and long short-term memory (LSTM) have been introduced for deep learning-based target speaker separation. In this paper, we propose an attention-based neural network (Atss-Net) in the spectrogram domain for this task. Compared with the CNN-LSTM architecture, it allows the network to compute the correlation between all features in parallel and to extract more features with shallower layers. Experimental results show that Atss-Net outperforms VoiceFilter, although it contains only half as many parameters. Furthermore, the proposed model also demonstrates promising performance in speech enhancement.
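
To illustrate how attention can relate every spectrogram frame to every other frame in one parallel step, the following is a minimal sketch of generic scaled dot-product self-attention in NumPy. It is not the Atss-Net architecture itself: the function names, the random projection matrices `w_q`, `w_k`, `w_v`, and the sizes (50 frames, 257 frequency bins, 64 attention dimensions) are all assumptions chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def self_attention(frames, w_q, w_k, w_v):
    """Generic scaled dot-product self-attention over spectrogram frames.

    frames: (T, F) array, one row per time frame (hypothetical input layout).
    w_q, w_k, w_v: (F, D) projection matrices (random here, learned in practice).
    Returns the attended features (T, D) and the attention weights (T, T).
    """
    q = frames @ w_q
    k = frames @ w_k
    v = frames @ w_v
    # One matrix product scores every frame against every other frame,
    # with no recurrence over time.
    scores = (q @ k.T) / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


# Illustrative sizes only; they are not taken from the paper.
rng = np.random.default_rng(0)
num_frames, num_bins, attn_dim = 50, 257, 64
spectrogram = np.abs(rng.standard_normal((num_frames, num_bins)))
w_q, w_k, w_v = (0.05 * rng.standard_normal((num_bins, attn_dim)) for _ in range(3))

attended, attn_weights = self_attention(spectrogram, w_q, w_k, w_v)
```

Each row of `attn_weights` sums to one and describes how strongly one frame attends to all others. An LSTM, by contrast, would have to step through the frames sequentially to propagate the same information.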