ASSD Sep 3, 2020

Dense CNN with Self-Attention for Time-Domain Speech Enhancement

arXiv:2009.01941v2, 165 citations
Originality: Highly original
AI Analysis

This work addresses time-domain speech enhancement and reports substantial gains over prior causal and non-causal methods.

The authors tackle speech enhancement in the time domain by proposing a dense convolutional network with self-attention and a novel loss function based on the magnitudes of the enhanced speech and the predicted noise. They report that the resulting model substantially outperforms state-of-the-art causal and non-causal approaches.

Speech enhancement in the time domain has become increasingly popular in recent years, due to its capability to jointly enhance both the magnitude and the phase of speech. In this work, we propose a dense convolutional network (DCN) with self-attention for speech enhancement in the time domain. DCN is an encoder-decoder architecture with skip connections. Each layer in the encoder and the decoder comprises a dense block and an attention module. Dense blocks and attention modules help in feature extraction using a combination of feature reuse, increased network depth, and maximum context aggregation. Furthermore, we reveal previously unknown problems with a loss based on the spectral magnitude of enhanced speech. To alleviate these problems, we propose a novel loss based on the magnitudes of the enhanced speech and the predicted noise. Even though the proposed loss is based on magnitudes only, a constraint imposed by noise prediction ensures that the loss enhances both magnitude and phase. Experimental results demonstrate that DCN trained with the proposed loss substantially outperforms other state-of-the-art approaches to causal and non-causal speech enhancement.
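
The abstract does not give the loss formula, so the following is only a minimal NumPy sketch of one plausible reading of it. It assumes that the predicted noise is the mixture minus the enhanced speech, that each term is a mean absolute error between STFT magnitudes, and that the two terms are weighted equally. The function names (stft_mag, magnitude_loss, speech_plus_noise_loss), the frame settings, and the random test signals are hypothetical and chosen for illustration. The sketch shows why a magnitude-only loss on speech cannot see a phase error, while adding the noise term can.

```python
import numpy as np

def stft_mag(x, frame_len=512, hop=256):
    """Magnitude of a Hann-windowed STFT; returns shape (frames, bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames * window, axis=1))

def magnitude_loss(target, estimate):
    """Mean absolute error between STFT magnitudes (phase is ignored)."""
    return np.mean(np.abs(stft_mag(target) - stft_mag(estimate)))

def speech_plus_noise_loss(mixture, clean, enhanced):
    """Equal-weight average of magnitude losses on speech and on noise."""
    noise = mixture - clean
    # Assumption: the predicted noise is whatever the estimate leaves behind.
    predicted_noise = mixture - enhanced
    return 0.5 * magnitude_loss(clean, enhanced) + 0.5 * magnitude_loss(noise, predicted_noise)

# Synthetic stand-ins for speech and noise (illustrative only, not real audio).
rng = np.random.default_rng(0)
clean = rng.standard_normal(4096)
noise = 0.5 * rng.standard_normal(4096)
mixture = clean + noise

# A sign-flipped estimate has the same magnitude spectrum but opposite phase.
flipped = -clean

loss_mag_flipped = magnitude_loss(clean, flipped)                     # blind to the flip
loss_joint_flipped = speech_plus_noise_loss(mixture, clean, flipped)  # penalises the flip
loss_joint_perfect = speech_plus_noise_loss(mixture, clean, clean)    # perfect estimate

print(loss_mag_flipped, loss_joint_flipped, loss_joint_perfect)
```

In this sketch the sign-flipped estimate scores essentially zero under the speech-only magnitude loss, because flipping the sign changes only the phase. Under the combined loss the same estimate is penalised, because the implied noise (mixture minus estimate) no longer matches the true noise in magnitude. This is consistent with the abstract's claim that the noise-prediction constraint makes a magnitude-based loss sensitive to phase, though the paper's exact formulation may differ.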

