AS · LG · SD · Oct 25, 2019

A Multi-Phase Gammatone Filterbank for Speech Separation via TasNet

arXiv:1910.11615v2 · 64 citations
Originality: Incremental advance
AI Analysis

This work addresses speech separation. It is an incremental advance: it modifies an existing method, Conv-TasNet, by replacing the learned encoder with a deterministic filterbank.

The paper investigates whether the learned encoder in Conv-TasNet can be replaced with a deterministic hand-crafted filterbank. It finds that a multi-phase gammatone filterbank improves the scale-invariant source-to-noise ratio (SI-SNR) by 0.7 dB, and that the number of filters can be reduced from 512 to 128 without loss of performance.
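For readers unfamiliar with the metric, the following is a minimal sketch of SI-SNR as it is commonly defined for time-domain separation: both signals are made zero-mean, the estimate is projected onto the target, and the energy of the projection is compared with the energy of the residual. The function name `si_snr` and the stabilizer `eps` are illustrative choices, not taken from the paper.

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant source-to-noise ratio in dB (illustrative sketch)."""
    # Remove the mean so that a constant offset does not affect the score.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to get the "clean" component.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    s_target = scale * target
    # Everything that is not explained by the target counts as noise.
    e_noise = estimate - s_target
    ratio = (np.dot(s_target, s_target) + eps) / (np.dot(e_noise, e_noise) + eps)
    return 10.0 * np.log10(ratio)
```

Because of the projection step, rescaling the estimate leaves the score essentially unchanged, which is what "scale-invariant" refers to.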

In this work, we investigate if the learned encoder of the end-to-end convolutional time domain audio separation network (Conv-TasNet) is the key to its recent success, or if the encoder can just as well be replaced by a deterministic hand-crafted filterbank. Motivated by the resemblance of the trained encoder of Conv-TasNet to auditory filterbanks, we propose to employ a deterministic gammatone filterbank. In contrast to a common gammatone filterbank, our filters are restricted to 2 ms length to allow for low-latency processing. Inspired by the encoder learned by Conv-TasNet, in addition to the logarithmically spaced filters, the proposed filterbank holds multiple gammatone filters at the same center frequency with varying phase shifts. We show that replacing the learned encoder with our proposed multi-phase gammatone filterbank (MP-GTF) even leads to a scale-invariant source-to-noise ratio (SI-SNR) improvement of 0.7 dB. Furthermore, in contrast to using the learned encoder we show that the number of filters can be reduced from 512 to 128 without loss of performance.
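The filterbank described in the abstract can be sketched roughly as follows. This is not the authors' implementation. The abstract only fixes the 2 ms filter length, the logarithmic spacing, the use of several phase shifts per center frequency, and the total of 128 filters. Everything else here is an assumption made for illustration: the 8 kHz sampling rate, the filter order, the ERB-based bandwidth (Glasberg and Moore approximation), the frequency range, the 16 x 8 split of center frequencies and phases, the evenly spaced phases, and the unit-norm scaling.

```python
import numpy as np

def multi_phase_gammatone(n_center=16, n_phases=8, fs=8000, length_ms=2.0,
                          order=2, f_min=100.0, f_max=3900.0):
    """Build a (n_center * n_phases, n_taps) bank of short gammatone filters.

    All default values are assumptions for illustration, not the paper's settings.
    """
    n_taps = int(round(fs * length_ms / 1000.0))       # 2 ms at 8 kHz -> 16 taps
    t = np.arange(1, n_taps + 1) / fs                  # time axis in seconds
    centers = np.geomspace(f_min, f_max, n_center)     # log-spaced center frequencies
    phases = np.linspace(0.0, 2.0 * np.pi, n_phases, endpoint=False)
    # Assumed bandwidth: 1.019 * ERB, with ERB(f) = 24.7 + f / 9.265.
    b = 1.019 * (24.7 + centers / 9.265)
    # Gammatone envelope t^(n-1) * exp(-2*pi*b*t), one row per center frequency.
    envelope = t[None, :] ** (order - 1) * np.exp(-2.0 * np.pi * b[:, None] * t[None, :])
    # Cosine carrier with one phase shift per copy of each center frequency.
    carrier = np.cos(2.0 * np.pi * centers[:, None, None] * t[None, None, :]
                     + phases[None, :, None])
    filters = (envelope[:, None, :] * carrier).reshape(n_center * n_phases, n_taps)
    # Normalize each filter to unit energy (an assumed convention).
    filters /= np.linalg.norm(filters, axis=1, keepdims=True)
    return filters

filters = multi_phase_gammatone()
print(filters.shape)
```

With these defaults, `filters @ frame` maps one 16-sample frame to a 128-dimensional encoding. How frames overlap and which nonlinearity follows the filterbank is left open in this sketch.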

Code Implementations: 1 repo