AS, AI, SD, SP, SY | Jun 1, 2023

A Multi-dimensional Deep Structured State Space Approach to Speech Enhancement Using Small-footprint Models

Georgia Tech
arXiv:2306.00331v1 | 26 citations | h-index: 73
AI Analysis

This work addresses speech enhancement for applications that need efficient models. Its contribution is incremental: it modifies existing S4 layers to better capture spectral dependencies.

The paper tackles speech enhancement by proposing a multi-dimensional structured state space (S4) approach for building small-footprint models. The resulting model achieves competitive performance, with a PESQ score of 3.15, while being 78.6% smaller than a conventional U-net model.

We propose a multi-dimensional structured state space (S4) approach to speech enhancement. To better capture the spectral dependencies across the frequency axis, we focus on modifying the multi-dimensional S4 layer with whitening transformation to build new small-footprint models that also achieve good performance. We explore several S4-based deep architectures in time (T) and time-frequency (TF) domains. The 2-D S4 layer can be considered a particular convolutional layer with an infinite receptive field although it utilizes fewer parameters than a conventional convolutional layer. Evaluated on the VoiceBank-DEMAND data set, when compared with the conventional U-net model based on convolutional layers, the proposed TF-domain S4-based model is 78.6% smaller in size, yet it still achieves competitive results with a PESQ score of 3.15 with data augmentation. By increasing the model size, we can even reach a PESQ score of 3.18.
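The claim that an S4 layer acts like a convolution with an unbounded receptive field while using few parameters can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a real, diagonal, discrete-time state space model, and it assumes the 2-D kernel is the outer product of one 1-D kernel per axis. It also omits the whitening transformation. All function and variable names are hypothetical.

```python
# Toy diagonal state space model (SSM); an illustrative sketch only.
# Assumptions: real diagonal state matrix, no whitening, no discretization step.

def ssm_kernel(a, b, c, length):
    """Convolution kernel K[l] = sum_n c[n] * a[n]**l * b[n]."""
    return [sum(cn * (an ** l) * bn for an, bn, cn in zip(a, b, c))
            for l in range(length)]

def ssm_recurrence(a, b, c, u):
    """Step-by-step form: x[k] = a * x[k-1] + b * u[k], y[k] = c . x[k]."""
    x = [0.0] * len(a)
    y = []
    for u_k in u:
        x = [an * xn + bn * u_k for an, xn, bn in zip(a, x, b)]
        y.append(sum(cn * xn for cn, xn in zip(c, x)))
    return y

def causal_conv(kernel, u):
    """Causal convolution of u with kernel, truncated to len(u)."""
    return [sum(kernel[j] * u[k - j] for j in range(k + 1))
            for k in range(len(u))]

def kernel_2d(k_time, k_freq):
    """Assumed separable 2-D kernel: outer product of per-axis kernels."""
    return [[kt * kf for kf in k_freq] for kt in k_time]

# Hypothetical parameters: state size N = 2 on each axis.
a_t, b_t, c_t = [0.9, 0.5], [1.0, 0.5], [0.3, -0.2]   # time axis
a_f, b_f, c_f = [0.8, 0.4], [0.7, 1.0], [0.5, 0.1]    # frequency axis

u = [1.0, 0.0, 2.0, -1.0, 0.5]                         # toy 1-D input

# The kernel is generated to match the input length, so every output
# sample can depend on every earlier input sample.
k_time = ssm_kernel(a_t, b_t, c_t, len(u))
y_rec = ssm_recurrence(a_t, b_t, c_t, u)
y_conv = causal_conv(k_time, u)

# A 2-D kernel spanning 5 time frames and 4 frequency bins.
k_freq = ssm_kernel(a_f, b_f, c_f, 4)
k_tf = kernel_2d(k_time, k_freq)

# The parameter count depends on the state size, not on the kernel extent.
n_ssm_params = len(a_t + b_t + c_t) + len(a_f + b_f + c_f)
n_dense_params = len(k_tf) * len(k_tf[0])
```

In this toy setup the recurrent and convolutional outputs agree, and the 5-by-4 kernel is generated from 12 numbers instead of 20 free weights. The gap widens as the kernel grows, because the number of SSM parameters stays fixed.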
