AS · AI · SD · Feb 27, 2025

CleanMel: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR

arXiv:2502.20040v2 · 5 citations · h-index: 6 · IEEE Transactions on Audio, Speech, and Language Processing
Originality: Incremental advance
AI Analysis

This work addresses speech enhancement for both human perception and machine recognition. It is an incremental advance, since it builds on existing Mel-spectrogram methods.

The authors propose CleanMel, a Mel-spectrogram denoising and dereverberation network, and report improvements in both speech quality and ASR performance on six datasets.

In this work, we propose CleanMel, a single-channel Mel-spectrogram denoising and dereverberation network for improving both speech quality and automatic speech recognition (ASR) performance. The proposed network takes a noisy and reverberant microphone recording as input and predicts the corresponding clean Mel-spectrogram. The enhanced Mel-spectrogram can either be transformed to a speech waveform with a neural vocoder or be used directly for ASR. The proposed network is composed of interleaved cross-band and narrow-band processing in the Mel-frequency domain, for learning the full-band spectral pattern and the narrow-band properties of signals, respectively. Compared to linear-frequency domain or time-domain speech enhancement, the key advantage of Mel-spectrogram enhancement is that the Mel-frequency representation presents speech in a more compact way and is thus easier to learn, which benefits both speech quality and ASR. Experimental results on five English datasets and one Chinese dataset demonstrate a significant improvement in both speech quality and ASR performance achieved by the proposed model. Code and audio examples of our model are available online.
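To make the "more compact" claim concrete, the following is a minimal sketch of how one frame of a linear-frequency power spectrum is projected onto Mel bands. It is not the CleanMel front end: the 16 kHz sample rate, 512-point FFT, 80 Mel bands, the HTK-style Mel formula, and all function names are illustrative assumptions, and the paper's actual settings may differ.

```python
import math


def hz_to_mel(hz):
    # HTK-style Mel scale (an assumption; other Mel variants exist).
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Return n_mels triangular filters, each with n_fft // 2 + 1 weights."""
    n_bins = n_fft // 2 + 1
    max_mel = hz_to_mel(sample_rate / 2.0)
    # n_mels + 2 edge frequencies, equally spaced on the Mel scale.
    edges_hz = [mel_to_hz(max_mel * i / (n_mels + 1)) for i in range(n_mels + 2)]
    # Centre frequency of every linear FFT bin.
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]

    bank = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (mid - lo)
            falling = (hi - f) / (hi - mid)
            # Triangle: zero outside [lo, hi], peak of 1 at mid.
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def log_mel_frame(power_frame, bank, floor=1e-10):
    """Project one power-spectrum frame onto the Mel bands and take the log."""
    return [
        math.log(max(floor, sum(w * p for w, p in zip(row, power_frame))))
        for row in bank
    ]


# Hypothetical settings, chosen only for illustration.
SAMPLE_RATE = 16000
N_FFT = 512
N_MELS = 80

bank = mel_filterbank(N_MELS, N_FFT, SAMPLE_RATE)
flat_frame = [1.0] * (N_FFT // 2 + 1)    # 257 linear-frequency bins
features = log_mel_frame(flat_frame, bank)  # 80 Mel-frequency values
```

Under these assumed settings each frame shrinks from 257 linear-frequency bins to 80 Mel bands, so a network that predicts the clean Mel-spectrogram has fewer values to estimate per frame than one operating on the full linear-frequency spectrum.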

Code Implementations: 1 repo
