CL · SD · AS · May 5, 2020

End-to-end Whispered Speech Recognition with Frequency-weighted Approaches and Pseudo Whisper Pre-training

arXiv:2005.01972v2 · 4 citations
AI Analysis

This work addresses the lack of end-to-end recognition results for whispered speech, which matters for applications like healthcare or privacy. The contribution is incremental: it builds on existing methods with domain-specific adaptations.

The paper tackles end-to-end whispered speech recognition by addressing data scarcity and exploiting the characteristics of whispered speech, achieving a 19.8% relative reduction in phone error rate (PER) and a 44.4% relative reduction in character error rate (CER) on the whispered TIMIT corpus.
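The reductions quoted above are relative, not absolute, so they are measured as a fraction of the baseline error rate. A minimal sketch of that arithmetic follows; the function name and the error rates are hypothetical and chosen only for illustration, not taken from the paper.

```python
def relative_reduction(baseline_error, new_error):
    """Return the error reduction as a fraction of the baseline error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical error rates (in percent), NOT results from the paper.
baseline_per = 50.0
improved_per = 40.1
print(f"{relative_reduction(baseline_per, improved_per):.1%}")  # prints 19.8%
```

With these made-up numbers the absolute drop is 9.9 points, while the relative reduction is 19.8%.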

Whispering is an important mode of human speech, but no end-to-end recognition results for it have been reported yet, probably due to the scarcity of available whispered speech data. In this paper, we present several approaches for end-to-end (E2E) recognition of whispered speech that take into account both its special characteristics and the scarcity of data. These include a frequency-weighted SpecAugment policy and a frequency-divided CNN feature extractor for better capturing the high-frequency structures of whispered speech, and a layer-wise transfer learning approach that pre-trains a model with normal or normal-to-whispered converted speech and then fine-tunes it with whispered speech, to bridge the gap between whispered and normal speech. We achieve an overall relative reduction of 19.8% in PER and 44.4% in CER on a relatively small whispered TIMIT corpus. The results indicate that, as long as we have a good E2E model pre-trained on normal or pseudo-whispered speech, a relatively small set of whispered speech may suffice to obtain a reasonably good E2E whispered speech recognizer.
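The abstract does not spell out how the frequency-weighted SpecAugment policy works. As a rough illustration only, the sketch below assumes one plausible form: standard SpecAugment-style frequency masking, except that the starting bin of each mask is drawn from a non-uniform distribution instead of a uniform one. The function names, the linearly decreasing weights that favor masking low-frequency bins (so that high-frequency bins stay visible more often), and the default parameters are all assumptions, not the authors' method.

```python
import numpy as np


def low_frequency_biased_weights(n_freq):
    """Hypothetical weighting: bin 0 gets weight n_freq, the top bin gets 1."""
    return np.arange(n_freq, 0, -1, dtype=float)


def frequency_weighted_mask(spec, start_weights, num_masks=2, max_width=8, rng=None):
    """Zero out frequency bands whose start bins are sampled non-uniformly.

    spec          : array of shape (n_freq, n_time), e.g. a log-mel spectrogram
    start_weights : non-negative array of length n_freq; larger values make a
                    bin more likely to be chosen as the start of a mask
    """
    rng = np.random.default_rng() if rng is None else rng
    out = np.array(spec, dtype=float, copy=True)  # leave the input untouched
    n_freq = out.shape[0]
    weights = np.asarray(start_weights, dtype=float)
    if weights.shape != (n_freq,):
        raise ValueError("start_weights must have one entry per frequency bin")

    for _ in range(num_masks):
        # Mask width is uniform in [0, max_width], as in plain SpecAugment.
        width = min(int(rng.integers(0, max_width + 1)), n_freq)
        n_starts = n_freq - width + 1  # valid start bins for this width
        probs = weights[:n_starts]
        total = probs.sum()
        # Fall back to uniform sampling if all candidate weights are zero.
        probs = probs / total if total > 0 else np.full(n_starts, 1.0 / n_starts)
        start = int(rng.choice(n_starts, p=probs))
        out[start:start + width, :] = 0.0
    return out


# Usage on a dummy 40-bin, 100-frame spectrogram.
dummy = np.ones((40, 100))
masked = frequency_weighted_mask(
    dummy, low_frequency_biased_weights(40), rng=np.random.default_rng(0)
)
```

Passing uniform weights recovers ordinary frequency masking, which makes it easy to compare the two policies under otherwise identical settings.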

