WhispEar: A Bi-directional Framework for Scaling Whispered Speech Conversion via Pseudo-Parallel Whisper Generation
This work matters for applications that depend on whispered speech conversion, particularly when parallel data are scarce, because it provides a scalable data augmentation method.
This paper addresses whisper-to-normal (W2N) speech conversion, which is held back by the scarcity of parallel data. The authors propose WhispEar, a bidirectional framework that includes a normal-to-whisper (N2W) model. The N2W model generates pseudo-parallel whispered speech from abundant normal speech, which scales up the training data available to the W2N model and improves its performance.
Whispered speech lacks vocal fold vibration and fundamental frequency, resulting in degraded acoustic cues and making whisper-to-normal (W2N) conversion challenging, especially with limited parallel data. We propose WhispEar, a bidirectional framework based on unified semantic representations that capture speaking-mode-invariant information shared by whispered and normal speech. The framework contains both W2N and normal-to-whisper (N2W) models. Notably, the N2W model enables zero-shot pseudo-parallel whisper generation from abundant normal speech, allowing scalable data augmentation for W2N training. Increasing generated data consistently improves performance. We also release the largest bilingual (Chinese-English) whispered-normal parallel corpus to date. Experiments demonstrate that WhispEar outperforms strong baselines and benefits significantly from scalable pseudo-parallel data.
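The augmentation loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Pair`, `build_w2n_training_set`, and `toy_n2w` are hypothetical, waveforms are represented as plain lists of floats, and `toy_n2w` is a trivial stand-in for a trained N2W model. The sketch only shows how real parallel pairs and generated pseudo-parallel pairs might be combined into one W2N training set whose size is controlled by the amount of normal speech passed through N2W.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Waveform = List[float]  # assumption: a waveform is a plain list of samples


@dataclass(frozen=True)
class Pair:
    """One (whisper, normal) training example for a W2N model."""
    whisper: Waveform
    normal: Waveform
    is_pseudo: bool  # True if the whisper side was generated by N2W


def build_w2n_training_set(
    real_pairs: Sequence[Tuple[Waveform, Waveform]],
    normal_corpus: Sequence[Waveform],
    n2w: Callable[[Waveform], Waveform],
    num_pseudo: int,
) -> List[Pair]:
    """Combine real parallel pairs with N2W-generated pseudo-parallel pairs."""
    # Real recordings come first and are marked as non-synthetic.
    pairs = [Pair(w, n, False) for w, n in real_pairs]
    # Each normal utterance becomes the target; its generated whisper is the input.
    for normal in normal_corpus[:num_pseudo]:
        pairs.append(Pair(n2w(normal), normal, True))
    return pairs


def toy_n2w(normal: Waveform) -> Waveform:
    """Placeholder only: halves each sample. NOT a real whisper generator."""
    return [0.5 * x for x in normal]


if __name__ == "__main__":
    real = [([0.1, 0.2], [0.2, 0.4])]
    corpus = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    train = build_w2n_training_set(real, corpus, toy_n2w, num_pseudo=2)
    print(len(train), sum(p.is_pseudo for p in train))
```

Raising `num_pseudo` in this sketch corresponds to the scaling knob the abstract refers to: more normal speech run through N2W yields more pseudo-parallel pairs for W2N training.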