VoiceExtender: Short-utterance Text-independent Speaker Verification with Guided Diffusion Model
This addresses the challenge of accurate speaker verification for short-duration speech, which is incremental as it builds on diffusion models for a specific bottleneck.
The paper tackles the problem of speaker verification performance degrading with short utterances by proposing VoiceExtender, a method using guided diffusion models to augment speech features, resulting in relative EER improvements of up to 46.1% on the VoxCeleb1 dataset.
Speaker verification (SV) performance deteriorates as utterances become shorter. To this end, we propose a new architecture called VoiceExtender which provides a promising solution for improving SV performance when handling short-duration speech signals. We use two guided diffusion models, the built-in and the external speaker embedding (SE) guided diffusion model, both of which utilize a diffusion model-based sample generator that leverages SE guidance to augment the speech features based on a short utterance. Extensive experimental results on the VoxCeleb1 dataset show that our method outperforms the baseline, with relative improvements in equal error rate (EER) of 46.1%, 35.7%, 10.4%, and 5.7% for the short utterance conditions of 0.5, 1.0, 1.5, and 2.0 seconds, respectively.