Comparison of Multiple Features and Modeling Methods for Text-dependent Speaker Verification
This work addresses performance gaps in speaker verification for security applications, but it is incremental as it compares existing methods without introducing new ones.
The paper tackles text-dependent speaker verification by comparing four modeling methods and several bottleneck features on the RedDots dataset. HMM-based methods with explicit lexical modeling performed well in the fixed-phrase condition but struggled in the prompted-phrase condition, and bottleneck features did not outperform MFCCs on the most challenging trials.
Text-dependent speaker verification is becoming popular in the speaker recognition community. However, the conventional i-vector framework, which has been successful for speaker identification and other similar tasks, works relatively poorly in this task. Researchers have proposed several new methods to improve performance, but it is still unclear which model is the best choice, especially when the pass-phrases are prompted during enrollment and test. In this paper, we introduce four modeling methods and compare their performance on the newly published RedDots dataset. To further explore the influence of different frame alignments, both the Viterbi and forward-backward algorithms are used in the HMM-based models. Several bottleneck features are also investigated. Our experiments show that, by explicitly modeling the lexical content, HMM-based modeling achieves good results in the fixed-phrase condition. In the prompted-phrase condition, GMM-HMM and i-vector/HMM are not as successful. In both conditions, the forward-backward algorithm brings more benefits to the i-vector/HMM system. We also find that, even though bottleneck features perform well for text-independent speaker verification, they do not outperform MFCCs on the most challenging Imposter-Correct trials on RedDots.
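To make the distinction between the two frame alignments concrete, the following is a minimal sketch, not the systems evaluated in this paper. It contrasts a hard Viterbi alignment (one HMM state per frame) with a soft forward-backward alignment (a posterior over states per frame). The function names, the two-state left-to-right topology, and all numeric values are illustrative assumptions. Per-frame emission likelihoods are taken as given, whereas in a real system they would come from the state GMMs.

```python
import numpy as np


def viterbi_alignment(b, A, pi):
    """Hard alignment: the single most likely state index for every frame.

    b  : (T, S) emission likelihoods per frame and state (assumed given)
    A  : (S, S) transition probabilities, A[i, j] = P(j | i)
    pi : (S,)   initial state probabilities
    """
    T, S = b.shape
    delta = np.zeros((T, S))           # best-path score ending in each state
    psi = np.zeros((T, S), dtype=int)  # back-pointers
    delta[0] = pi * b[0]
    delta[0] /= delta[0].sum()         # rescale to avoid underflow
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A   # scores[i, j]: come from i, go to j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * b[t]
        delta[t] /= delta[t].sum()           # rescaling keeps the argmax unchanged
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):           # trace the back-pointers
        path[t] = psi[t + 1, path[t + 1]]
    return path


def forward_backward_alignment(b, A, pi):
    """Soft alignment: state posteriors gamma[t, s] = P(state s at frame t | all frames)."""
    T, S = b.shape
    alpha = np.zeros((T, S))
    beta = np.ones((T, S))
    alpha[0] = pi * b[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):                    # scaled forward pass
        alpha[t] = (alpha[t - 1] @ A) * b[t]
        alpha[t] /= alpha[t].sum()
    for t in range(T - 2, -1, -1):           # scaled backward pass
        beta[t] = A @ (b[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)


# Toy example (made-up numbers): 4 frames, 2-state left-to-right HMM.
b = np.array([[0.9, 0.1], [0.6, 0.4], [0.4, 0.6], [0.1, 0.9]])
A = np.array([[0.7, 0.3], [0.0, 1.0]])
pi = np.array([1.0, 0.0])

hard = viterbi_alignment(b, A, pi)            # one state index per frame
soft = forward_backward_alignment(b, A, pi)   # one posterior row per frame
print(hard)
print(soft.round(3))
```

With a hard alignment, each frame contributes its statistics to exactly one state. With a soft alignment, frames near a state boundary are shared between neighbouring states in proportion to their posteriors, which is one plausible reason the choice of alignment can affect the statistics used downstream.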