Improved Deep Speaker Feature Learning for Text-Dependent Speaker Recognition
This work addresses speaker recognition for security or authentication systems, but it is incremental as it builds on prior deep learning methods.
The paper tackles text-dependent speaker recognition by improving deep speaker feature learning, yielding a considerable performance improvement over the existing d-vector implementation.
A deep learning approach has recently been proposed to derive speaker identities (d-vectors) with a deep neural network (DNN). This approach has been applied to text-dependent speaker recognition tasks and shows reasonable performance gains when combined with the conventional i-vector approach. Although promising, the existing d-vector implementation still cannot compete with the i-vector baseline. This paper presents two improvements to the deep learning approach: a phone-dependent DNN structure to normalize phone variation, and a new scoring approach based on dynamic time warping (DTW). Experiments on a text-dependent speaker recognition task demonstrate that the proposed methods provide considerable performance improvement over the existing d-vector implementation.
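To make the DTW-based scoring idea concrete, the following is a minimal sketch, not the paper's implementation. It assumes that each utterance is represented as a sequence of frame-level deep speaker features (one vector per frame), that the local cost is cosine distance, and that the accumulated cost is normalized by the combined sequence length. The function names `cosine_distance_matrix` and `dtw_distance`, the step pattern, and the normalization are illustrative assumptions; the abstract does not specify them.

```python
import numpy as np


def cosine_distance_matrix(a, b):
    """Pairwise cosine distance between rows of a (n x d) and b (m x d)."""
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a_n @ b_n.T


def dtw_distance(enroll, test):
    """Length-normalized DTW cost between two feature sequences.

    Assumption: symmetric step pattern (insertion, deletion, match) and
    normalization by n + m. Lower values mean the sequences align better.
    """
    cost = cosine_distance_matrix(enroll, test)
    n, m = cost.shape
    # acc[i, j] holds the best accumulated cost aligning the first i
    # enrollment frames with the first j test frames.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best_prev
    return acc[n, m] / (n + m)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    enroll = rng.normal(size=(20, 8))       # hypothetical enrollment features
    slower = np.repeat(enroll, 2, axis=0)   # same content at half the speaking rate
    other = rng.normal(size=(25, 8))        # unrelated feature sequence
    print("same content, stretched:", dtw_distance(enroll, slower))
    print("unrelated sequence:     ", dtw_distance(enroll, other))
```

Because a text-dependent task fixes the phrase, aligning frame sequences in this way keeps the temporal order of the features, which simple averaging of frame-level features into a single d-vector discards. A verification decision would then compare the returned distance against a threshold tuned on development data.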