AS, AI, CL, SD
May 31, 2020

Crossed-Time Delay Neural Network for Speaker Recognition

arXiv:2006.00452v3
3 citations
Has Code
Originality: Incremental advance
AI Analysis

This work addresses speaker recognition for applications like security and voice assistants, offering incremental improvements over existing TDNN variants.

The paper tackles speaker recognition by proposing a Crossed-Time Delay Neural Network (CTDNN) that enhances the Time Delay Neural Network (TDNN), achieving a 2.6% absolute Equal Error Rate improvement in verification on VoxCeleb1 and doubling identification accuracy to 90.4% in few-shot conditions.
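Since the headline verification result is stated as an absolute Equal Error Rate (EER) improvement, a minimal sketch of how EER can be computed from trial scores may help. EER is the operating point where the false acceptance rate equals the false rejection rate, and an "absolute" improvement is a difference in percentage points rather than a relative reduction. The function name `equal_error_rate` and the toy scores below are illustrative assumptions, not taken from the paper or its repository.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    target_scores:    scores of genuine (same-speaker) trials
    nontarget_scores: scores of impostor (different-speaker) trials
    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: a genuine trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: an impostor trial scored at or above the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (hypothetical): one genuine trial scores below one impostor trial.
toy_targets = [0.9, 0.8, 0.7, 0.4]
toy_nontargets = [0.6, 0.3, 0.2, 0.1]
toy_eer = equal_error_rate(toy_targets, toy_nontargets)
print(toy_eer)  # 0.25
```

On real evaluation sets the EER is usually interpolated from a ROC or DET curve; the threshold sweep above is only the simplest approximation.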

Time Delay Neural Network (TDNN) is a well-performing structure for DNN-based speaker recognition systems. In this paper we introduce a novel structure, the Crossed-Time Delay Neural Network (CTDNN), to enhance the performance of the current TDNN. Inspired by the multi-filter setting of convolutional layers in convolutional neural networks, we set multiple time delay units, each with a different context size, at the bottom layer and construct a multilayer parallel network. The proposed CTDNN gives significant improvements over the original TDNN on both speaker verification and identification tasks. On the VoxCeleb1 dataset, it achieves a 2.6% absolute Equal Error Rate improvement in the verification experiment. In the few-shot condition, CTDNN reaches 90.4% identification accuracy, double that of the original TDNN. We also compare the proposed CTDNN with FTDNN, another recent variant of TDNN: our model shows a 36% absolute identification accuracy improvement in the few-shot condition and can handle training with a larger batch in a shorter training time, making better use of computational resources. The code for the new model is released at https://github.com/chenllliang/CTDNN
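The core idea in the abstract, several time delay units with different context sizes applied in parallel at the bottom layer, can be sketched in a few lines of NumPy. This is a rough illustration under stated assumptions, not the released implementation: the function names, the context sizes (1, 5, 9), the feature and output dimensions, and the choice to mean-pool each branch over time before concatenating are all hypothetical.

```python
import numpy as np


def time_delay_unit(frames, weights, bias):
    """One time delay unit: an affine map over a sliding window of frames.

    frames:  (T, feat_dim) acoustic features
    weights: (context, feat_dim, out_dim)
    bias:    (out_dim,)
    returns: (T - context + 1, out_dim) ReLU activations
    """
    context = weights.shape[0]
    num_out = frames.shape[0] - context + 1
    out = np.empty((num_out, weights.shape[2]))
    for t in range(num_out):
        window = frames[t:t + context]  # (context, feat_dim)
        # Contract the window with the weights over time and feature axes.
        out[t] = np.tensordot(window, weights, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)


def crossed_bottom_layer(frames, units):
    """Apply several time delay units with different context sizes in parallel.

    Branches produce different numbers of output frames, so this sketch
    mean-pools each branch over time and concatenates the pooled vectors.
    That merge strategy is an assumption made for illustration.
    """
    pooled = [time_delay_unit(frames, w, b).mean(axis=0) for w, b in units]
    return np.concatenate(pooled)


rng = np.random.default_rng(0)
feat_dim, out_dim = 30, 8  # hypothetical sizes
frames = rng.standard_normal((100, feat_dim))
units = [
    (rng.standard_normal((c, feat_dim, out_dim)) * 0.1, np.zeros(out_dim))
    for c in (1, 5, 9)  # hypothetical context sizes, one per parallel branch
]
embedding = crossed_bottom_layer(frames, units)
print(embedding.shape)  # (24,)
```

A plain TDNN bottom layer corresponds to a single entry in `units`; the parallel variant lets narrow and wide temporal contexts be extracted from the same input side by side.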
