Danwei Cai

AS
8papers
277citations
Novelty40%
AI Score23

8 Papers

ASSep 7, 2021
The DKU-DukeECE System for the Self-Supervision Speaker Verification Task of the 2021 VoxCeleb Speaker Recognition Challenge

Danwei Cai, Ming Li

This report describes the submission of the DKU-DukeECE team to the self-supervision speaker verification task of the 2021 VoxCeleb Speaker Recognition Challenge (VoxSRC). Our method employs an iterative labeling framework to learn self-supervised speaker representation based on a deep neural network (DNN). The framework starts with training a self-supervision speaker embedding network by maximizing agreement between different segments within an utterance via a contrastive loss. Taking advantage of DNN's ability to learn from data with label noise, we propose to cluster the speaker embedding obtained from the previous speaker network and use the subsequent class assignments as pseudo labels to train a new DNN. Moreover, we iteratively train the speaker network with pseudo labels generated from the previous step to bootstrap the discriminative power of a DNN. Also, visual modal data is incorporated in this self-labeling framework. The visual pseudo label and the audio pseudo label are fused with a cluster ensemble algorithm to generate a robust supervisory signal for representation learning. Our submission achieves an equal error rate (EER) of 5.58% and 5.59% on the challenge development and test set, respectively.

ASSep 5, 2021
The DKU-DukeECE-Lenovo System for the Diarization Task of the 2021 VoxCeleb Speaker Recognition Challenge

Weiqing Wang, Danwei Cai, Qingjian Lin et al.

This report describes the submission of the DKU-DukeECE-Lenovo team to the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2021 track 4. Our system including a voice activity detection (VAD) model, a speaker embedding model, two clustering-based speaker diarization systems with different similarity measurements, two different overlapped speech detection (OSD) models, and a target-speaker voice activity detection (TS-VAD) model. Our final submission, consisting of 5 independent systems, achieves a DER of 5.07% on the challenge test set.

ASFeb 6, 2021
The DKU-Duke-Lenovo System Description for the Third DIHARD Speech Diarization Challenge

Weiqing Wang, Qingjian Lin, Danwei Cai et al.

In this paper, we present the submitted system for the third DIHARD Speech Diarization Challenge from the DKU-Duke-Lenovo team. Our system consists of several modules: voice activity detection (VAD), segmentation, speaker embedding extraction, attentive similarity scoring, agglomerative hierarchical clustering. In addition, the target speaker VAD (TSVAD) is used for the phone call data to further improve the performance. Our final submitted system achieves a DER of 15.43% for the core evaluation set and 13.39% for the full evaluation set on task 1, and we also get a DER of 21.63% for core evaluation set and 18.90% for full evaluation set on task 2.

ASOct 25, 2020
An iterative framework for self-supervised deep speaker representation learning

Danwei Cai, Weiqing Wang, Ming Li

In this paper, we propose an iterative framework for self-supervised speaker representation learning based on a deep neural network (DNN). The framework starts with training a self-supervision speaker embedding network by maximizing agreement between different segments within an utterance via a contrastive loss. Taking advantage of DNN's ability to learn from data with label noise, we propose to cluster the speaker embedding obtained from the previous speaker network and use the subsequent class assignments as pseudo labels to train a new DNN. Moreover, we iteratively train the speaker network with pseudo labels generated from the previous step to bootstrap the discriminative power of a DNN. Speaker verification experiments are conducted on the VoxCeleb dataset. The results show that our proposed iterative self-supervised learning framework outperformed previous works using self-supervision. The speaker network after 5 iterations obtains a 61% performance gain over the speaker embedding model trained with contrastive loss.

ASFeb 3, 2020
Within-sample variability-invariant loss for robust speaker recognition under noisy environments

Danwei Cai, Weicheng Cai, Ming Li

Despite the significant improvements in speaker recognition enabled by deep neural networks, unsatisfactory performance persists under noisy environments. In this paper, we train the speaker embedding network to learn the "clean" embedding of the noisy utterance. Specifically, the network is trained with the original speaker identification loss with an auxiliary within-sample variability-invariant loss. This auxiliary variability-invariant loss is used to learn the same embedding among the clean utterance and its noisy copies and prevents the network from encoding the undesired noises or variabilities into the speaker representation. Furthermore, we investigate the data preparation strategy for generating clean and noisy utterance pairs on-the-fly. The strategy generates different noisy copies for the same clean utterance at each training step, helping the speaker embedding network generalize better under noisy environments. Experiments on VoxCeleb1 indicate that the proposed training framework improves the performance of the speaker verification system in both clean and noisy conditions.

ASJul 5, 2019
The DKU Replay Detection System for the ASVspoof 2019 Challenge: On Data Augmentation, Feature Representation, Classification, and Fusion

Weicheng Cai, Haiwei Wu, Danwei Cai et al.

This paper describes our DKU replay detection system for the ASVspoof 2019 challenge. The goal is to develop spoofing countermeasure for automatic speaker recognition in physical access scenario. We leverage the countermeasure system pipeline from four aspects, including the data augmentation, feature representation, classification, and fusion. First, we introduce an utterance-level deep learning framework for anti-spoofing. It receives the variable-length feature sequence and outputs the utterance-level scores directly. Based on the framework, we try out various kinds of input feature representations extracted from either the magnitude spectrum or phase spectrum. Besides, we also perform the data augmentation strategy by applying the speed perturbation on the raw waveform. Our best single system employs a residual neural network trained by the speed-perturbed group delay gram. It achieves EER of 1.04% on the development set, as well as EER of 1.08% on the evaluation set. Finally, using the simple average score from several single systems can further improve the performance. EER of 0.24% on the development set and 0.66% on the evaluation set is obtained for our primary system.

ASFeb 20, 2019
Utterance-level end-to-end language identification using attention-based CNN-BLSTM

Weicheng Cai, Danwei Cai, Shen Huang et al.

In this paper, we present an end-to-end language identification framework, the attention-based Convolutional Neural Network-Bidirectional Long-short Term Memory (CNN-BLSTM). The model is performed on the utterance level, which means the utterance-level decision can be directly obtained from the output of the neural network. To handle speech utterances with entire arbitrary and potentially long duration, we combine CNN-BLSTM model with a self-attentive pooling layer together. The front-end CNN-BLSTM module plays a role as local pattern extractor for the variable-length inputs, and the following self-attentive pooling layer is built on top to get the fixed-dimensional utterance-level representation. We conducted experiments on NIST LRE07 closed-set task, and the results reveal that the proposed attention-based CNN-BLSTM model achieves comparable error reduction with other state-of-the-art utterance-level neural network approaches for all 3 seconds, 10 seconds, 30 seconds duration tasks.

ASSep 9, 2018
End-to-end Language Identification using NetFV and NetVLAD

Jinkun Chen, Weicheng Cai, Danwei Cai et al.

In this paper, we apply the NetFV and NetVLAD layers for the end-to-end language identification task. NetFV and NetVLAD layers are the differentiable implementations of the standard Fisher Vector and Vector of Locally Aggregated Descriptors (VLAD) methods, respectively. Both of them can encode a sequence of feature vectors into a fixed dimensional vector which is very important to process those variable-length utterances. We first present the relevances and differences between the classical i-vector and the aforementioned encoding schemes. Then, we construct a flexible end-to-end framework including a convolutional neural network (CNN) architecture and an encoding layer (NetFV or NetVLAD) for the language identification task. Experimental results on the NIST LRE 2007 close-set task show that the proposed system achieves significant EER reductions against the conventional i-vector baseline and the CNN temporal average pooling system, respectively.