QUANT-PHFeb 27, 2023
Optimizing Quantum Federated Learning Based on Federated Quantum Natural Gradient DescentJun Qi, Xiao-Lei Zhang, Javier Tejedor
Quantum federated learning (QFL) is a quantum extension of the classical federated learning model across multiple local quantum devices. An efficient optimization algorithm is always expected to minimize the communication overhead among different quantum participants. In this work, we propose an efficient optimization algorithm, namely federated quantum natural gradient descent (FQNGD), and further, apply it to a QFL framework that is composed of a variational quantum circuit (VQC)-based quantum neural networks (QNN). Compared with stochastic gradient descent methods like Adam and Adagrad, the FQNGD algorithm admits much fewer training iterations for the QFL to get converged. Moreover, it can significantly reduce the total communication overhead among local quantum devices. Our experiments on a handwritten digit classification dataset justify the effectiveness of the FQNGD for the QFL framework in terms of a faster convergence rate on the training set and higher accuracy on the test set.
SDOct 30, 2022
Symmetric Saliency-based Adversarial Attack To Speaker IdentificationJiadi Yao, Xing Chen, Xiao-Lei Zhang et al.
Adversarial attack approaches to speaker identification either need high computational cost or are not very effective, to our knowledge. To address this issue, in this paper, we propose a novel generation-network-based approach, called symmetric saliency-based encoder-decoder (SSED), to generate adversarial voice examples to speaker identification. It contains two novel components. First, it uses a novel saliency map decoder to learn the importance of speech samples to the decision of a targeted speaker identification system, so as to make the attacker focus on generating artificial noise to the important samples. It also proposes an angular loss function to push the speaker embedding far away from the source speaker. Our experimental results demonstrate that the proposed SSED yields the state-of-the-art performance, i.e. over 97% targeted attack success rate and a signal-to-noise level of over 39 dB on both the open-set and close-set speaker identification tasks, with a low computational cost.
ASNov 2, 2022
LMD: A Learnable Mask Network to Detect Adversarial Examples for Speaker VerificationXing Chen, Jie Wang, Xiao-Lei Zhang et al.
Although the security of automatic speaker verification (ASV) is seriously threatened by recently emerged adversarial attacks, there have been some countermeasures to alleviate the threat. However, many defense approaches not only require the prior knowledge of the attackers but also possess weak interpretability. To address this issue, in this paper, we propose an attacker-independent and interpretable method, named learnable mask detector (LMD), to separate adversarial examples from the genuine ones. It utilizes score variation as an indicator to detect adversarial examples, where the score variation is the absolute discrepancy between the ASV scores of an original audio recording and its transformed audio synthesized from its masked complex spectrogram. A core component of the score variation detector is to generate the masked spectrogram by a neural network. The neural network needs only genuine examples for training, which makes it an attacker-independent approach. Its interpretability lies that the neural network is trained to minimize the score variation of the targeted ASV, and maximize the number of the masked spectrogram bins of the genuine training examples. Its foundation is based on the observation that, masking out the vast majority of the spectrogram bins with little speaker information will inevitably introduce a large score variation to the adversarial example, and a small score variation to the genuine example. Experimental results with 12 attackers and two representative ASV systems show that our proposed method outperforms five state-of-the-art baselines. The extensive experimental results can also be a benchmark for the detection-based ASV defenses.
CVJul 25, 2022
Improving Pseudo Labels With Intra-Class Similarity for Unsupervised Domain AdaptationJie Wang, Xiao-Lei Zhang
Unsupervised domain adaptation (UDA) transfers knowledge from a label-rich source domain to a different but related fully-unlabeled target domain. To address the problem of domain shift, more and more UDA methods adopt pseudo labels of the target samples to improve the generalization ability on the target domain. However, inaccurate pseudo labels of the target samples may yield suboptimal performance with error accumulation during the optimization process. Moreover, once the pseudo labels are generated, how to remedy the generated pseudo labels is far from explored. In this paper, we propose a novel approach to improve the accuracy of the pseudo labels in the target domain. It first generates coarse pseudo labels by a conventional UDA method. Then, it iteratively exploits the intra-class similarity of the target samples for improving the generated coarse pseudo labels, and aligns the source and target domains with the improved pseudo labels. The accuracy improvement of the pseudo labels is made by first deleting dissimilar samples, and then using spanning trees to eliminate the samples with the wrong pseudo labels in the intra-class samples. We have applied the proposed approach to several conventional UDA methods as an additional term. Experimental results demonstrate that the proposed method can boost the accuracy of the pseudo labels and further lead to more discriminative and domain invariant features than the conventional baselines.
SDFeb 21, 2023
Interpretable Spectrum Transformation Attacks to Speaker RecognitionJiadi Yao, Hong Luo, Xiao-Lei Zhang
The success of adversarial attacks to speaker recognition is mainly in white-box scenarios. When applying the adversarial voices that are generated by attacking white-box surrogate models to black-box victim models, i.e. \textit{transfer-based} black-box attacks, the transferability of the adversarial voices is not only far from satisfactory, but also lacks interpretable basis. To address these issues, in this paper, we propose a general framework, named spectral transformation attack based on modified discrete cosine transform (STA-MDCT), to improve the transferability of the adversarial voices to a black-box victim model. Specifically, we first apply MDCT to the input voice. Then, we slightly modify the energy of different frequency bands for capturing the salient regions of the adversarial noise in the time-frequency domain that are critical to a successful attack. Unlike existing approaches that operate voices in the time domain, the proposed framework operates voices in the time-frequency domain, which improves the interpretability, transferability, and imperceptibility of the attack. Moreover, it can be implemented with any gradient-based attackers. To utilize the advantage of model ensembling, we not only implement STA-MDCT with a single white-box surrogate model, but also with an ensemble of surrogate models. Finally, we visualize the saliency maps of adversarial voices by the class activation maps (CAM), which offers an interpretable basis to transfer-based attacks in speaker recognition for the first time. Extensive comparison results with five representative attackers show that the CAM visualization clearly explains the effectiveness of STA-MDCT, and the weaknesses of the comparison methods; the proposed method outperforms the comparison methods by a large margin.
SDOct 23, 2020Code
Transformer-based End-to-End Speech Recognition with Local Dense Synthesizer AttentionMenglong Xu, Shengqiang Li, Xiao-Lei Zhang
Recently, several studies reported that dot-product selfattention (SA) may not be indispensable to the state-of-theart Transformer models. Motivated by the fact that dense synthesizer attention (DSA), which dispenses with dot products and pairwise interactions, achieved competitive results in many language processing tasks, in this paper, we first propose a DSA-based speech recognition, as an alternative to SA. To reduce the computational complexity and improve the performance, we further propose local DSA (LDSA) to restrict the attention scope of DSA to a local range around the current central frame for speech recognition. Finally, we combine LDSA with SA to extract the local and global information simultaneously. Experimental results on the Ai-shell1 Mandarine speech recognition corpus show that the proposed LDSA-Transformer achieves a character error rate (CER) of 6.49%, which is slightly better than that of the SA-Transformer. Meanwhile, the LDSA-Transformer requires less computation than the SATransformer. The proposed combination method not only achieves a CER of 6.18%, which significantly outperforms the SA-Transformer, but also has roughly the same number of parameters and computational complexity as the latter. The implementation of the multi-head LDSA is available at https://github.com/mlxu995/multihead-LDSA.
SDFeb 26, 2025
DualSpec: Text-to-spatial-audio Generation via Dual-Spectrogram Guided Diffusion ModelLei Zhao, Sizhou Chen, Linfeng Feng et al.
Text-to-audio (TTA), which generates audio signals from textual descriptions, has received huge attention in recent years. However, recent works focused on text to monaural audio only. As we know, spatial audio provides more immersive auditory experience than monaural audio, e.g. in virtual reality. To address this issue, we propose a text-to-spatial-audio (TTSA) generation framework named DualSpec. Specifically, it first trains variational autoencoders (VAEs) for extracting the latent acoustic representations from sound event audio. Then, given text that describes sound events and event directions, the proposed method uses the encoder of a pretrained large language model to transform the text into text features. Finally, it trains a diffusion model from the latent acoustic representations and text features for the spatial audio generation. In the inference stage, only the text description is needed to generate spatial audio. Particularly, to improve the synthesis quality and azimuth accuracy of the spatial sound events simultaneously, we propose to use two kinds of acoustic features. One is the Mel spectrograms which is good for improving the synthesis quality, and the other is the short-time Fourier transform spectrograms which is good at improving the azimuth accuracy. We provide a pipeline of constructing spatial audio dataset with text prompts, for the training of the VAEs and diffusion model. We also introduce new spatial-aware evaluation metrics to quantify the azimuth errors of the generated spatial audio recordings. Experimental results demonstrate that the proposed method can generate spatial audio with high directional and event consistency.
MMFeb 6, 2025
UniForm: A Unified Multi-Task Diffusion Transformer for Audio-Video GenerationLei Zhao, Linfeng Feng, Dongxu Ge et al.
With the rise of diffusion models, audio-video generation has been revolutionized. However, most existing methods rely on separate modules for each modality, with limited exploration of unified generative architectures. In addition, many are confined to a single task and small-scale datasets. To overcome these limitations, we introduce UniForm, a unified multi-task diffusion transformer that generates both audio and visual modalities in a shared latent space. By using a unified denoising network, UniForm captures the inherent correlations between sound and vision. Additionally, we propose task-specific noise schemes and task tokens, enabling the model to support multiple tasks with a single set of parameters, including video-to-audio, audio-to-video and text-to-audio-video generation. Furthermore, by leveraging large language models and a large-scale text-audio-video combined dataset, UniForm achieves greater generative diversity than prior approaches. Experiments show that UniForm achieves performance close to the state-of-the-art single-task models across three generation tasks, with generated content that is not only highly aligned with real-world data distributions but also enables more diverse and fine-grained generation.
AIJul 21, 2025
HAMLET: Hyperadaptive Agent-based Modeling for Live Embodied TheatricsSizhou Chen, Shufan Jiang, Chi Zhang et al.
Creating an immersive and interactive theatrical experience is a long-term goal in the field of interactive narrative. The emergence of large language model (LLM) is providing a new path to achieve this goal. However, existing LLM-based drama generation methods often result in agents that lack initiative and cannot interact with the physical scene. Furthermore, these methods typically require detailed user input to drive the drama. These limitations reduce the interactivity and immersion of online real-time performance. To address the above challenges, we propose HAMLET, a multi-agent framework focused on drama creation and online performance. Given a simple topic, the framework generates a narrative blueprint, guiding the subsequent improvisational performance. During the online performance, each actor is given an autonomous mind. This means that actors can make independent decisions based on their own background, goals, and emotional state. In addition to conversations with other actors, their decisions can also change the state of scene props through actions such as opening a letter or picking up a weapon. The change is then broadcast to other related actors, updating what they know and care about, which in turn influences their next action. To evaluate the quality of drama performance generated by HAMLET, we designed an evaluation method to assess three primary aspects, including character performance, narrative quality, and interaction experience. The experimental evaluation shows that HAMLET can create expressive and coherent theatrical experiences.
SDOct 12, 2021
Multi-Channel Far-Field Speaker Verification with Large-Scale Ad-hoc Microphone ArraysChengdong Liang, Yijiang Chen, Jiadi Yao et al.
Speaker verification based on ad-hoc microphone arrays has the potential of reducing the error significantly in adverse acoustic environments. However, existing approaches extract utterance-level speaker embeddings from each channel of an ad-hoc microphone array, which does not consider fully the spatial-temporal information across the devices. In this paper, we propose to aggregate the multichannel signals of the ad-hoc microphone array at the frame-level by exploring the cross-channel information deeply with two attention mechanisms. The first one is a self-attention method. It consists of a cross-frame self-attention layer and a cross-channel self-attention layer successively, both working at the frame level. The second one learns the cross-frame and cross-channel information via two graph attention layers. Experimental results demonstrate that the proposed methods reach the state-of-the-art performance. Moreover, the graph-attention method is better than the self-attention method in most cases.
SDJul 13, 2021
Conformer-based End-to-end Speech Recognition With Rotary Position EmbeddingShengqiang Li, Menglong Xu, Xiao-Lei Zhang
Transformer-based end-to-end speech recognition models have received considerable attention in recent years due to their high training speed and ability to model a long-range global context. Position embedding in the transformer architecture is indispensable because it provides supervision for dependency modeling between elements at different positions in the input sequence. To make use of the time order of the input sequence, many works inject some information about the relative or absolute position of the element into the input sequence. In this work, we investigate various position embedding methods in the convolution-augmented transformer (conformer) and adopt a novel implementation named rotary position embedding (RoPE). RoPE encodes absolute positional information into the input sequence by a rotation matrix, and then naturally incorporates explicit relative position information into a self-attention module. To evaluate the effectiveness of the RoPE method, we conducted experiments on AISHELL-1 and LibriSpeech corpora. Results show that the conformer enhanced with RoPE achieves superior performance in the speech recognition task. Specifically, our model achieves a relative word error rate reduction of 8.70% and 7.27% over the conformer on test-clean and test-other sets of the LibriSpeech corpus respectively.
ASJul 13, 2021
AUC Optimization for Robust Small-footprint Keyword Spotting with Limited Training DataMenglong Xu, Shengqiang Li, Chengdong Liang et al.
Deep neural networks provide effective solutions to small-footprint keyword spotting (KWS). However, if training data is limited, it remains challenging to achieve robust and highly accurate KWS in real-world scenarios where unseen sounds that are out of the training data are frequently encountered. Most conventional methods aim to maximize the classification accuracy on the training set, without taking the unseen sounds into account. To enhance the robustness of the deep neural networks based KWS, in this paper, we introduce a new loss function, named the maximization of the area under the receiver-operating-characteristic curve (AUC). The proposed method not only maximizes the classification accuracy of keywords on the closed training set, but also maximizes the AUC score for optimizing the performance of non-keyword segments detection. Experimental results on the Google Speech Commands dataset v1 and v2 show that our method achieves new state-of-the-art performance in terms of most evaluation metrics.
LGJul 5, 2021
fMBN-E: Efficient Unsupervised Network Structure Ensemble and Selection for ClusteringXiao-Lei Zhang
It is known that unsupervised nonlinear dimensionality reduction and clustering is sensitive to the selection of hyperparameters, particularly for deep learning based methods, which hinders its practical use. How to select a proper network structure that may be dramatically different in different applications is a hard issue for deep models, given little prior knowledge of data. In this paper, we aim to automatically determine the optimal network structure of a deep model, named multilayer bootstrap networks (MBN), via simple ensemble learning and selection techniques. Specifically, we first propose an MBN ensemble (MBN-E) algorithm which concatenates the sparse outputs of a set of MBN base models with different network structures into a new representation. Then, we take the new representation produced by MBN-E as a reference for selecting the optimal MBN base models. Moreover, we propose a fast version of MBN-E (fMBN-E), which is not only theoretically even faster than a single standard MBN but also does not increase the estimation error of MBN-E. Importantly, MBN-E and its ensemble selection techniques maintain the simple formulation of MBN that is based on one-nearest-neighbor learning. Empirically, comparing to a number of advanced deep clustering methods and as many as 20 representative unsupervised ensemble learning and selection methods, the proposed methods reach the state-of-the-art performance without manual hyperparameter tuning. fMBN-E is empirically even hundreds of times faster than MBN-E without suffering performance degradation. The applications to image segmentation and graph data mining further demonstrate the advantage of the proposed methods.
SDJul 1, 2021
Attention-based multi-channel speaker verification with ad-hoc microphone arraysChengdong Liang, Junqi Chen, Shanzheng Guan et al.
Recently, ad-hoc microphone array has been widely studied. Unlike traditional microphone array settings, the spatial arrangement and number of microphones of ad-hoc microphone arrays are not known in advance, which hinders the adaptation of traditional speaker verification technologies to ad-hoc microphone arrays. To overcome this weakness, in this paper, we propose attention-based multi-channel speaker verification with ad-hoc microphone arrays. Specifically, we add an inter-channel processing layer and a global fusion layer after the pooling layer of a single-channel speaker verification system. The inter-channel processing layer applies a so-called residual self-attention along the channel dimension for allocating weights to different microphones. The global fusion layer integrates all channels in a way that is independent to the number of the input channels. We further replace the softmax operator in the residual self-attention with sparsemax, which forces the channel weights of very noisy channels to zero. Experimental results with ad-hoc microphone arrays of over 30 channels demonstrate the effectiveness of the proposed methods. For example, the multi-channel speaker verification with sparsemax achieves an equal error rate (EER) of over 20% lower than oracle one-best system on semi-real data sets, and over 30% lower on simulation data sets, in test scenarios with both matched and mismatched channel numbers.
SDApr 14, 2021
Efficient conformer-based speech recognition with linear attentionShengqiang Li, Menglong Xu, Xiao-Lei Zhang
Recently, conformer-based end-to-end automatic speech recognition, which outperforms recurrent neural network based ones, has received much attention. Although the parallel computing of conformer is more efficient than recurrent neural networks, the computational complexity of its dot-product self-attention is quadratic with respect to the length of the input feature. To reduce the computational complexity of the self-attention layer, we propose multi-head linear self-attention for the self-attention layer, which reduces its computational complexity to linear order. In addition, we propose to factorize the feed forward module of the conformer by low-rank matrix factorization, which successfully reduces the number of the parameters by approximate 50% with little performance loss. The proposed model, named linear attention based conformer (LAC), can be trained and inferenced jointly with the connectionist temporal classification objective, which further improves the performance of LAC. To evaluate the effectiveness of LAC, we conduct experiments on the AISHELL-1 and LibriSpeech corpora. Results show that the proposed LAC achieves better performance than 7 recently proposed speech recognition models, and is competitive with the state-of-the-art conformer. Meanwhile, the proposed LAC has a number of parameters of only 50% over the conformer with faster training speed than the latter.
SDMar 29, 2021
Transformer-based end-to-end speech recognition with residual Gaussian-based self-attentionChengdong Liang, Menglong Xu, Xiao-Lei Zhang
Self-attention (SA), which encodes vector sequences according to their pairwise similarity, is widely used in speech recognition due to its strong context modeling ability. However, when applied to long sequence data, its accuracy is reduced. This is caused by the fact that its weighted average operator may lead to the dispersion of the attention distribution, which results in the relationship between adjacent signals ignored. To address this issue, in this paper, we introduce relative-position-awareness self-attention (RPSA). It not only maintains the global-range dependency modeling ability of self-attention, but also improves the localness modeling ability. Because the local window length of the original RPSA is fixed and sensitive to different test data, here we propose Gaussian-based self-attention (GSA) whose window length is learnable and adaptive to the test data automatically. We further generalize GSA to a new residual Gaussian self-attention (resGSA) for the performance improvement. We apply RPSA, GSA, and resGSA to Transformer-based speech recognition respectively. Experimental results on the AISHELL-1 Mandarin speech recognition corpus demonstrate the effectiveness of the proposed methods. For example, the resGSA-Transformer achieves a character error rate (CER) of 5.86% on the test set, which is relative 7.8% lower than that of the SA-Transformer. Although the performance of the proposed resGSA-Transformer is only slightly better than that of the RPSA-Transformer, it does not have to tune the window length manually.
ASMar 29, 2021
Scaling sparsemax based channel selection for speech recognition with ad-hoc microphone arraysJunqi Chen, Xiao-Lei Zhang
Recently, speech recognition with ad-hoc microphone arrays has received much attention. It is known that channel selection is an important problem of ad-hoc microphone arrays, however, this topic seems far from explored in speech recognition yet, particularly with a large-scale ad-hoc microphone array. To address this problem, we propose a Scaling Sparsemax algorithm for the channel selection problem of the speech recognition with large-scale ad-hoc microphone arrays. Specifically, we first replace the conventional Softmax operator in the stream attention mechanism of a multichannel end-to-end speech recognition system with Sparsemax, which conducts channel selection by forcing the channel weights of noisy channels to zero. Because Sparsemax punishes the weights of many channels to zero harshly, we propose Scaling Sparsemax which punishes the channels mildly by setting the weights of very noisy channels to zero only. Experimental results with ad-hoc microphone arrays of over 30 channels under the conformer speech recognition architecture show that the proposed Scaling Sparsemax yields a word error rate of over 30% lower than Softmax on simulation data sets, and over 20% lower on semi-real data sets, in test scenarios with both matched and mismatched channel numbers.
IRFeb 24, 2021
Deep NMF Topic ModelingJianYu Wang, Xiao-Lei Zhang
Nonnegative matrix factorization (NMF) based topic modeling methods do not rely on model- or data-assumptions much. However, they are usually formulated as difficult optimization problems, which may suffer from bad local minima and high computational complexity. In this paper, we propose a deep NMF (DNMF) topic modeling framework to alleviate the aforementioned problems. It first applies an unsupervised deep learning method to learn latent hierarchical structures of documents, under the assumption that if we could learn a good representation of documents by, e.g. a deep model, then the topic word discovery problem can be boosted. Then, it takes the output of the deep model to constrain a topic-document distribution for the discovery of the discriminant topic words, which not only improves the efficacy but also reduces the computational complexity over conventional unsupervised NMF methods. We constrain the topic-document distribution in three ways, which takes the advantages of the three major sub-categories of NMF -- basic NMF, structured NMF, and constrained NMF respectively. To overcome the weaknesses of deep neural networks in unsupervised topic modeling, we adopt a non-neural-network deep model -- multilayer bootstrap network. To our knowledge, this is the first time that a deep NMF model is used for unsupervised topic modeling. We have compared the proposed method with a number of representative references covering major branches of topic modeling on a variety of real-world text corpora. Experimental results illustrate the effectiveness of the proposed method under various evaluation metrics.
SDJan 16, 2021
Minimum-volume Multichannel Nonnegative matrix factorization for blind source separationJianyu Wang, Shanzheng Guan, Shupei Liu et al.
Multichannel blind audio source separation aims to recover the latent sources from their multichannel mixtures without supervised information. One state-of-the-art blind audio source separation method, named independent low-rank matrix analysis (ILRMA), unifies independent vector analysis (IVA) and nonnegative matrix factorization (NMF). However, the spectra matrix produced from NMF may not find a compact spectral basis. It may not guarantee the identifiability of each source as well. To address this problem, here we propose to enhance the identifiability of the source model by a minimum-volume prior distribution. We further regularize a multichannel NMF (MNMF) and ILRMA respectively with the minimum-volume regularizer. The proposed methods maximize the posterior distribution of the separated sources, which ensures the stability of the convergence. Experimental results demonstrate the effectiveness of the proposed methods compared with auxiliary independent vector analysis, MNMF, ILRMA and its extensions.
SDDec 1, 2020
Deep Ad-hoc Beamforming Based on Speaker Extraction for Target-Dependent Speech SeparationZiye Yang, Shanzheng Guan, Xiao-Lei Zhang
Recently, the research on ad-hoc microphone arrays with deep learning has drawn much attention, especially in speech enhancement and separation. Because an ad-hoc microphone array may cover such a large area that multiple speakers may locate far apart and talk independently, target-dependent speech separation, which aims to extract a target speaker from a mixed speech, is important for extracting and tracing a specific speaker in the ad-hoc array. However, this technique has not been explored yet. In this paper, we propose deep ad-hoc beamforming based on speaker extraction, which is to our knowledge the first work for target-dependent speech separation based on ad-hoc microphone arrays and deep learning. The algorithm contains three components. First, we propose a supervised channel selection framework based on speaker extraction, where the estimated utterance-level SNRs of the target speech are used as the basis for the channel selection. Second, we apply the selected channels to a deep learning based MVDR algorithm, where a single-channel speaker extraction algorithm is applied to each selected channel for estimating the mask of the target speech. We conducted an extensive experiment on a WSJ0-adhoc corpus. Experimental results demonstrate the effectiveness of the proposed method.
SDNov 29, 2020
A comparison of handcrafted, parameterized, and learnable features for speech separationWenbo Zhu, Mou Wang, Xiao-Lei Zhang et al.
The design of acoustic features is important for speech separation. It can be roughly categorized into three classes: handcrafted, parameterized, and learnable features. Among them, learnable features, which are trained with separation networks jointly in an end-to-end fashion, become a new trend of modern speech separation research, e.g. convolutional time domain audio separation network (Conv-Tasnet), while handcrafted and parameterized features are also shown competitive in very recent studies. However, a systematic comparison across the three kinds of acoustic features has not been conducted yet. In this paper, we compare them in the framework of Conv-Tasnet by setting its encoder and decoder with different acoustic features. We also generalize the handcrafted multi-phase gammatone filterbank (MPGTF) to a new parameterized multi-phase gammatone filterbank (ParaMPGTF). Experimental results on the WSJ0-2mix corpus show that (i) if the decoder is learnable, then setting the encoder to STFT, MPGTF, ParaMPGTF, and learnable features lead to similar performance; and (ii) when the pseudo-inverse transforms of STFT, MPGTF, and ParaMPGTF are used as the decoders, the proposed ParaMPGTF performs better than the other two handcrafted features.
ASOct 23, 2020
Speech enhancement aided end-to-end multi-task learning for voice activity detectionXu Tan, Xiao-Lei Zhang
Robust voice activity detection (VAD) is a challenging task in low signal-to-noise (SNR) environments. Recent studies show that speech enhancement is helpful to VAD, but the performance improvement is limited. To address this issue, here we propose a speech enhancement aided end-to-end multi-task model for VAD. The model has two decoders, one for speech enhancement and the other for VAD. The two decoders share the same encoder and speech separation network. Unlike the direct thought that takes two separated objectives for VAD and speech enhancement respectively, here we propose a new joint optimization objective -- VAD-masked scale-invariant source-to-distortion ratio (mSI-SDR). mSI-SDR uses VAD information to mask the output of the speech enhancement decoder in the training process. It makes the VAD and speech enhancement tasks jointly optimized not only at the shared encoder and separation network, but also at the objective level. It also satisfies real-time working requirement theoretically. Experimental results show that the multi-task method significantly outperforms its single-task VAD counterpart. Moreover, mSI-SDR outperforms SI-SDR in the same multi-task setting.
SDApr 25, 2020
Depthwise Separable Convolutional ResNet with Squeeze-and-Excitation Blocks for Small-footprint Keyword SpottingMenglong Xu, Xiao-Lei Zhang
One difficult problem of keyword spotting is how to miniaturize its memory footprint while maintain a high precision. Although convolutional neural networks have shown to be effective to the small-footprint keyword spotting problem, they still need hundreds of thousands of parameters to achieve good performance. In this paper, we propose an efficient model based on depthwise separable convolution layers and squeeze-and-excitation blocks. Specifically, we replace the standard convolution by the depthwise separable convolution, which reduces the number of the parameters of the standard convolution without significant performance degradation. We further improve the performance of the depthwise separable convolution by reweighting the output feature maps of the first convolution layer with a so-called squeeze-and-excitation block. We compared the proposed method with five representative models on two experimental settings of the Google Speech Commands dataset. Experimental results show that the proposed method achieves the state-of-the-art performance. For example, it achieves a classification error rate of 3.29% with a number of parameters of 72K in the first experiment, which significantly outperforms the comparison methods given a similar model size. It achieves an error rate of 3.97% with a number of parameters of 10K, which is also slightly better than the state-of-the-art comparison method given a similar model size.
LGNov 19, 2019
Partial AUC optimization based deep speaker embeddings with class-center learning for text-independent speaker verificationZhongxin Bai, Xiao-Lei Zhang, Jingdong Chen
Deep embedding based text-independent speaker verification has demonstrated superior performance to traditional methods in many challenging scenarios. Its loss functions can be generally categorized into two classes, i.e., verification and identification. The verification loss functions match the pipeline of speaker verification, but their implementations are difficult. Thus, most state-of-the-art deep embedding methods use the identification loss functions with softmax output units or their variants. In this paper, we propose a verification loss function, named the maximization of partial area under the Receiver-operating-characteristic (ROC) curve (pAUC), for deep embedding based text-independent speaker verification. We also propose a class-center based training trial construction method to improve the training efficiency, which is critical for the proposed loss function to be comparable to the identification loss in performance. Experiments on the Speaker in the Wild (SITW) and NIST SRE 2016 datasets show that the proposed pAUC loss function is highly competitive with the state-of-the-art identification loss functions.
LGOct 24, 2019
Deep topic modeling by multilayer bootstrap network and lassoJianyu Wang, Xiao-Lei Zhang
Topic modeling is widely studied for the dimension reduction and analysis of documents. However, it is formulated as a difficult optimization problem. Current approximate solutions also suffer from inaccurate model- or data-assumptions. To deal with the above problems, we propose a polynomial-time deep topic model with no model and data assumptions. Specifically, we first apply multilayer bootstrap network (MBN), which is an unsupervised deep model, to reduce the dimension of documents, and then use the low-dimensional data representations or their clustering results as the target of supervised Lasso for topic word discovery. To our knowledge, this is the first time that MBN and Lasso are applied to unsupervised topic modeling. Experimental comparison results with five representative topic models on the 20-newsgroups and TDT2 corpora illustrate the effectiveness of the proposed algorithm.
SDOct 24, 2019
Multi-channel Speech Separation Using Deep Embedding Model with Multilayer Bootstrap NetworksZiye Yang, Xiao-Lei Zhang
Recently, deep clustering (DPCL) based speaker-independent speech separation has drawn much attention, since it needs little speaker prior information. However, it still has much room of improvement, particularly in reverberant environments. If the training and test environments mismatch which is a common case, the embedding vectors produced by DPCL may contain much noise and many small variations. To deal with the problem, we propose a variant of DPCL, named DPCL++, by applying a recent unsupervised deep learning method---multilayer bootstrap networks(MBN)---to further reduce the noise and small variations of the embedding vectors in an unsupervised way in the test stage, which fascinates k-means to produce a good result. MBN builds a gradually narrowed network from bottom-up via a stack of k-centroids clustering ensembles, where the k-centroids clusterings are trained independently by random sampling and one-nearest-neighbor optimization. To further improve the robustness of DPCL++ in reverberant environments, we take spatial features as part of its input. Experimental results demonstrate the effectiveness of the proposed method.
SDNov 3, 2018
Deep Ad-hoc BeamformingXiao-Lei Zhang
Far-field speech processing is an important and challenging problem. In this paper, we propose \textit{deep ad-hoc beamforming}, a deep-learning-based multichannel speech enhancement framework based on ad-hoc microphone arrays, to address the problem. It contains three novel components. First, it combines \textit{ad-hoc microphone arrays} with deep-learning-based multichannel speech enhancement, which reduces the probability of the occurrence of far-field acoustic environments significantly. Second, it groups the microphones around the speech source to a local microphone array by a supervised channel selection framework based on deep neural networks. Third, it develops a simple time synchronization framework to synchronize the channels that have different time delay. Besides the above novelties and advantages, the proposed model is also trained in a single-channel fashion, so that it can easily employ new development of speech processing techniques. Its test stage is also flexible in incorporating any number of microphones without retraining or modifying the framework. We have developed many implementations of the proposed framework and conducted an extensive experiment in scenarios where the locations of the speech sources are far-field, random, and blind to the microphones. Results on speech enhancement tasks show that our method outperforms its counterpart that works with linear microphone arrays by a considerable margin in both diffuse noise reverberant environments and point source noise reverberant environments. We have also tested the framework with different handcrafted features. Results show that although designing good features lead to high performance, they do not affect the conclusion on the effectiveness of the proposed framework.
SDFeb 12, 2018
Linear Regression for Speaker VerificationXiao-Lei Zhang
This paper presents a linear regression based back-end for speaker verification. Linear regression is a simple linear model that minimizes the mean squared estimation error between the target and its estimate with a closed form solution, where the target is defined as the ground-truth indicator vectors of utterances. We use the linear regression model to learn speaker models from a front-end, and verify the similarity of two speaker models by a cosine similarity scoring classifier. To evaluate the effectiveness of the linear regression model, we construct three speaker verification systems that use the Gaussian mixture model and identity-vector (GMM/i-vector) front-end, deep neural network and i-vector (DNN/i-vector) front-end, and deep vector (d-vector) front-end as their front-ends, respectively. Our empirical comparison results on the NIST speaker recognition evaluation data sets show that the proposed method outperforms within-class covariance normalization, linear discriminant analysis, and probabilistic linear discriminant analysis, given any of the three front-ends.
LGAug 1, 2017
Learning the kernel matrix by resamplingXiao-Lei Zhang
In this abstract paper, we introduce a new kernel learning method by a nonparametric density estimator. The estimator consists of a group of k-centroids clusterings. Each clustering randomly selects data points with randomly selected features as its centroids, and learns a one-hot encoder by one-nearest-neighbor optimization. The estimator generates a sparse representation for each data point. Then, we construct a nonlinear kernel matrix from the sparse representation of data. One major advantage of the proposed kernel method is that it is relatively insensitive to its free parameters, and therefore, it can produce reasonable results without parameter tuning. Another advantage is that it is simple. We conjecture that the proposed method can find its applications in many learning tasks or methods where sparse representation or kernel matrix is explored. In this preliminary study, we have applied the kernel matrix to spectral clustering. Our experimental results demonstrate that the kernel generated by the proposed method outperforms the well-tuned Gaussian RBF kernel. This abstract paper is used to protect the idea, full versions will be updated later.
SDSep 24, 2015
An Investigation of Universal Background Sparse Coding Based Speaker Verification on TIMITXiao-Lei Zhang
In this paper, we propose a universal background model, named universal background sparse coding (UBSC), for speaker verification. The proposed method trains an ensemble of clusterings by data resampling, and produces sparse codes from the clusterings by one-nearest-neighbor optimization plus binarization. The main advantage of UBSC is that it does not suffer from local minima and does not make Gaussian assumptions on data distributions. We evaluated UBSC on a clean speech corpus---TIMIT. We used the cosine similarity and inner product similarity as the scoring methods of a trial. Experimental results show that UBSC is comparable to Gaussian mixture model.
LGSep 21, 2015
Multilayer bootstrap network for unsupervised speaker recognitionXiao-Lei Zhang
We apply multilayer bootstrap network (MBN), a recent proposed unsupervised learning method, to unsupervised speaker recognition. The proposed method first extracts supervectors from an unsupervised universal background model, then reduces the dimension of the high-dimensional supervectors by multilayer bootstrap network, and finally conducts unsupervised speaker recognition by clustering the low-dimensional data. The comparison results with 2 unsupervised and 1 supervised speaker recognition techniques demonstrate the effectiveness and robustness of the proposed method.
LGMar 22, 2015
Unsupervised model compression for multilayer bootstrap networksXiao-Lei Zhang
Recently, multilayer bootstrap network (MBN) has demonstrated promising performance in unsupervised dimensionality reduction. It can learn compact representations in standard data sets, i.e. MNIST and RCV1. However, as a bootstrap method, the prediction complexity of MBN is high. In this paper, we propose an unsupervised model compression framework for this general problem of unsupervised bootstrap methods. The framework compresses a large unsupervised bootstrap model into a small model by taking the bootstrap model and its application together as a black box and learning a mapping function from the input of the bootstrap model to the output of the application by a supervised learner. To specialize the framework, we propose a new technique, named compressive MBN. It takes MBN as the unsupervised bootstrap model and deep neural network (DNN) as the supervised learner. Our initial result on MNIST showed that compressive MBN not only maintains the high prediction accuracy of MBN but also is over thousands of times faster than MBN at the prediction stage. Our result suggests that the new technique integrates the effectiveness of MBN on unsupervised learning and the effectiveness and efficiency of DNN on supervised learning together for the effectiveness and efficiency of compressive MBN on unsupervised learning.
LGDec 3, 2014
Deep Distributed Random Samplings for Supervised Learning: An Alternative to Random Forests?Xiao-Lei Zhang
In (\cite{zhang2014nonlinear,zhang2014nonlinear2}), we have viewed machine learning as a coding and dimensionality reduction problem, and further proposed a simple unsupervised dimensionality reduction method, entitled deep distributed random samplings (DDRS). In this paper, we further extend it to supervised learning incrementally. The key idea here is to incorporate label information into the coding process by reformulating that each center in DDRS has multiple output units indicating which class the center belongs to. The supervised learning method seems somewhat similar with random forests (\cite{breiman2001random}), here we emphasize their differences as follows. (i) Each layer of our method considers the relationship between part of the data points in training data with all training data points, while random forests focus on building each decision tree on only part of training data points independently. (ii) Our method builds gradually-narrowed network by sampling less and less data points, while random forests builds gradually-narrowed network by merging subclasses. (iii) Our method is trained more straightforward from bottom layer to top layer, while random forests build each tree from top layer to bottom layer by splitting. (iv) Our method encodes output targets implicitly in sparse codes, while random forests encode output targets by remembering the class attributes of the activated nodes. Therefore, our method is a simpler, more straightforward, and maybe a better alternative choice, though both methods use two very basic elements---randomization and nearest neighbor optimization---as the core. This preprint is used to protect the incremental idea from (\cite{zhang2014nonlinear,zhang2014nonlinear2}). Full empirical evaluation will be announced carefully later.
LGAug 5, 2014
Multilayer bootstrap networksXiao-Lei Zhang
Multilayer bootstrap network builds a gradually narrowed multilayer nonlinear network from bottom up for unsupervised nonlinear dimensionality reduction. Each layer of the network is a nonparametric density estimator. It consists of a group of k-centroids clusterings. Each clustering randomly selects data points with randomly selected features as its centroids, and learns a one-hot encoder by one-nearest-neighbor optimization. Geometrically, the nonparametric density estimator at each layer projects the input data space to a uniformly-distributed discrete feature space, where the similarity of two data points in the discrete feature space is measured by the number of the nearest centroids they share in common. The multilayer network gradually reduces the nonlinear variations of data from bottom up by building a vast number of hierarchical trees implicitly on the original data space. Theoretically, the estimation error caused by the nonparametric density estimator is proportional to the correlation between the clusterings, both of which are reduced by the randomization steps.
LGDec 16, 2013
Learning Deep Representations By Distributed Random SamplingsXiao-Lei Zhang
In this paper, we propose an extremely simple deep model for the unsupervised nonlinear dimensionality reduction -- deep distributed random samplings, which performs like a stack of unsupervised bootstrap aggregating. First, its network structure is novel: each layer of the network is a group of mutually independent $k$-centers clusterings. Second, its learning method is extremely simple: the $k$ centers of each clustering are only $k$ randomly selected examples from the training data; for small-scale data sets, the $k$ centers are further randomly reconstructed by a simple cyclic-shift operation. Experimental results on nonlinear dimensionality reduction show that the proposed method can learn abstract representations on both large-scale and small-scale problems, and meanwhile is much faster than deep neural networks on large-scale problems.
LGAug 22, 2013
Learning Deep Representation Without Parameter Inference for Nonlinear Dimensionality ReductionXiao-Lei Zhang
Unsupervised deep learning is one of the most powerful representation learning techniques. Restricted Boltzman machine, sparse coding, regularized auto-encoders, and convolutional neural networks are pioneering building blocks of deep learning. In this paper, we propose a new building block -- distributed random models. The proposed method is a special full implementation of the product of experts: (i) each expert owns multiple hidden units and different experts have different numbers of hidden units; (ii) the model of each expert is a k-center clustering, whose k-centers are only uniformly sampled examples, and whose output (i.e. the hidden units) is a sparse code that only the similarity values from a few nearest neighbors are reserved. The relationship between the pioneering building blocks, several notable research branches and the proposed method is analyzed. Experimental results show that the proposed deep model can learn better representations than deep belief networks and meanwhile can train a much larger network with much less time than deep belief networks.
LGMay 5, 2013
Simple Deep Random Model EnsembleXiao-Lei Zhang, Ji Wu
Representation learning and unsupervised learning are two central topics of machine learning and signal processing. Deep learning is one of the most effective unsupervised representation learning approach. The main contributions of this paper to the topics are as follows. (i) We propose to view the representative deep learning approaches as special cases of the knowledge reuse framework of clustering ensemble. (ii) We propose to view sparse coding when used as a feature encoder as the consensus function of clustering ensemble, and view dictionary learning as the training process of the base clusterings of clustering ensemble. (ii) Based on the above two views, we propose a very simple deep learning algorithm, named deep random model ensemble (DRME). It is a stack of random model ensembles. Each random model ensemble is a special k-means ensemble that discards the expectation-maximization optimization of each base k-means but only preserves the default initialization method of the base k-means. (iv) We propose to select the most powerful representation among the layers by applying DRME to clustering where the single-linkage is used as the clustering algorithm. Moreover, the DRME based clustering can also detect the number of the natural clusters accurately. Extensive experimental comparisons with 5 representation learning methods on 19 benchmark data sets demonstrate the effectiveness of DRME.
LGMar 8, 2013
Heuristic Ternary Error-Correcting Output Codes Via Weight Optimization and Layered Clustering-Based ApproachXiao-Lei Zhang
One important classifier ensemble for multiclass classification problems is Error-Correcting Output Codes (ECOCs). It bridges multiclass problems and binary-class classifiers by decomposing multiclass problems to a serial binary-class problems. In this paper, we present a heuristic ternary code, named Weight Optimization and Layered Clustering-based ECOC (WOLC-ECOC). It starts with an arbitrary valid ECOC and iterates the following two steps until the training risk converges. The first step, named Layered Clustering based ECOC (LC-ECOC), constructs multiple strong classifiers on the most confusing binary-class problem. The second step adds the new classifiers to ECOC by a novel Optimized Weighted (OW) decoding algorithm, where the optimization problem of the decoding is solved by the cutting plane algorithm. Technically, LC-ECOC makes the heuristic training process not blocked by some difficult binary-class problem. OW decoding guarantees the non-increase of the training risk for ensuring a small code length. Results on 14 UCI datasets and a music genre classification problem demonstrate the effectiveness of WOLC-ECOC.
LGMar 8, 2013
Convex Discriminative Multitask ClusteringXiao-Lei Zhang
Multitask clustering tries to improve the clustering performance of multiple tasks simultaneously by taking their relationship into account. Most existing multitask clustering algorithms fall into the type of generative clustering, and none are formulated as convex optimization problems. In this paper, we propose two convex Discriminative Multitask Clustering (DMTC) algorithms to address the problems. Specifically, we first propose a Bayesian DMTC framework. Then, we propose two convex DMTC objectives within the framework. The first one, which can be seen as a technical combination of the convex multitask feature learning and the convex Multiclass Maximum Margin Clustering (M3C), aims to learn a shared feature representation. The second one, which can be seen as a combination of the convex multitask relationship learning and M3C, aims to learn the task relationship. The two objectives are solved in a uniform procedure by the efficient cutting-plane algorithm. Experimental results on a toy problem and two benchmark datasets demonstrate the effectiveness of the proposed algorithms.
LGMar 8, 2013
Transfer Learning for Voice Activity Detection: A Denoising Deep Neural Network PerspectiveXiao-Lei Zhang, Ji Wu
Mismatching problem between the source and target noisy corpora severely hinder the practical use of the machine-learning-based voice activity detection (VAD). In this paper, we try to address this problem in the transfer learning prospective. Transfer learning tries to find a common learning machine or a common feature subspace that is shared by both the source corpus and the target corpus. The denoising deep neural network is used as the learning machine. Three transfer techniques, which aim to learn common feature representations, are used for analysis. Experimental results demonstrate the effectiveness of the transfer learning schemes on the mismatch problem.
LGMar 4, 2013
Denoising Deep Neural Networks Based Voice Activity DetectionXiao-Lei Zhang, Ji Wu
Recently, the deep-belief-networks (DBN) based voice activity detection (VAD) has been proposed. It is powerful in fusing the advantages of multiple features, and achieves the state-of-the-art performance. However, the deep layers of the DBN-based VAD do not show an apparent superiority to the shallower layers. In this paper, we propose a denoising-deep-neural-network (DDNN) based VAD to address the aforementioned problem. Specifically, we pre-train a deep neural network in a special unsupervised denoising greedy layer-wise mode, and then fine-tune the whole network in a supervised way by the common back-propagation algorithm. In the pre-training phase, we take the noisy speech signals as the visible layer and try to extract a new feature that minimizes the reconstruction cross-entropy loss between the noisy speech signals and its corresponding clean speech signals. Experimental results show that the proposed DDNN-based VAD not only outperforms the DBN-based VAD but also shows an apparent performance improvement of the deep layers over shallower layers.