SDMar 3, 2023
DWFormer: Dynamic Window transFormer for Speech Emotion RecognitionShuaiqi Chen, Xiaofen Xing, Weibin Zhang et al.
Speech emotion recognition is crucial to human-computer interaction. The temporal regions that represent different emotions scatter in different parts of the speech locally. Moreover, the temporal scales of important information may vary over a large range within and across speech segments. Although transformer-based models have made progress in this field, the existing models could not precisely locate important regions at different temporal scales. To address the issue, we propose Dynamic Window transFormer (DWFormer), a new architecture that leverages temporal importance by dynamically splitting samples into windows. Self-attention mechanism is applied within windows for capturing temporal important information locally in a fine-grained way. Cross-window information interaction is also taken into account for global communication. DWFormer is evaluated on both the IEMOCAP and the MELD datasets. Experimental results show that the proposed model achieves better performance than the previous state-of-the-art methods.
LGJan 17, 2024
Federated Unlearning for Human Activity RecognitionKongyang Chen, Dongping zhang, Yaping Chai et al.
The rapid evolution of Internet of Things (IoT) technology has spurred the widespread adoption of Human Activity Recognition (HAR) in various daily life domains. Federated Learning (FL) is frequently utilized to build a global HAR model by aggregating user contributions without transmitting raw individual data. Despite substantial progress in user privacy protection with FL, challenges persist. Regulations like the General Data Protection Regulation (GDPR) empower users to request data removal, raising a new query in FL: How can a HAR client request data removal without compromising other clients' privacy? In response, we propose a lightweight machine unlearning method for refining the FL HAR model by selectively removing a portion of a client's training data. Our method employs a third-party dataset unrelated to model training. Using KL divergence as a loss function for fine-tuning, we aim to align the predicted probability distribution on forgotten data with the third-party dataset. Additionally, we introduce a membership inference evaluation method to assess unlearning effectiveness. Experimental results across diverse datasets show our method achieves unlearning accuracy comparable to \textit{retraining} methods, resulting in speedups ranging from hundreds to thousands.
CVSep 15, 2023
An inspection technology of inner surface of the fine hole based on machine visionRongfang He, Weibin Zhang, Guofang Gao
Fine holes are an important structural component of industrial components, and their inner surface quality is closely related to their function.In order to detect the quality of the inner surface of the fine hole,a special optical measurement system was investigated in this paper. A sight pipe is employed to guide the external illumination light into the fine hole and output the relevant images simultaneously. A flexible light array is introduced to suit the narrow space, and the effective field of view is analyzed. Besides, the arc surface projection error and manufacturing assembly error of the device are analyzed, then compensated or ignored if small enough. In the test of prefabricated circular defects with the diameter φ0.1mm, φ0.2mm, 0.4mm distance distribution and the fissure defects with the width 0.3mm, the maximum measurement error standard deviation are all about 10μm. The minimum diameter of the measured fine hole is 4mm and the depth can reach 47mm.
SPApr 27, 2024
Co-learning-aided Multi-modal-deep-learning Framework of Passive DOA Estimators for a Heterogeneous Hybrid Massive MIMO ReceiverJiatong Bai, Feng Shu, Qinghe Zheng et al.
Due to its excellent performance in rate and resolution, fully-digital (FD) massive multiple-input multiple-output (MIMO) antenna arrays has been widely applied in data transmission and direction of arrival (DOA) measurements, etc. But it confronts with two main challenges: high computational complexity and circuit cost. The two problems may be addressed well by hybrid analog-digital (HAD) structure. But there exists the problem of phase ambiguity for HAD, which leads to its low-efficiency or high-latency. Does exist there such a MIMO structure of owning low-cost, low-complexity and high time efficiency at the same time. To satisfy the three properties, a novel heterogeneous hybrid MIMO receiver structure of integrating FD and heterogeneous HAD ($\rm{H}^2$AD-FD) is proposed and corresponding multi-modal (MD)-learning framework is developed. The framework includes three major stages: 1) generate the candidate sets via root multiple signal classification (Root-MUSIC) or deep learning (DL); 2) infer the class of true solutions from candidate sets using machine learning (ML) methods; 3) fuse the two-part true solutions to achieve a better DOA estimation. The above process form two methods named MD-Root-MUSIC and MDDL. To improve DOA estimation accuracy and reduce the clustering complexity, a co-learning-aided MD framework is proposed to form two enhanced methods named CoMDDL and CoMD-RootMUSIC. Moreover, the Cramer-Rao lower bound (CRLB) for the proposed $\rm{H}^2$AD-FD structure is also derived. Experimental results demonstrate that our proposed four methods could approach the CRLB for signal-to-noise ratio (SNR) > 0 dB and the proposed CoMDDL and MDDL perform better than CoMD-RootMUSIC and MD-RootMUSIC, particularly in the extremely low SNR region.
CRMar 16, 2025
Defense Against Model Stealing Based on Account-Aware Distribution DiscrepancyJian-Ping Mei, Weibin Zhang, Jie Chen et al.
Malicious users attempt to replicate commercial models functionally at low cost by training a clone model with query responses. It is challenging to timely prevent such model-stealing attacks to achieve strong protection and maintain utility. In this paper, we propose a novel non-parametric detector called Account-aware Distribution Discrepancy (ADD) to recognize queries from malicious users by leveraging account-wise local dependency. We formulate each class as a Multivariate Normal distribution (MVN) in the feature space and measure the malicious score as the sum of weighted class-wise distribution discrepancy. The ADD detector is combined with random-based prediction poisoning to yield a plug-and-play defense module named D-ADD for image classification models. Results of extensive experimental studies show that D-ADD achieves strong defense against different types of attacks with little interference in serving benign users for both soft and hard-label settings.
IVDec 9, 2024
A CT Image Denoising Method Based on Projection Domain FeatureMengyu Sun, Dimeng Xia, Shusen Zhao et al.
In order to improve image quality of projection in industrial applications, generally, a standard method is to increase the current or exposure time, which might cause overexposure of detector units in areas of thin objects or backgrounds. Increasing the projection sampling is a better method to address the issue, but it also leads to significant noise in the reconstructed image. This paper proposed a projection domain denoising algorithm based on the features of the projection domain for this case. This algorithm utilized the similarity of projections of neighboring veiws to reduce image noise quickly and effectively. The availability of the algorithm proposed in this work has been conducted by numerical simulation and practical data experiments.
ASFeb 2, 2022
The CORAL++ Algorithm for Unsupervised Domain Adaptation of Speaker RecogntionRongjin Li, Weibin Zhang, Dongpeng Chen
State-of-the-art speaker recognition systems are trained with a large amount of human-labeled training data set. Such a training set is usually composed of various data sources to enhance the modeling capability of models. However, in practical deployment, unseen condition is almost inevitable. Domain mismatch is a common problem in real-life applications due to the statistical difference between the training and testing data sets. To alleviate the degradation caused by domain mismatch, we propose a new feature-based unsupervised domain adaptation algorithm. The algorithm we propose is a further optimization based on the well-known CORrelation ALignment (CORAL), so we call it CORAL++. On the NIST 2019 Speaker Recognition Evaluation (SRE19), we use SRE18 CTS set as the development set to verify the effectiveness of CORAL++. With the typical x-vector/PLDA setup, the CORAL++ outperforms the CORAL by 9.40% relatively on EER.
ASJan 26, 2022
Internal Language Model Estimation Through Explicit Context Vector Learning for Attention-based Encoder-decoder ASRYufei Liu, Rao Ma, Haihua Xu et al.
An end-to-end (E2E) ASR model implicitly learns a prior Internal Language Model (ILM) from the training transcripts. To fuse an external LM using Bayes posterior theory, the log likelihood produced by the ILM has to be accurately estimated and subtracted. In this paper we propose two novel approaches to estimate the ILM based on Listen-Attend-Spell (LAS) framework. The first method is to replace the context vector of the LAS decoder at every time step with a vector that is learned with training transcripts. Furthermore, we propose another method that uses a lightweight feed-forward network to directly map query vector to context vector in a dynamic sense. Since the context vectors are learned by minimizing the perplexities on training transcripts, and their estimation is independent of encoder output, hence the ILMs are accurately learned for both methods. Experiments show that the ILMs achieve the lowest perplexity, indicating the efficacy of the proposed methods. In addition, they also significantly outperform the shallow fusion method, as well as two previously proposed ILM Estimation (ILME) approaches on several datasets.
CLMay 1, 2020
Multi-head Monotonic Chunkwise Attention For Online Speech RecognitionBaiji Liu, Songjun Cao, Sining Sun et al.
The attention mechanism of the Listen, Attend and Spell (LAS) model requires the whole input sequence to calculate the attention context and thus is not suitable for online speech recognition. To deal with this problem, we propose multi-head monotonic chunk-wise attention (MTH-MoChA), an improved version of MoChA. MTH-MoChA splits the input sequence into small chunks and computes multi-head attentions over the chunks. We also explore useful training strategies such as LSTM pooling, minimum world error rate training and SpecAugment to further improve the performance of MTH-MoChA. Experiments on AISHELL-1 data show that the proposed model, along with the training strategies, improve the character error rate (CER) of MoChA from 8.96% to 7.68% on test set. On another 18000 hours in-car speech data set, MTH-MoChA obtains 7.28% CER, which is significantly better than a state-of-the-art hybrid system.
SYFeb 22, 2020
Vehicle Tracking in Wireless Sensor Networks via Deep Reinforcement LearningJun Li, Zhichao Xing, Weibin Zhang et al.
Vehicle tracking has become one of the key applications of wireless sensor networks (WSNs) in the fields of rescue, surveillance, traffic monitoring, etc. However, the increased tracking accuracy requires more energy consumption. In this letter, a decentralized vehicle tracking strategy is conceived for improving both tracking accuracy and energy saving, which is based on adjusting the intersection area between the fixed sensing area and the dynamic activation area. Then, two deep reinforcement learning (DRL) aided solutions are proposed relying on the dynamic selection of the activation area radius. Finally, simulation results show the superiority of our DRL aided design.
CLJun 26, 2019
Essence Knowledge Distillation for Speech RecognitionZhenchuan Yang, Chun Zhang, Weibin Zhang et al.
It is well known that a speech recognition system that combines multiple acoustic models trained on the same data significantly outperforms a single-model system. Unfortunately, real time speech recognition using a whole ensemble of models is too computationally expensive. In this paper, we propose to distill the knowledge of essence in an ensemble of models (i.e. the teacher model) to a single model (i.e. the student model) that needs much less computation to deploy. Previously, all the soften outputs of the teacher model are used to optimize the student model. We argue that not all the outputs of the ensemble are necessary to be distilled. Some of the outputs may even contain noisy information that is useless or even harmful to the training of the student model. In addition, we propose to train the student model with a multitask learning approach by utilizing both the soften outputs of the teacher model and the correct hard labels. The proposed method achieves some surprising results on the Switchboard data set. When the student model is trained together with the correct labels and the essence knowledge from the teacher model, it not only significantly outperforms another single model with the same architecture that is trained only with the correct labels, but also consistently outperforms the teacher model that is used to generate the soft labels.