Yanlu Xie

h-index9

6papers

62citations

Novelty46%

AI Score26

Ranked #158,376 of 194,257 authors (top 82%)#1,374 in SD (top 76%)

6 Papers

7.1SDJun 15, 2022Code

Text-Aware End-to-end Mispronunciation Detection and Diagnosis

Linkai Peng, Yingming Gao, Binghuai Lin et al.

Mispronunciation detection and diagnosis (MDD) technology is a key component of computer-assisted pronunciation training system (CAPT). In the field of assessing the pronunciation quality of constrained speech, the given transcriptions can play the role of a teacher. Conventional methods have fully utilized the prior texts for the model construction or improving the system performance, e.g. forced-alignment and extended recognition networks. Recently, some end-to-end based methods attempt to incorporate the prior texts into model training and preliminarily show the effectiveness. However, previous studies mostly consider applying raw attention mechanism to fuse audio representations with text representations, without taking possible text-pronunciation mismatch into account. In this paper, we present a gating strategy that assigns more importance to the relevant audio features while suppressing irrelevant text information. Moreover, given the transcriptions, we design an extra contrastive loss to reduce the gap between the learning objective of phoneme recognition and MDD. We conducted experiments using two publicly available datasets (TIMIT and L2-Arctic) and our best model improved the F1 score from $57.51\%$ to $61.75\%$ compared to the baselines. Besides, we provide a detailed analysis to shed light on the effectiveness of gating mechanism and contrastive learning on MDD.

2.3SDMay 6, 2021

Speech Enhancement using Separable Polling Attention and Global Layer Normalization followed with PReLU

Dengfeng Ke, Jinsong Zhang, Yanlu Xie et al.

Single channel speech enhancement is a challenging task in speech community. Recently, various neural networks based methods have been applied to speech enhancement. Among these models, PHASEN and T-GSA achieve state-of-the-art performances on the publicly opened VoiceBank+DEMAND corpus. Both of the models reach the COVL score of 3.62. PHASEN achieves the highest CSIG score of 4.21 while T-GSA gets the highest PESQ score of 3.06. However, both of these two models are very large. The contradiction between the model performance and the model size is hard to reconcile. In this paper, we introduce three kinds of techniques to shrink the PHASEN model and improve the performance. Firstly, seperable polling attention is proposed to replace the frequency transformation blocks in PHASEN. Secondly, global layer normalization followed with PReLU is used to replace batch normalization followed with ReLU. Finally, BLSTM in PHASEN is replaced with Conv2d operation and the phase stream is simplified. With all these modifications, the size of the PHASEN model is shrunk from 33M parameters to 5M parameters, while the performance on VoiceBank+DEMAND is improved to the CSIG score of 4.30, the PESQ score of 3.07 and the COVL score of 3.73.

3.2CLApr 17, 2021Code

A Full Text-Dependent End to End Mispronunciation Detection and Diagnosis with Easy Data Augmentation Techniques

Kaiqi Fu, Jones Lin, Dengfeng Ke et al.

Recently, end-to-end mispronunciation detection and diagnosis (MD&D) systems has become a popular alternative to greatly simplify the model-building process of conventional hybrid DNN-HMM systems by representing complicated modules with a single deep network architecture. In this paper, in order to utilize the prior text in the end-to-end structure, we present a novel text-dependent model which is difference with sed-mdd, the model achieves a fully end-to-end system by aligning the audio with the phoneme sequences of the prior text inside the model through the attention mechanism. Moreover, the prior text as input will be a problem of imbalance between positive and negative samples in the phoneme sequence. To alleviate this problem, we propose three simple data augmentation methods, which effectively improve the ability of model to capture mispronounced phonemes. We conduct experiments on L2-ARCTIC, and our best performance improved from 49.29% to 56.08% in F-measure metric compared to the CNN-RNN-CTC model.

4.3ASSep 5, 2020

A multi-view approach for Mandarin non-native mispronunciation verification

Zhenyu Wang, John H. L. Hansen, Yanlu Xie

Traditionally, the performance of non-native mispronunciation verification systems relied on effective phone-level labelling of non-native corpora. In this study, a multi-view approach is proposed to incorporate discriminative feature representations which requires less annotation for non-native mispronunciation verification of Mandarin. Here, models are jointly learned to embed acoustic sequence and multi-source information for speech attributes and bottleneck features. Bidirectional LSTM embedding models with contrastive losses are used to map acoustic sequences and multi-source information into fixed-dimensional embeddings. The distance between acoustic embeddings is taken as the similarity between phones. Accordingly, examples of mispronounced phones are expected to have a small similarity score with their canonical pronunciations. The approach shows improvement over GOP-based approach by +11.23% and single-view approach by +1.47% in diagnostic accuracy for a mispronunciation verification task.

1.2ASMay 21, 2020

Formant Tracking Using Dilated Convolutional Networks Through Dense Connection with Gating Mechanism

Wang Dai, Jinsong Zhang, Yingming Gao et al.

Formant tracking is one of the most fundamental problems in speech processing. Traditionally, formants are estimated using signal processing methods. Recent studies showed that generic convolutional architectures can outperform recurrent networks on temporal tasks such as speech synthesis and machine translation. In this paper, we explored the use of Temporal Convolutional Network (TCN) for formant tracking. In addition to the conventional implementation, we modified the architecture from three aspects. First, we turned off the "causal" mode of dilated convolution, making the dilated convolution see the future speech frames. Second, each hidden layer reused the output information from all the previous layers through dense connection. Third, we also adopted a gating mechanism to alleviate the problem of gradient disappearance by selectively forgetting unimportant information. The model was validated on the open access formant database VTR. The experiment showed that our proposed model was easy to converge and achieved an overall mean absolute percent error (MAPE) of 8.2% on speech-labeled frames, compared to three competitive baselines of 9.4% (LSTM), 9.1% (Bi-LSTM) and 8.9% (TCN).

1.2NCJan 19, 2019

Convolution Forgetting Curve Model for Repeated Learning

Yanlu Xie, Yue Chen, Man Li

Most of mathematic forgetting curve models fit well with the forgetting data under the learning condition of one time rather than repeated. In the paper, a convolution model of forgetting curve is proposed to simulate the memory process during learning. In this model, the memory ability (i.e. the central procedure in the working memory model) and learning material (i.e. the input in the working memory model) is regarded as the system function and the input function, respectively. The status of forgetting (i.e. the output in the working memory model) is regarded as output function or the convolution result of the memory ability and learning material. The model is applied to simulate the forgetting curves in different situations. The results show that the model is able to simulate the forgetting curves not only in one time learning condition but also in multi-times condition. The model is further verified in the experiments of Mandarin tone learning for Japanese learners. And the predicted curve fits well on the test points.