CL · Nov 3, 2023 · Code
FaMeSumm: Investigating and Improving Faithfulness of Medical Summarization
Nan Zhang, Yusen Zhang, Wu Guo et al.
Summaries of medical text should be faithful, that is, consistent and factual with respect to the source inputs; this is an important but understudied topic for safety and efficiency in healthcare. In this paper, we investigate and improve faithfulness in summarization on a broad range of medical summarization tasks. Our investigation reveals that current summarization models often produce unfaithful outputs for medical input text. We then introduce FaMeSumm, a framework to improve faithfulness by fine-tuning pre-trained language models based on medical knowledge. FaMeSumm performs contrastive learning on designed sets of faithful and unfaithful summaries, and it incorporates medical terms and their contexts to encourage faithful generation of medical terms. We conduct comprehensive experiments on three datasets in two languages: health question and radiology report summarization datasets in English, and a patient-doctor dialogue dataset in Chinese. Results demonstrate that FaMeSumm is flexible and effective by delivering consistent improvements over mainstream language models such as BART, T5, mT5, and PEGASUS, yielding state-of-the-art performance on metrics for faithfulness and general quality. Human evaluation by doctors also shows that FaMeSumm generates more faithful outputs. Our code is available at https://github.com/psunlpgroup/FaMeSumm.
CL · Dec 7, 2022
WIDER & CLOSER: Mixture of Short-channel Distillers for Zero-shot Cross-lingual Named Entity Recognition
Jun-Yu Ma, Beiduo Chen, Jia-Chen Gu et al.
Zero-shot cross-lingual named entity recognition (NER) aims at transferring knowledge from annotated and rich-resource data in source languages to unlabeled and lean-resource data in target languages. Existing mainstream methods based on the teacher-student distillation framework ignore the rich and complementary information lying in the intermediate layers of pre-trained language models, and domain-invariant information is easily lost during transfer. In this study, a mixture of short-channel distillers (MSD) method is proposed to let the rich hierarchical information in the teacher model interact fully and to transfer knowledge to the student model sufficiently and efficiently. Concretely, a multi-channel distillation framework is designed for sufficient information transfer by aggregating multiple distillers as a mixture. In addition, an unsupervised method adopting parallel domain adaptation is proposed to shorten the channels between the teacher and student models to preserve domain-invariant features. Experiments on four datasets across nine languages demonstrate that the proposed method achieves new state-of-the-art performance on zero-shot cross-lingual NER and shows great generalization and compatibility across languages and fields.
CL · Feb 28, 2023 · Code
Language-Universal Adapter Learning with Knowledge Distillation for End-to-End Multilingual Speech Recognition
Zhijie Shen, Wu Guo, Bin Gu
In this paper, we propose a language-universal adapter learning framework based on a pre-trained model for end-to-end multilingual automatic speech recognition (ASR). For acoustic modeling, the wav2vec 2.0 pre-trained model is fine-tuned by inserting language-specific and language-universal adapters. An online knowledge distillation is then used to enable the language-universal adapters to learn both language-specific and universal features. The linguistic information confusion is also reduced by leveraging language identifiers (LIDs). With LIDs we perform a position-wise modification on the multi-head attention outputs. In the inference procedure, the language-specific adapters are removed while the language-universal adapters are kept activated. The proposed method improves the recognition accuracy and addresses the linear increase of the number of adapters' parameters with the number of languages in common multilingual ASR systems. Experiments on the BABEL dataset confirm the effectiveness of the proposed framework. Compared to the conventional multilingual model, a 3.3% absolute error rate reduction is achieved. The code is available at: https://github.com/shen9712/UniversalAdapterLearning.
CL · Mar 7, 2022
USTC-NELSLIP at SemEval-2022 Task 11: Gazetteer-Adapted Integration Network for Multilingual Complex Named Entity Recognition
Beiduo Chen, Jun-Yu Ma, Jiajun Qi et al.
This paper describes the system developed by the USTC-NELSLIP team for SemEval-2022 Task 11 Multilingual Complex Named Entity Recognition (MultiCoNER). We propose a gazetteer-adapted integration network (GAIN) to improve the performance of language models for recognizing complex named entities. The method first adapts the representations of gazetteer networks to those of language models by minimizing the KL divergence between them. After adaptation, these two networks are then integrated for backend supervised named entity recognition (NER) training. The proposed method is applied to several state-of-the-art Transformer-based NER models with a gazetteer built from Wikidata, and shows great generalization ability across them. The final predictions are derived from an ensemble of these trained models. Experimental results and detailed analysis verify the effectiveness of the proposed method. The official results show that our system ranked 1st on three tracks (Chinese, Code-mixed and Bangla) and 2nd on the other ten tracks in this task.
SD · Sep 3, 2024
USTC-KXDIGIT System Description for ASVspoof5 Challenge
Yihao Chen, Haochen Wu, Nan Jiang et al.
This paper describes the USTC-KXDIGIT system submitted to the ASVspoof5 Challenge for Track 1 (speech deepfake detection) and Track 2 (spoofing-robust automatic speaker verification, SASV). Track 1 showcases a diverse range of technical qualities from potential processing algorithms and includes both open and closed conditions. For these conditions, our system consists of a cascade of a front-end feature extractor and a back-end classifier. We focus on extensive embedding engineering and enhancing the generalization of the back-end classifier model. Specifically, the embedding engineering is based on hand-crafted features and speech representations from a self-supervised model, used for closed and open conditions, respectively. To detect spoof attacks under various adversarial conditions, we trained multiple systems on an augmented training set. Additionally, we used voice conversion technology to synthesize fake audio from genuine audio in the training set to enrich the synthesis algorithms. To leverage the complementary information learned by different model architectures, we employed activation ensemble and fused scores from different systems to obtain the final decision score for spoof detection. During the evaluation phase, the proposed methods achieved 0.3948 minDCF and 14.33% EER in the closed condition, and 0.0750 minDCF and 2.59% EER in the open condition, demonstrating the robustness of our submitted systems under adversarial conditions. In Track 2, we continued using the CM system from Track 1 and fused it with a CNN-based ASV system. This approach achieved 0.2814 min-aDCF in the closed condition and 0.0756 min-aDCF in the open condition, showcasing superior performance in the SASV system.
CL · May 17, 2022
Feature Aggregation in Zero-Shot Cross-Lingual Transfer Using Multilingual BERT
Beiduo Chen, Wu Guo, Quan Liu et al.
Multilingual BERT (mBERT), a language model pre-trained on large multilingual corpora, has impressive zero-shot cross-lingual transfer capabilities and performs surprisingly well on zero-shot POS tagging and Named Entity Recognition (NER), as well as on cross-lingual model transfer. At present, the mainstream methods for cross-lingual downstream tasks typically use the output of mBERT's last transformer layer as the representation of linguistic information. In this work, we explore the complementary property of lower layers to the last transformer layer of mBERT. A feature aggregation module based on an attention mechanism is proposed to fuse the information contained in different layers of mBERT. The experiments are conducted on four zero-shot cross-lingual transfer datasets, and the proposed method obtains performance improvements on key multilingual benchmark tasks XNLI (+1.5%), PAWS-X (+2.4%), NER (+1.2 F1), and POS (+1.5 F1). Through the analysis of the experimental results, we show that the layers before the last layer of mBERT can provide extra useful information for cross-lingual downstream tasks and explore the interpretability of mBERT empirically.
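The attention-based fusion of layer outputs described in this abstract can be illustrated with a small sketch. It is not the authors' module: it assumes a single hypothetical learned vector `query` that scores each layer's hidden vector by dot product, and the function name `aggregate_layers` is invented for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(layer_vectors, query):
    """Fuse per-layer hidden vectors of one token into a single vector.

    layer_vectors: one vector per transformer layer (all the same length).
    query: hypothetical learned vector used to score each layer (assumption).
    Returns the fused vector and the attention weight given to each layer.
    """
    # Score each layer by its dot product with the query vector.
    scores = [sum(q * h for q, h in zip(query, vec)) for vec in layer_vectors]
    weights = softmax(scores)
    dim = len(layer_vectors[0])
    # The fused representation is the attention-weighted sum over layers.
    fused = [
        sum(w * vec[i] for w, vec in zip(weights, layer_vectors))
        for i in range(dim)
    ]
    return fused, weights
```

With an all-zero query, every layer receives the same weight and the result is a plain average; a trained query would shift weight toward the layers that help the downstream task.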
CL · May 6, 2023 · Code
Pre-training Language Model as a Multi-perspective Course Learner
Beiduo Chen, Shaohan Huang, Zihan Zhang et al.
ELECTRA, the generator-discriminator pre-training framework, has achieved impressive semantic construction capability among various downstream tasks. Despite the convincing performance, ELECTRA still faces the challenges of monotonous training and deficient interaction. A generator trained with only masked language modeling (MLM) leads to biased learning and label imbalance for the discriminator, decreasing learning efficiency; the lack of an explicit feedback loop from discriminator to generator results in a chasm between these two components, underutilizing the course learning. In this study, a multi-perspective course learning (MCL) method is proposed to provide multiple degrees and visual angles for sample-efficient pre-training, and to fully leverage the relationship between generator and discriminator. Concretely, three self-supervision courses are designed to alleviate inherent flaws of MLM and balance the label in a multi-perspective way. In addition, two self-correction courses are proposed to bridge the chasm between the two encoders by creating a "correction notebook" for secondary supervision. Moreover, a course soups trial is conducted to solve the "tug-of-war" dynamics problem of MCL, evolving a stronger pre-trained model. Experimental results show that our method significantly improves ELECTRA's average performance by 2.8% and 3.2% absolute points respectively on GLUE and SQuAD 2.0 benchmarks, and outperforms recent advanced ELECTRA-style models under the same settings. The pre-trained MCL model is available at https://huggingface.co/McmanusChen/MCL-base.
SD · Dec 12, 2024
On the Generation and Removal of Speaker Adversarial Perturbation for Voice-Privacy Protection
Chenyang Guo, Liping Chen, Zhuhai Li et al.
Neural networks are commonly known to be vulnerable to adversarial attacks mounted through subtle perturbation on the input data. Recent developments in voice-privacy protection have shown positive use cases of the same technique: concealing a speaker's voice attributes with an additive perturbation signal generated by an adversarial network. This paper examines the reversibility property where an entity generating the adversarial perturbations is authorized to remove them and restore original speech (e.g., the speaker him/herself). A similar technique could also be used by an investigator to deanonymize voice-protected speech to restore criminals' identities in security and forensic analysis. In this setting, the perturbation generative module is assumed to be known in the removal process. To this end, a joint training of perturbation generation and removal modules is proposed. Experimental results on the LibriSpeech dataset demonstrated that the subtle perturbations added to the original speech can be predicted from the anonymized speech while achieving the goal of privacy protection. By removing these perturbations from the anonymized sample, the original speech can be restored. Audio samples can be found at https://voiceprivacy.github.io/Perturbation-Generation-Removal/.
CL · Feb 26, 2022
Multi-Level Contrastive Learning for Cross-Lingual Alignment
Beiduo Chen, Wu Guo, Bin Gu et al.
Cross-language pre-trained models such as multilingual BERT (mBERT) have achieved strong performance in various cross-lingual downstream NLP tasks. This paper proposes a multi-level contrastive learning (ML-CTL) framework to further improve the cross-lingual ability of pre-trained models. The proposed method uses translated parallel data to encourage the model to generate similar semantic embeddings for different languages. However, unlike the sentence-level alignment used in most previous studies, in this paper, we explicitly integrate the word-level information of each pair of parallel sentences into contrastive learning. Moreover, a cross-zero noise contrastive estimation (CZ-NCE) loss is proposed to alleviate the impact of floating-point error when training with a small batch size. The proposed method significantly improves the cross-lingual transfer ability of our basic model (mBERT) and outperforms same-size models on multiple zero-shot cross-lingual downstream tasks in the XTREME benchmark.
CL · Jun 16, 2021
Topic Classification on Spoken Documents Using Deep Acoustic and Linguistic Features
Tan Liu, Wu Guo, Bin Gu
Topic classification systems on spoken documents usually consist of two modules: an automatic speech recognition (ASR) module to convert speech into text and a text topic classification (TTC) module to predict the topic class from the decoded text. In this paper, instead of using the ASR transcripts, the fusion of deep acoustic and linguistic features is used for topic classification on spoken documents. More specifically, a conventional CTC-based acoustic model (AM) using phonemes as output units is first trained, and the outputs of the layer before the linear phoneme classifier in the trained AM are used as the deep acoustic features of spoken documents. Furthermore, these deep acoustic features are fed to a phoneme-to-word (P2W) module to obtain deep linguistic features. Finally, a local multi-head attention module is proposed to fuse these two types of deep features for topic classification. Experiments conducted on a subset selected from Switchboard corpus show that our proposed framework outperforms the conventional ASR+TTC systems and achieves a 3.13% improvement in ACC.
AS · Mar 29, 2021
Improved Meta-Learning Training for Speaker Verification
Yafeng Chen, Wu Guo, Bin Gu
Meta-learning has recently become a research hotspot in speaker verification (SV). We introduce two methods to improve the meta-learning training for SV in this paper. For the first method, a backbone embedding network is first jointly trained with the conventional cross entropy loss and prototypical networks (PN) loss. Then, inspired by speaker adaptive training in speech recognition, additional transformation coefficients are trained with only the PN loss. The transformation coefficients are used to modify the original backbone embedding network in the x-vector extraction process. Furthermore, the random erasing data augmentation technique is applied to all support samples in each episode to construct positive pairs, and a contrastive loss between the augmented and the original support samples is added to the objective in model training. Experiments are carried out on the SITW and VOiCES databases. Both of the methods can obtain consistent improvements over existing meta-learning training frameworks. By combining these two methods, we can observe further improvements on these two databases.
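The prototypical networks (PN) loss mentioned in this abstract can be sketched in its standard form: each speaker's prototype is the mean of its support embeddings, and a query embedding is classified by a softmax over negative squared Euclidean distances to the prototypes. The sketch below is a generic illustration with hypothetical names (`build_prototypes`, `pn_loss`); it does not include the transformation coefficients or the contrastive term described in the paper.

```python
import math


def build_prototypes(support):
    """Average the support embeddings of each speaker.

    support: dict mapping a speaker id to a list of embedding vectors.
    """
    protos = {}
    for spk, vecs in support.items():
        dim = len(vecs[0])
        protos[spk] = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    return protos


def pn_loss(query_vec, true_spk, protos):
    """Negative log-probability of the true speaker for one query embedding."""
    # Logits are negative squared Euclidean distances to each prototype.
    logits = {
        spk: -sum((a - b) ** 2 for a, b in zip(query_vec, proto))
        for spk, proto in protos.items()
    }
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits.values())
    log_z = m + math.log(sum(math.exp(l - m) for l in logits.values()))
    return log_z - logits[true_spk]
```

In an episode, the loss would be averaged over all query embeddings; a query that sits on its own speaker's prototype and far from the others yields a loss close to zero.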
AS · Oct 21, 2020
Multi-task Metric Learning for Text-independent Speaker Verification
Yafeng Chen, Wu Guo, Jingjing Shi et al.
In this work, we introduce metric learning (ML) to enhance the deep embedding learning for text-independent speaker verification (SV). Specifically, the deep speaker embedding network is trained with a conventional cross entropy loss and an auxiliary pair-based ML loss function. For the auxiliary ML task, training samples of a mini-batch are first arranged into pairs, then positive and negative pairs are selected and weighted through their own and relative similarities, and finally the auxiliary ML loss is calculated from the similarity of the selected pairs. To evaluate the proposed method, we conduct experiments on the Speakers in the Wild (SITW) dataset. The results demonstrate the effectiveness of the proposed method.
AS · Jan 14, 2020
Gaussian speaker embedding learning for text-independent speaker verification
Bin Gu, Wu Guo
The x-vector maps segments of arbitrary duration to vectors of fixed dimension using a deep neural network. Combined with the probabilistic linear discriminant analysis (PLDA) backend, the x-vector/PLDA has become the dominant framework in text-independent speaker verification. Nevertheless, how to extract an x-vector appropriate for the PLDA backend is a key problem. In this paper, we propose a Gaussian noise constrained network (GNCN) to extract the x-vector, which adopts a multi-task learning strategy with the primary task classifying the speakers and the auxiliary task just fitting the Gaussian noises. Experiments are carried out using the SITW database. The results demonstrate the effectiveness of our proposed method.
AS · Jan 14, 2020
An Improved Deep Neural Network for Modeling Speaker Characteristics at Different Temporal Scales
Bin Gu, Wu Guo
This paper presents an improved deep embedding learning method based on a convolutional neural network (CNN) for text-independent speaker verification. Two improvements are proposed for x-vector embedding learning: (1) Multi-scale convolution (MSCNN) is adopted in the frame-level layers to capture complementary speaker information in different receptive fields. (2) A Baum-Welch statistics attention (BWSA) mechanism is applied in the temporal pooling layer, which can integrate more useful long-term speaker characteristics. Experiments are carried out on the NIST SRE16 evaluation set. The results demonstrate the effectiveness of MSCNN and show that the proposed BWSA can further improve the performance of the DNN embedding system.
AS · Jan 1, 2020
Attentive batch normalization for LSTM-based acoustic modeling of speech recognition
Fenglin Ding, Wu Guo, Lirong Dai et al.
Batch normalization (BN) is an effective method to accelerate model training and improve the generalization performance of neural networks. In this paper, we propose an improved batch normalization technique called attentive batch normalization (ABN) in Long Short Term Memory (LSTM) based acoustic modeling for automatic speech recognition (ASR). In the proposed method, an auxiliary network is used to dynamically generate the scaling and shifting parameters in batch normalization, and attention mechanisms are introduced to improve their regularized performance. Furthermore, two schemes, frame-level and utterance-level ABN, are investigated. We evaluate our proposed methods on Mandarin and Uyghur ASR tasks, respectively. The experimental results show that the proposed ABN greatly improves the performance of batch normalization in terms of transcription accuracy for both languages.
AS · Mar 28, 2019
Deep Neural Network Embeddings with Gating Mechanisms for Text-Independent Speaker Verification
Lanhua You, Wu Guo, Lirong Dai et al.
In this paper, gating mechanisms are applied in deep neural network (DNN) training for x-vector-based text-independent speaker verification. First, a gated convolution neural network (GCNN) is employed for modeling the frame-level embedding layers. Compared with the time-delay DNN (TDNN), the GCNN can obtain more expressive frame-level representations through carefully designed memory cell and gating mechanisms. Moreover, we propose a novel gated-attention statistics pooling strategy in which the attention scores are shared with the output gate. The gated-attention statistics pooling combines both gating and attention mechanisms into one framework; therefore, we can capture more useful information in the temporal pooling layer. Experiments are carried out using the NIST SRE16 and SRE18 evaluation datasets. The results demonstrate the effectiveness of the GCNN and show that the proposed gated-attention statistics pooling can further improve the performance.
AS · Mar 28, 2019
Multi-Task Learning with High-Order Statistics for X-vector based Text-Independent Speaker Verification
Lanhua You, Wu Guo, Lirong Dai et al.
The x-vector based deep neural network (DNN) embedding systems have demonstrated effectiveness for text-independent speaker verification. This paper presents a multi-task learning architecture for training the speaker embedding DNN with the primary task of classifying the target speakers, and the auxiliary task of reconstructing the first- and higher-order statistics of the original input utterance. The proposed training strategy aggregates both supervised and unsupervised learning into one framework to make the speaker embeddings more discriminative and robust. Experiments are carried out using the NIST SRE16 evaluation dataset and the VOiCES dataset. The results demonstrate that our proposed method outperforms the original x-vector approach with very little additional complexity.
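To make the auxiliary targets concrete, the sketch below computes per-dimension statistics of an utterance's frame-level features. Which orders the paper actually reconstructs is not stated in the abstract; the choice of mean, standard deviation, skewness, and kurtosis here is an assumption, and `utterance_statistics` is a hypothetical helper name.

```python
import math


def utterance_statistics(frames):
    """Per-dimension statistics over the frames of one utterance.

    frames: list of feature vectors, one per frame (all the same length).
    Returns one (mean, std, skewness, kurtosis) tuple per feature dimension.
    The set of statistics is an illustrative assumption.
    """
    n = len(frames)
    dim = len(frames[0])
    stats = []
    for d in range(dim):
        column = [frame[d] for frame in frames]
        mean = sum(column) / n
        # Population variance (divide by n, not n - 1).
        var = sum((x - mean) ** 2 for x in column) / n
        std = math.sqrt(var)
        if std == 0.0:
            # Constant feature: standardized moments are undefined, use 0.
            skew = kurt = 0.0
        else:
            skew = sum(((x - mean) / std) ** 3 for x in column) / n
            kurt = sum(((x - mean) / std) ** 4 for x in column) / n
        stats.append((mean, std, skew, kurt))
    return stats
```

In a multi-task setup of this kind, a small decoder head would be trained to predict such statistics from the speaker embedding, alongside the speaker classification loss.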
CL · Dec 6, 2018
The USTC-NEL Speech Translation system at IWSLT 2018
Dan Liu, Junhua Liu, Wu Guo et al.
This paper describes the USTC-NEL system submitted to the speech translation task of the IWSLT Evaluation 2018. The system is a conventional pipeline system which contains three modules: speech recognition, post-processing, and machine translation. We train a group of hybrid-HMM models for speech recognition, and for machine translation we train transformer-based neural machine translation models with text in the style of speech recognition output as input. Experiments conducted on the IWSLT 2018 task indicate that, compared to the baseline system from KIT, our system achieved a 14.9 BLEU improvement.
CL · Sep 7, 2015
Integrate Document Ranking Information into Confidence Measure Calculation for Spoken Term Detection
Quan Liu, Wu Guo, Zhen-Hua Ling
This paper proposes an algorithm to improve the calculation of confidence measure for spoken term detection (STD). Given an input query term, the algorithm first calculates a measurement named document ranking weight for each document in the speech database to reflect its relevance with the query term by summing all the confidence measures of the hypothesized term occurrences in this document. The confidence measure of each term occurrence is then re-estimated through linear interpolation with the calculated document ranking weight to improve its reliability by integrating document-level information. Experiments are conducted on three standard STD tasks for Tamil, Vietnamese and English respectively. The experimental results all demonstrate that the proposed algorithm achieves consistent improvements over the state-of-the-art method for confidence measure calculation. Furthermore, this algorithm is still effective even if a high accuracy speech recognizer is not available, which makes it applicable for the languages with limited speech resources.
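The two steps in this abstract (sum the confidences per document, then interpolate) can be sketched as follows. The abstract does not give the interpolation coefficient or say how the document ranking weight is scaled, so the default `alpha` and the division by the largest document weight are assumptions, and `rescore_hits` is a hypothetical name.

```python
from collections import defaultdict


def rescore_hits(hits, alpha=0.8):
    """Re-estimate confidences of hypothesized occurrences of one query term.

    hits: list of (doc_id, confidence) pairs for a single query term.
    alpha: interpolation coefficient (assumed value, not from the paper).
    Returns a list of (doc_id, new_confidence) in the same order as hits.
    """
    # Document ranking weight: sum of the confidences of all hypothesized
    # occurrences of the term in that document.
    doc_weight = defaultdict(float)
    for doc_id, conf in hits:
        doc_weight[doc_id] += conf

    # Assumption: scale weights into [0, 1] by the largest document weight
    # so they are comparable with the per-occurrence confidences.
    max_weight = max(doc_weight.values(), default=0.0)

    rescored = []
    for doc_id, conf in hits:
        weight = doc_weight[doc_id] / max_weight if max_weight > 0 else 0.0
        # Linear interpolation of the original confidence and document weight.
        rescored.append((doc_id, alpha * conf + (1.0 - alpha) * weight))
    return rescored
```

For example, an occurrence in a document that contains several other hypothesized occurrences of the term is pushed upward relative to an isolated occurrence with the same original confidence.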