CLSep 12, 2024
Full-text Error Correction for Chinese Speech Recognition with Large Language ModelZhiyuan Tang, Dong Wang, Shen Huang et al. · tencent-ai
Large Language Models (LLMs) have demonstrated substantial potential for error correction in Automatic Speech Recognition (ASR). However, most research focuses on utterances from short-duration speech recordings, which are the predominant form of speech data for supervised ASR training. This paper investigates the effectiveness of LLMs for error correction in full-text generated by ASR systems from longer speech recordings, such as transcripts from podcasts, news broadcasts, and meetings. First, we develop a Chinese dataset for full-text error correction, named ChFT, utilizing a pipeline that involves text-to-speech synthesis, ASR, and error-correction pair extractor. This dataset enables us to correct errors across contexts, including both full-text and segment, and to address a broader range of error types, such as punctuation restoration and inverse text normalization, thus making the correction process comprehensive. Second, we fine-tune a pre-trained LLM on the constructed dataset using a diverse set of prompts and target formats, and evaluate its performance on full-text error correction. Specifically, we design prompts based on full-text and segment, considering various output formats, such as directly corrected text and JSON-based error-correction pairs. Through various test settings, including homogeneous, up-to-date, and hard test sets, we find that the fine-tuned LLMs perform well in the full-text setting with different prompts, each presenting its own strengths and weaknesses. This establishes a promising baseline for further research. The dataset is available on the website.
CLApr 2, 2025Code
Chain of Correction for Full-text Speech Recognition with Large Language ModelsZhiyuan Tang, Dong Wang, Zhikai Zhou et al. · tencent-ai
Full-text error correction with Large Language Models (LLMs) for Automatic Speech Recognition (ASR) is attracting increased attention for its ability to address a wide range of error types, such as punctuation restoration and inverse text normalization, across long context. However, challenges remain regarding stability, controllability, completeness, and fluency. To mitigate these issues, this paper proposes the Chain of Correction (CoC), which uses a multi-turn chat format to correct errors segment by segment, guided by pre-recognized text and full-text context for better semantic understanding. Utilizing the open-sourced ChFT dataset, we fine-tune a pre-trained LLM to evaluate CoC's performance. Experiments show that CoC significantly outperforms baseline and benchmark systems in correcting full-text ASR outputs. We also analyze correction thresholds to balance under-correction and over-rephrasing, extrapolate CoC on extra-long ASR outputs, and explore using other types of information to guide error correction.
CLSep 18, 2025
Cross-Modal Knowledge Distillation for Speech Large Language ModelsEnzhi Wang, Qicheng Li, Zhiyuan Tang et al.
In this work, we present the first systematic evaluation of catastrophic forgetting and modality inequivalence in speech large language models, showing that introducing speech capabilities can degrade knowledge and reasoning even when inputs remain textual, and performance further decreases with spoken queries. To address these challenges, we propose a cross-modal knowledge distillation framework that leverages both text-to-text and speech-to-text channels to transfer knowledge from a text-based teacher model to a speech LLM. Extensive experiments on dialogue and audio understanding tasks validate the effectiveness of our approach in preserving textual knowledge, improving cross-modal alignment, and enhancing reasoning in speech-based interactions.
CLJul 2, 2024
Pinyin Regularization in Error Correction for Chinese Speech Recognition with Large Language ModelsZhiyuan Tang, Dong Wang, Shen Huang et al.
Recent studies have demonstrated the efficacy of large language models (LLMs) in error correction for automatic speech recognition (ASR). However, much of the research focuses on the English language. This paper redirects the attention to Chinese. Firstly, we construct a specialized benchmark dataset aimed at error correction for Chinese ASR with 724K hypotheses-transcription pairs, named the Chinese Hypotheses Paradise dataset (ChineseHP), which contains a wide range of scenarios and presents significant challenges. Subsequently, we conduct a preliminary evaluation using the dataset for both direct-prompting and fine-tuning pre-trained LLMs. Furthermore, we propose a straightforward method of Pinyin regularization for prompts, which involves the transcription of Pinyin directly from text hypotheses. The experimental results reveal that Pinyin regularization consistently enhances the error-correcting ability of LLMs when compared with those without regularization. The dataset is available on the website.
SDNov 4, 2020
Can We Trust Deep Speech Prior?Ying Shi, Haolin Chen, Zhiyuan Tang et al.
Recently, speech enhancement (SE) based on deep speech prior has attracted much attention, such as the variational auto-encoder with non-negative matrix factorization (VAE-NMF) architecture. Compared to conventional approaches that represent clean speech by shallow models such as Gaussians with a low-rank covariance, the new approach employs deep generative models to represent the clean speech, which often provides a better prior. Despite the clear advantage in theory, we argue that deep priors must be used with much caution, since the likelihood produced by a deep generative model does not always coincide with the speech quality. We designed a comprehensive study on this issue and demonstrated that based on deep speech priors, a reasonable SE performance can be achieved, but the results might be suboptimal. A careful analysis showed that this problem is deeply rooted in the disharmony between the flexibility of deep generative models and the nature of the maximum-likelihood (ML) training.
ASJun 4, 2020
AP20-OLR Challenge: Three Tasks and Their BaselinesZheng Li, Miao Zhao, Qingyang Hong et al.
This paper introduces the fifth oriental language recognition (OLR) challenge AP20-OLR, which intends to improve the performance of language recognition systems, along with APSIPA Annual Summit and Conference (APSIPA ASC). The data profile, three tasks, the corresponding baselines, and the evaluation principles are introduced in this paper. The AP20-OLR challenge includes more languages, dialects and real-life data provided by Speechocean and the NSFC M2ASR project, and all the data is free for participants. The challenge this year still focuses on practical and challenging problems, with three tasks: (1) cross-channel LID, (2) dialect identification and (3) noisy LID. Based on Kaldi and Pytorch, recipes for i-vector and x-vector systems are also conducted as baselines for the three tasks. These recipes will be online-published, and available for participants to configure LID systems. The baseline results on the three tasks demonstrate that those tasks in this challenge are worth paying more efforts to achieve better performance.
ASJul 16, 2019
AP19-OLR Challenge: Three Tasks and Their BaselinesZhiyuan Tang, Dong Wang, Liming Song
This paper introduces the fourth oriental language recognition (OLR) challenge AP19-OLR, including the data profile, the tasks and the evaluation principles. The OLR challenge has been held successfully for three consecutive years, along with APSIPA Annual Summit and Conference (APSIPA ASC). The challenge this year still focuses on practical and challenging tasks, precisely (1) short-utterance LID, (2) cross-channel LID and (3) zero-resource LID. The event this year includes more languages and more real-life data provided by SpeechOcean and the NSFC M2ASR project. All the data is free for participants. Recipes for x-vector system and back-end evaluation are also conducted as baselines for the three tasks. The participants can refer to these online-published recipes to deploy LID systems for convenience. We report the baseline results on the three tasks and demonstrate that the three tasks are worth some efforts to achieve better performance.
ASNov 8, 2018
Gaussian-Constrained training for speaker verificationLantian Li, Zhiyuan Tang, Ying Shi et al.
Neural models, in particular the d-vector and x-vector architectures, have produced state-of-the-art performance on many speaker verification tasks. However, two potential problems of these neural models deserve more investigation. Firstly, both models suffer from `information leak', which means that some parameters participating in model training will be discarded during inference, i.e, the layers that are used as the classifier. Secondly, these models do not regulate the distribution of the derived speaker vectors. This `unconstrained distribution' may degrade the performance of the subsequent scoring component, e.g., PLDA. This paper proposes a Gaussian-constrained training approach that (1) discards the parametric classifier, and (2) enforces the distribution of the derived speaker vectors to be Gaussian. Our experiments on the VoxCeleb and SITW databases demonstrated that this new training approach produced more representative and regular speaker embeddings, leading to consistent performance improvement.
ASNov 8, 2018
Phonetic-attention scoring for deep speaker features in speaker verificationLantian Li, Zhiyuan Tang, Ying Shi et al.
Recent studies have shown that frame-level deep speaker features can be derived from a deep neural network with the training target set to discriminate speakers by a short speech segment. By pooling the frame-level features, utterance-level representations, called d-vectors, can be derived and used in the automatic speaker verification (ASV) task. This simple average pooling, however, is inherently sensitive to the phonetic content of the utterance. An interesting idea borrowed from machine translation is the attention-based mechanism, where the contribution of an input word to the translation at a particular time is weighted by an attention score. This score reflects the relevance of the input word and the present translation. We can use the same idea to align utterances with different phonetic contents. This paper proposes a phonetic-attention scoring approach for d-vector systems. By this approach, an attention score is computed for each frame pair. This score reflects the similarity of the two frames in phonetic content, and is used to weigh the contribution of this frame pair in the utterance-based scoring. This new scoring approach emphasizes the frame pairs with similar phonetic contents, which essentially provides a soft alignment for utterances with any phonetic contents. Experimental results show that compared with the naive average pooling, this phonetic-attention scoring approach can deliver consistent performance improvement in ASV tasks of both text-dependent and text-independent.
CLJun 2, 2018
AP18-OLR Challenge: Three Tasks and Their BaselinesZhiyuan Tang, Dong Wang, Qing Chen
The third oriental language recognition (OLR) challenge AP18-OLR is introduced in this paper, including the data profile, the tasks and the evaluation principles. Following the events in the last two years, namely AP16-OLR and AP17-OLR, the challenge this year focuses on more challenging tasks, including (1) short-duration utterances, (2) confusing languages, and (3) open-set recognition. The same as the previous events, the data of AP18-OLR is also provided by SpeechOcean and the NSFC M2ASR project. Baselines based on both the i-vector model and neural networks are constructed for the participants' reference. We report the baseline results on the three tasks and demonstrate that the three tasks are truly challenging. All the data is free for participants, and the Kaldi recipes for the baselines have been published online.
ASFeb 27, 2018
Deep factorization for speech signalLantian Li, Dong Wang, Yixiang Chen et al.
Various informative factors mixed in speech signals, leading to great difficulty when decoding any of the factors. An intuitive idea is to factorize each speech frame into individual informative factors, though it turns out to be highly difficult. Recently, we found that speaker traits, which were assumed to be long-term distributional properties, are actually short-time patterns, and can be learned by a carefully designed deep neural network (DNN). This discovery motivated a cascade deep factorization (CDF) framework that will be presented in this paper. The proposed framework infers speech factors in a sequential way, where factors previously inferred are used as conditional variables when inferring other factors. We will show that this approach can effectively factorize speech signals, and using these factors, the original speech spectrum can be recovered with a high accuracy. This factorization and reconstruction approach provides potential values for many speech processing tasks, e.g., speaker recognition and emotion recognition, as will be demonstrated in the paper.
SDNov 15, 2017
Human and Machine Speaker Recognition Based on Short Trivial EventsMiao Zhang, Xiaofei Kang, Yanqing Wang et al.
Trivial events are ubiquitous in human to human conversations, e.g., cough, laugh and sniff. Compared to regular speech, these trivial events are usually short and unclear, thus generally regarded as not speaker discriminative and so are largely ignored by present speaker recognition research. However, these trivial events are highly valuable in some particular circumstances such as forensic examination, as they are less subjected to intentional change, so can be used to discover the genuine speaker from disguised speech. In this paper, we collect a trivial event speech database that involves 75 speakers and 6 types of events, and report preliminary speaker recognition results on this database, by both human listeners and machines. Particularly, the deep feature learning technique recently proposed by our group is utilized to analyze and recognize the trivial events, which leads to acceptable equal error rates (EERs) despite the extremely short durations (0.2-0.5 seconds) of these events. Comparing different types of events, 'hmm' seems more speaker discriminative.
SDOct 31, 2017
Full-info Training for Deep Speaker Feature LearningLantian Li, Zhiyuan Tang, Dong Wang et al.
In recent studies, it has shown that speaker patterns can be learned from very short speech segments (e.g., 0.3 seconds) by a carefully designed convolutional & time-delay deep neural network (CT-DNN) model. By enforcing the model to discriminate the speakers in the training data, frame-level speaker features can be derived from the last hidden layer. In spite of its good performance, a potential problem of the present model is that it involves a parametric classifier, i.e., the last affine layer, which may consume some discriminative knowledge, thus leading to `information leak' for the feature learning. This paper presents a full-info training approach that discards the parametric classifier and enforces all the discriminative knowledge learned by the feature net. Our experiments on the Fisher database demonstrate that this new training scheme can produce more coherent features, leading to consistent and notable performance improvement on the speaker verification task.
CLJun 28, 2017
AP17-OLR Challenge: Data, Plan, and BaselineZhiyuan Tang, Dong Wang, Yixiang Chen et al.
We present the data profile and the evaluation plan of the second oriental language recognition (OLR) challenge AP17-OLR. Compared to the event last year (AP16-OLR), the new challenge involves more languages and focuses more on short utterances. The data is offered by SpeechOcean and the NSFC M2ASR project. Two types of baselines are constructed to assist the participants, one is based on the i-vector model and the other is based on various neural networks. We report the baseline results evaluated with various metrics defined by the AP17-OLR evaluation plan and demonstrate that the combined database is a reasonable data resource for multilingual research. All the data is free for participants, and the Kaldi recipes for the baselines have been published online.
SDJun 22, 2017
Deep Speaker Verification: Do We Need End to End?Dong Wang, Lantian Li, Zhiyuan Tang et al.
End-to-end learning treats the entire system as a whole adaptable black box, which, if sufficient data are available, may learn a system that works very well for the target task. This principle has recently been applied to several prototype research on speaker verification (SV), where the feature learning and classifier are learned together with an objective function that is consistent with the evaluation metric. An opposite approach to end-to-end is feature learning, which firstly trains a feature learning model, and then constructs a back-end classifier separately to perform SV. Recently, both approaches achieved significant performance gains on SV, mainly attributed to the smart utilization of deep neural networks. However, the two approaches have not been carefully compared, and their respective advantages have not been well discussed. In this paper, we compare the end-to-end and feature learning approaches on a text-independent SV task. Our experiments on a dataset sampled from the Fisher database and involving 5,000 speakers demonstrated that the feature learning approach outperformed the end-to-end approach. This is a strong support for the feature learning approach, at least with data and computation resources similar to ours.
SDJun 5, 2017
Deep Factorization for Speech SignalDong Wang, Lantian Li, Ying Shi et al.
Speech signals are complex intermingling of various informative factors, and this information blending makes decoding any of the individual factors extremely difficult. A natural idea is to factorize each speech frame into independent factors, though it turns out to be even more difficult than decoding each individual factor. A major encumbrance is that the speaker trait, a major factor in speech signals, has been suspected to be a long-term distributional pattern and so not identifiable at the frame level. In this paper, we demonstrated that the speaker factor is also a short-time spectral pattern and can be largely identified with just a few frames using a simple deep neural network (DNN). This discovery motivated a cascade deep factorization (CDF) framework that infers speech factors in a sequential way, and factors previously inferred are used as conditional variables when inferring other factors. Our experiment on an automatic emotion recognition (AER) task demonstrated that this approach can effectively factorize speech signals, and using these factors, the original speech spectrum can be recovered with high accuracy. This factorization and reconstruction approach provides a novel tool for many speech processing tasks.
SDMay 10, 2017
Deep Speaker Feature Learning for Text-independent Speaker VerificationLantian Li, Yixiang Chen, Ying Shi et al.
Recently deep neural networks (DNNs) have been used to learn speaker features. However, the quality of the learned features is not sufficiently good, so a complex back-end model, either neural or probabilistic, has to be used to address the residual uncertainty when applied to speaker verification, just as with raw features. This paper presents a convolutional time-delay deep neural network structure (CT-DNN) for speaker feature learning. Our experimental results on the Fisher database demonstrated that this CT-DNN can produce high-quality speaker features: even with a single feature (0.3 seconds including the context), the EER can be as low as 7.68%. This effectively confirmed that the speaker trait is largely a deterministic short-time property rather than a long-time distributional pattern, and therefore can be extracted from just dozens of frames.
CLMay 9, 2017
Phone-aware Neural Language IdentificationZhiyuan Tang, Dong Wang, Yixiang Chen et al.
Pure acoustic neural models, particularly the LSTM-RNN model, have shown great potential in language identification (LID). However, the phonetic information has been largely overlooked by most of existing neural LID models, although this information has been used in the conventional phonetic LID systems with a great success. We present a phone-aware neural LID architecture, which is a deep LSTM-RNN LID system but accepts output from an RNN-based ASR system. By utilizing the phonetic knowledge, the LID performance can be significantly improved. Interestingly, even if the test language is not involved in the ASR training, the phonetic knowledge still presents a large contribution. Our experiments conducted on four languages within the Babel corpus demonstrated that the phone-aware approach is highly effective.
CLMay 9, 2017
Phonetic Temporal Neural Model for Language IdentificationZhiyuan Tang, Dong Wang, Yixiang Chen et al.
Deep neural models, particularly the LSTM-RNN model, have shown great potential for language identification (LID). However, the use of phonetic information has been largely overlooked by most existing neural LID methods, although this information has been used very successfully in conventional phonetic LID systems. We present a phonetic temporal neural model for LID, which is an LSTM-RNN LID system that accepts phonetic features produced by a phone-discriminative DNN as the input, rather than raw acoustic features. This new model is similar to traditional phonetic LID methods, but the phonetic knowledge here is much richer: it is at the frame level and involves compacted information of all phones. Our experiments conducted on the Babel database and the AP16-OLR database demonstrate that the temporal phonetic neural approach is very effective, and significantly outperforms existing acoustic neural models. It also outperforms the conventional i-vector approach on short utterances and in noisy conditions.
LGSep 28, 2016
Memory Visualization for Gated Recurrent Neural Networks in Speech RecognitionZhiyuan Tang, Ying Shi, Dong Wang et al.
Recurrent neural networks (RNNs) have shown clear superiority in sequence modeling, particularly the ones with gated units, such as long short-term memory (LSTM) and gated recurrent unit (GRU). However, the dynamic properties behind the remarkable performance remain unclear in many applications, e.g., automatic speech recognition (ASR). This paper employs visualization techniques to study the behavior of LSTM and GRU when performing speech recognition tasks. Our experiments show some interesting patterns in the gated memory, and some of them have inspired simple yet effective modifications on the network structure. We report two of such modifications: (1) lazy cell update in LSTM, and (2) shortcut connections for residual learning. Both modifications lead to more comprehensible and powerful networks.
SDSep 27, 2016
Collaborative Learning for Language and Speaker RecognitionLantian Li, Zhiyuan Tang, Dong Wang et al.
This paper presents a unified model to perform language and speaker recognition simultaneously and altogether. The model is based on a multi-task recurrent neural network where the output of one task is fed as the input of the other, leading to a collaborative learning framework that can improve both language and speaker recognition by borrowing information from each other. Our experiments demonstrated that the multi-task model outperforms the task-specific models on both tasks.
CLSep 27, 2016
OC16-CE80: A Chinese-English Mixlingual Database and A Speech Recognition BaselineDong Wang, Zhiyuan Tang, Difei Tang et al.
We present the OC16-CE80 Chinese-English mixlingual speech database which was released as a main resource for training, development and test for the Chinese-English mixlingual speech recognition (MixASR-CHEN) challenge on O-COCOSDA 2016. This database consists of 80 hours of speech signals recorded from more than 1,400 speakers, where the utterances are in Chinese but each involves one or several English words. Based on the database and another two free data resources (THCHS30 and the CMU dictionary), a speech recognition (ASR) baseline was constructed with the deep neural network-hidden Markov model (DNN-HMM) hybrid system. We then report the baseline results following the MixASR-CHEN evaluation rules and demonstrate that OC16-CE80 is a reasonable data resource for mixlingual research.
CLSep 27, 2016
Multi-task Recurrent Model for True Multilingual Speech RecognitionZhiyuan Tang, Lantian Li, Dong Wang
Research on multilingual speech recognition remains attractive yet challenging. Recent studies focus on learning shared structures under the multi-task paradigm, in particular a feature sharing structure. This approach has been found effective to improve performance on each individual language. However, this approach is only useful when the deployed system supports just one language. In a true multilingual scenario where multiple languages are allowed, performance will be significantly reduced due to the competition among languages in the decoding space. This paper presents a multi-task recurrent model that involves a multilingual speech recognition (ASR) component and a language recognition (LR) component, and the ASR component is informed of the language information by the LR component, leading to a language-aware recognition. We tested the approach on an English-Chinese bilingual recognition task. The results show that the proposed multi-task recurrent model can improve performance of multilingual recognition systems.
CLMar 31, 2016
Multi-task Recurrent Model for Speech and Speaker RecognitionZhiyuan Tang, Lantian Li, Dong Wang
Although highly correlated, speech and speaker recognition have been regarded as two independent tasks and studied by two communities. This is certainly not the way that people behave: we decipher both speech content and speaker traits at the same time. This paper presents a unified model to perform speech and speaker recognition simultaneously and altogether. The model is based on a unified neural network where the output of one task is fed to the input of the other, leading to a multi-task recurrent network. Experiments show that the joint model outperforms the task-specific models on both the two tasks.
LGJun 7, 2015
Knowledge Transfer Pre-trainingZhiyuan Tang, Dong Wang, Yiqiao Pan et al.
Pre-training is crucial for learning deep neural networks. Most of existing pre-training methods train simple models (e.g., restricted Boltzmann machines) and then stack them layer by layer to form the deep structure. This layer-wise pre-training has found strong theoretical foundation and broad empirical support. However, it is not easy to employ such method to pre-train models without a clear multi-layer structure,e.g., recurrent neural networks (RNNs). This paper presents a new pre-training approach based on knowledge transfer learning. In contrast to the layer-wise approach which trains model components incrementally, the new approach trains the entire model as a whole but with an easier objective function. This is achieved by utilizing soft targets produced by a prior trained model (teacher model). Compared to the conventional layer-wise methods, this new method does not care about the model structure, so can be used to pre-train very complex models. Experiments on a speech recognition task demonstrated that with this approach, complex RNNs can be well trained with a weaker deep neural network (DNN) model. Furthermore, the new method can be combined with conventional layer-wise pre-training to deliver additional gains.
MLMay 18, 2015
Recurrent Neural Network Training with Dark Knowledge TransferZhiyuan Tang, Dong Wang, Zhiyong Zhang
Recurrent neural networks (RNNs), particularly long short-term memory (LSTM), have gained much attention in automatic speech recognition (ASR). Although some successful stories have been reported, training RNNs remains highly challenging, especially with limited training data. Recent research found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision. This knowledge transfer learning has been employed to train simple neural nets with a complex one, so that the final performance can reach a level that is infeasible to obtain by regular training. In this paper, we employ the knowledge transfer learning approach to train RNNs (precisely LSTM) using a deep neural network (DNN) model as the teacher. This is different from most of the existing research on knowledge transfer learning, since the teacher (DNN) is assumed to be weaker than the child (RNN); however, our experiments on an ASR task showed that it works fairly well: without applying any tricks on the learning scheme, this approach can train RNNs successfully even with limited training data.