Haihua Xu

AS
19papers
287citations
Novelty47%
AI Score41

19 Papers

79.4CLMar 25Code
From Oracle to Noisy Context: Mitigating Contextual Exposure Bias in Speech-LLMs

Xiaoyong Guo, Nanjie Li, Zijie Zeng et al.

Contextual automatic speech recognition (ASR) with Speech-LLMs is typically trained with oracle conversation history, but relies on error-prone history at inference, causing a train-test mismatch in the context channel that we term contextual exposure bias. We propose a unified training framework to improve robustness under realistic histories: (i) Teacher Error Knowledge by using Whisper large-v3 hypotheses as training-time history, (ii) Context Dropout to regularize over-reliance on history, and (iii) Direct Preference Optimization (DPO) on curated failure cases. Experiments on TED-LIUM 3 (in-domain) and zero-shot LibriSpeech (out-of-domain) show consistent gains under predicted-history decoding. With a two-utterance history as context, SFT with Whisper hypotheses reduce WER from 5.59% (oracle-history training) to 5.47%, and DPO further improves to 5.17%. Under irrelevant-context attacks, DPO yields the smallest degradation (5.17% -> 5.63%), indicating improved robustness to misleading context. Our code and models are published on https://github.com/XYGuo1996/Contextual_Speech_LLMs.

ASJul 9, 2022
Internal Language Model Estimation based Language Model Fusion for Cross-Domain Code-Switching Speech Recognition

Yizhou Peng, Yufei Liu, Jicheng Zhang et al.

Internal Language Model Estimation (ILME) based language model (LM) fusion has been shown significantly improved recognition results over conventional shallow fusion in both intra-domain and cross-domain speech recognition tasks. In this paper, we attempt to apply our ILME method to cross-domain code-switching speech recognition (CSSR) work. Specifically, our curiosity comes from several aspects. First, we are curious about how effective the ILME-based LM fusion is for both intra-domain and cross-domain CSSR tasks. We verify this with or without merging two code-switching domains. More importantly, we train an end-to-end (E2E) speech recognition model by means of merging two monolingual data sets and observe the efficacy of the proposed ILME-based LM fusion for CSSR. Experimental results on SEAME that is from Southeast Asian and another Chinese Mainland CS data set demonstrate the effectiveness of the proposed ILME-based LM fusion method.

ASNov 1, 2022
Speech-text based multi-modal training with bidirectional attention for improved speech recognition

Yuhang Yang, Haihua Xu, Hao Huang et al.

To let the state-of-the-art end-to-end ASR model enjoy data efficiency, as well as much more unpaired text data by multi-modal training, one needs to address two problems: 1) the synchronicity of feature sampling rates between speech and language (aka text data); 2) the homogeneity of the learned representations from two encoders. In this paper we propose to employ a novel bidirectional attention mechanism (BiAM) to jointly learn both ASR encoder (bottom layers) and text encoder with a multi-modal learning method. The BiAM is to facilitate feature sampling rate exchange, realizing the quality of the transformed features for the one kind to be measured in another space, with diversified objective functions. As a result, the speech representations are enriched with more linguistic information, while the representations generated by the text encoder are more similar to corresponding speech ones, and therefore the shared ASR models are more amenable for unpaired text data pretraining. To validate the efficacy of the proposed method, we perform two categories of experiments with or without extra unpaired text data. Experimental results on Librispeech corpus show it can achieve up to 6.15% word error rate reduction (WERR) with only paired data learning, while 9.23% WERR when more unpaired text data is employed.

CLNov 2, 2022
Internal Language Model Estimation based Adaptive Language Model Fusion for Domain Adaptation

Rao Ma, Xiaobo Wu, Jin Qiu et al.

ASR model deployment environment is ever-changing, and the incoming speech can be switched across different domains during a session. This brings a challenge for effective domain adaptation when only target domain text data is available, and our objective is to obtain obviously improved performance on the target domain while the performance on the general domain is less undermined. In this paper, we propose an adaptive LM fusion approach called internal language model estimation based adaptive domain adaptation (ILME-ADA). To realize such an ILME-ADA, an interpolated log-likelihood score is calculated based on the maximum of the scores from the internal LM and the external LM (ELM) respectively. We demonstrate the efficacy of the proposed ILME-ADA method with both RNN-T and LAS modeling frameworks employing neural network and n-gram LMs as ELMs respectively on two domain specific (target) test sets. The proposed method can achieve significantly better performance on the target test sets while it gets minimal performance degradation on the general test set, compared with both shallow and ILME-based LM fusion methods.

ASOct 28, 2022
Random Utterance Concatenation Based Data Augmentation for Improving Short-video Speech Recognition

Yist Y. Lin, Tao Han, Haihua Xu et al.

One of limitations in end-to-end automatic speech recognition (ASR) framework is its performance would be compromised if train-test utterance lengths are mismatched. In this paper, we propose an on-the-fly random utterance concatenation (RUC) based data augmentation method to alleviate train-test utterance length mismatch issue for short-video ASR task. Specifically, we are motivated by observations that our human-transcribed training utterances tend to be much shorter for short-video spontaneous speech (~3 seconds on average), while our test utterance generated from voice activity detection front-end is much longer (~10 seconds on average). Such a mismatch can lead to suboptimal performance. Empirically, it's observed the proposed RUC method significantly improves long utterance recognition without performance drop on short one. Overall, it achieves 5.72% word error rate reduction on average for 15 languages and improved robustness to various utterance length.

ASJan 26, 2022
Internal Language Model Estimation Through Explicit Context Vector Learning for Attention-based Encoder-decoder ASR

Yufei Liu, Rao Ma, Haihua Xu et al.

An end-to-end (E2E) ASR model implicitly learns a prior Internal Language Model (ILM) from the training transcripts. To fuse an external LM using Bayes posterior theory, the log likelihood produced by the ILM has to be accurately estimated and subtracted. In this paper we propose two novel approaches to estimate the ILM based on Listen-Attend-Spell (LAS) framework. The first method is to replace the context vector of the LAS decoder at every time step with a vector that is learned with training transcripts. Furthermore, we propose another method that uses a lightweight feed-forward network to directly map query vector to context vector in a dynamic sense. Since the context vectors are learned by minimizing the perplexities on training transcripts, and their estimation is independent of encoder output, hence the ILMs are accurately learned for both methods. Experiments show that the ILMs achieve the lowest perplexity, indicating the efficacy of the proposed methods. In addition, they also significantly outperform the shallow fusion method, as well as two previously proposed ILM Estimation (ILME) approaches on several datasets.

ASJul 22, 2021
Multitask-Based Joint Learning Approach To Robust ASR For Radio Communication Speech

Duo Ma, Nana Hou, Van Tung Pham et al.

To realize robust end-to-end Automatic Speech Recognition(E2E ASR) under radio communication condition, we propose a multitask-based method to joint train a Speech Enhancement (SE) module as the front-end and an E2E ASR model as the back-end in this paper. One of the advantage of the proposed method is that the entire system can be trained from scratch. Different from prior works, either component here doesn't need to perform pre-training and fine-tuning processes separately. Through analysis, we found that the success of the proposed method lies in the following aspects. Firstly, multitask learning is essential, that is the SE network is not only learning to produce more Intelligent speech, it is also aimed to generate speech that is beneficial to recognition. Secondly, we also found speech phase preserved from noisy speech is critical for improving ASR performance. Thirdly, we propose a dual channel data augmentation training method to obtain further improvement.Specifically, we combine the clean and enhanced speech to train the whole system. We evaluate the proposed method on the RATS English data set, achieving a relative WER reduction of 4.6% with the joint training method, and up to a relative WER reduction of 11.2% with the proposed data augmentation method.

SDOct 23, 2020
Enriching Under-Represented Named-Entities To Improve Speech Recognition Performance

Tingzhi Mao, Yerbolat Khassanov, Van Tung Pham et al.

Automatic speech recognition (ASR) for under-represented named-entity (UR-NE) is challenging due to such named-entities (NE) have insufficient instances and poor contextual coverage in the training data to learn reliable estimates and representations. In this paper, we propose approaches to enriching UR-NEs to improve speech recognition performance. Specifically, our first priority is to ensure those UR-NEs to appear in the word lattice if there is any. To this end, we make exemplar utterances for those UR-NEs according to their categories (e.g. location, person, organization, etc.), ending up with an improved language model (LM) that boosts the UR-NE occurrence in the word lattice. With more UR-NEs appearing in the lattice, we then boost the recognition performance through lattice rescoring methods. We first enrich the representations of UR-NEs in a pre-trained recurrent neural network LM (RNNLM) by borrowing the embedding representations of the rich-represented NEs (RR-NEs), yielding the lattices that statistically favor the UR-NEs. Finally, we directly boost the likelihood scores of the utterances containing UR-NEs and gain further performance improvement.

ASOct 22, 2020
The NTU-AISG Text-to-speech System for Blizzard Challenge 2020

Haobo Zhang, Tingzhi Mao, Haihua Xu et al.

We report our NTU-AISG Text-to-speech (TTS) entry systems for the Blizzard Challenge 2020 in this paper. There are two TTS tasks in this year's challenge, one is a Mandarin TTS task, the other is a Shanghai dialect TTS task. We have participated both. One of the main challenges is to build TTS systems with low-resource constraints, particularly for the case of Shanghai dialect, of which about three hours data are available to participants. To overcome the constraint, we adopt an average-speaker modeling method. That is, we first employ external Mandarin data to train both End-to-end acoustic model and WaveNet vocoder, then we use Shanghai dialect to tune the acoustic model and WaveNet vocoder respectively. Apart from this, we have no Shanghai dialect lexicon despite syllable transcripts are provided for the training data. Since we are not sure if similar syllable transcripts are provided for the evaluation data during the training stage, we use Mandarin lexicon for Shanghai dialect instead. With the letter, as decomposed from the corresponding Mandarin syllable, as input, though the naturalness and original speaker similarity of the synthesized speech are good, subjective evaluation results indicate the intelligibility of the synthesized speech is deeply undermined for the Shanghai dialect TTS system.

ASOct 22, 2020
Multilingual Approach to Joint Speech and Accent Recognition with DNN-HMM Framework

Yizhou Peng, Jicheng Zhang, Haobo Zhang et al.

Human can recognize speech, as well as the peculiar accent of the speech simultaneously. However, present state-of-the-art ASR system can rarely do that. In this paper, we propose a multilingual approach to recognizing English speech, and related accent that speaker conveys using DNN-HMM framework. Specifically, we assume different accents of English as different languages. We then merge them together and train a multilingual ASR system. During decoding, we conduct two experiments. One is a monolingual ASR-based decoding, with the accent information embedded at phone level, realizing word-based accent recognition (AR), and the other is a multilingual ASR-based decoding, realizing an approximated utterance-based AR. Experimental results on an 8-accent English speech recognition show both methods can yield WERs close to the conventional ASR systems that completely ignore the accent, as well as desired AR accuracy. Besides, we conduct extensive analysis for the proposed method, such as transfer learning without-domain data exploitation, cross-accent recognition confusion, as well as characteristics of accented-word.

ASMay 21, 2020
Leveraging Text Data Using Hybrid Transformer-LSTM Based End-to-End ASR in Transfer Learning

Zhiping Zeng, Van Tung Pham, Haihua Xu et al.

In this work, we study leveraging extra text data to improve low-resource end-to-end ASR under cross-lingual transfer learning setting. To this end, we extend our prior work [1], and propose a hybrid Transformer-LSTM based architecture. This architecture not only takes advantage of the highly effective encoding capacity of the Transformer network but also benefits from extra text data due to the LSTM-based independent language model network. We conduct experiments on our in-house Malay corpus which contains limited labeled data and a large amount of extra text. Results show that the proposed architecture outperforms the previous LSTM-based architecture [1] by 24.2% relative word error rate (WER) when both are trained using limited labeled data. Starting from this, we obtain further 25.4% relative WER reduction by transfer learning from another resource-rich language. Moreover, we obtain additional 13.6% relative WER reduction by boosting the LSTM decoder of the transferred model with the extra text data. Overall, our best model outperforms the vanilla Transformer ASR by 11.9% relative WER. Last but not least, the proposed hybrid architecture offers much faster inference compared to both LSTM and Transformer architectures.

ASMay 18, 2020
Approaches to Improving Recognition of Underrepresented Named Entities in Hybrid ASR Systems

Tingzhi Mao, Yerbolat Khassanov, Van Tung Pham et al.

In this paper, we present a series of complementary approaches to improve the recognition of underrepresented named entities (NE) in hybrid ASR systems without compromising overall word error rate performance. The underrepresented words correspond to rare or out-of-vocabulary (OOV) words in the training data, and thereby can't be modeled reliably. We begin with graphemic lexicon which allows to drop the necessity of phonetic models in hybrid ASR. We study it under different settings and demonstrate its effectiveness in dealing with underrepresented NEs. Next, we study the impact of neural language model (LM) with letter-based features derived to handle infrequent words. After that, we attempt to enrich representations of underrepresented NEs in pretrained neural LM by borrowing the embedding representations of rich-represented words. This let us gain significant performance improvement on underrepresented NE recognition. Finally, we boost the likelihood scores of utterances containing NEs in the word lattices rescored by neural LMs and gain further performance improvement. The combination of the aforementioned approaches improves NE recognition by up to 42% relatively.

CVJan 5, 2020
Spatial-Scale Aligned Network for Fine-Grained Recognition

Lizhao Gao, Haihua Xu, Chong Sun et al.

Existing approaches for fine-grained visual recognition focus on learning marginal region-based representations while neglecting the spatial and scale misalignments, leading to inferior performance. In this paper, we propose the spatial-scale aligned network (SSANET) and implicitly address misalignments during the recognition process. Especially, SSANET consists of 1) a self-supervised proposal mining formula with Morphological Alignment Constraints; 2) a discriminative scale mining (DSM) module, which exploits the feature pyramid via a circulant matrix, and provides the Fourier solver for fast scale alignments; 3) an oriented pooling (OP) module, that performs the pooling operation in several pre-defined orientations. Each orientation defines one kind of spatial alignment, and the network automatically determines which is the optimal alignments through learning. With the proposed two modules, our algorithm can automatically determine the accurate local proposal regions and generate more robust target representations being invariant to various appearance variances. Extensive experiments verify that SSANET is competent at learning better spatial-scale invariant target representations, yielding superior performance on the fine-grained recognition task on several benchmarks.

CLNov 25, 2019
Independent language modeling architecture for end-to-end ASR

Van Tung Pham, Haihua Xu, Yerbolat Khassanov et al.

The attention-based end-to-end (E2E) automatic speech recognition (ASR) architecture allows for joint optimization of acoustic and language models within a single network. However, in a vanilla E2E ASR architecture, the decoder sub-network (subnet), which incorporates the role of the language model (LM), is conditioned on the encoder output. This means that the acoustic encoder and the language model are entangled that doesn't allow language model to be trained separately from external text data. To address this problem, in this work, we propose a new architecture that separates the decoder subnet from the encoder output. In this way, the decoupled subnet becomes an independently trainable LM subnet, which can easily be updated using the external text data. We study two strategies for updating the new architecture. Experimental results show that, 1) the independent LM architecture benefits from external text data, achieving 9.3% and 22.8% relative character and word error rate reduction on Mandarin HKUST and English NSC datasets respectively; 2)the proposed architecture works well with external LM and can be generalized to different amount of labelled data.

CLApr 8, 2019
Constrained Output Embeddings for End-to-End Code-Switching Speech Recognition with Only Monolingual Data

Yerbolat Khassanov, Haihua Xu, Van Tung Pham et al.

The lack of code-switch training data is one of the major concerns in the development of end-to-end code-switching automatic speech recognition (ASR) models. In this work, we propose a method to train an improved end-to-end code-switching ASR using only monolingual data. Our method encourages the distributions of output token embeddings of monolingual languages to be similar, and hence, promotes the ASR model to easily code-switch between languages. Specifically, we propose to use Jensen-Shannon divergence and cosine distance based constraints. The former will enforce output embeddings of monolingual languages to possess similar distributions, while the later simply brings the centroids of two distributions to be close to each other. Experimental results demonstrate high effectiveness of the proposed method, yielding up to 4.5% absolute mixed error rate improvement on Mandarin-English code-switching ASR task.

CLApr 8, 2019
Enriching Rare Word Representations in Neural Language Models by Embedding Matrix Augmentation

Yerbolat Khassanov, Zhiping Zeng, Van Tung Pham et al.

The neural language models (NLM) achieve strong generalization capability by learning the dense representation of words and using them to estimate probability distribution function. However, learning the representation of rare words is a challenging problem causing the NLM to produce unreliable probability estimates. To address this problem, we propose a method to enrich representations of rare words in pre-trained NLM and consequently improve its probability estimation performance. The proposed method augments the word embedding matrices of pre-trained NLM while keeping other parameters unchanged. Specifically, our method updates the embedding vectors of rare words using embedding vectors of other semantically and syntactically similar words. To evaluate the proposed method, we enrich the rare street names in the pre-trained NLM and use it to rescore 100-best hypotheses output from the Singapore English speech recognition system. The enriched NLM reduces the word error rate by 6% relative and improves the recognition accuracy of the rare words by 16% absolute as compared to the baseline NLM.

CLNov 1, 2018
On the End-to-End Solution to Mandarin-English Code-switching Speech Recognition

Zhiping Zeng, Yerbolat Khassanov, Van Tung Pham et al.

Code-switching (CS) refers to a linguistic phenomenon where a speaker uses different languages in an utterance or between alternating utterances. In this work, we study end-to-end (E2E) approaches to the Mandarin-English code-switching speech recognition (CSSR) task. We first examine the effectiveness of using data augmentation and byte-pair encoding (BPE) subword units. More importantly, we propose a multitask learning recipe, where a language identification task is explicitly learned in addition to the E2E speech recognition task. Furthermore, we introduce an efficient word vocabulary expansion method for language modeling to alleviate data sparsity issues under the code-switching scenario. Experimental results on the SEAME data, a Mandarin-English CS corpus, demonstrate the effectiveness of the proposed methods.

CLJun 16, 2018
Study of Semi-supervised Approaches to Improving English-Mandarin Code-Switching Speech Recognition

Pengcheng Guo, Haihua Xu, Lei Xie et al.

In this paper, we present our overall efforts to improve the performance of a code-switching speech recognition system using semi-supervised training methods from lexicon learning to acoustic modeling, on the South East Asian Mandarin-English (SEAME) data. We first investigate semi-supervised lexicon learning approach to adapt the canonical lexicon, which is meant to alleviate the heavily accented pronunciation issue within the code-switching conversation of the local area. As a result, the learned lexicon yields improved performance. Furthermore, we attempt to use semi-supervised training to deal with those transcriptions that are highly mismatched between human transcribers and ASR system. Specifically, we conduct semi-supervised training assuming those poorly transcribed data as unsupervised data. We found the semi-supervised acoustic modeling can lead to improved results. Finally, to make up for the limitation of the conventional n-gram language models due to data sparsity issue, we perform lattice rescoring using neural network language models, and significant WER reduction is obtained.

SDNov 6, 2017
Mandarin tone modeling using recurrent neural networks

Hao Huang, Ying Hu, Haihua Xu

We propose an Encoder-Classifier framework to model the Mandarin tones using recurrent neural networks (RNN). In this framework, extracted frames of features for tone classification are fed in to the RNN and casted into a fixed dimensional vector (tone embedding) and then classified into tone types using a softmax layer along with other auxiliary inputs. We investigate various configurations that help to improve the model, including pooling, feature splicing and utilization of syllable-level tone embeddings. Besides, tone embeddings and durations of the contextual syllables are exploited to facilitate tone classification. Experimental results on Mandarin tone classification show the proposed network setups improve tone classification accuracy. The results indicate that the RNN encoder-classifier based tone model flexibly accommodates heterogeneous inputs (sequential and segmental) and hence has the advantages from both the sequential classification tone models and segmental classification tone models.