NCOct 7, 2023
Do self-supervised speech and language models extract similar representations as human brain?Peili Chen, Linyang He, Li Fu et al.
Speech and language models trained through self-supervised learning (SSL) demonstrate strong alignment with brain activity during speech and language perception. However, given their distinct training modalities, it remains unclear whether they correlate with the same neural aspects. We directly address this question by evaluating the brain prediction performance of two representative SSL models, Wav2Vec2.0 and GPT-2, designed for speech and language tasks. Our findings reveal that both models accurately predict speech responses in the auditory cortex, with a significant correlation between their brain predictions. Notably, shared speech contextual information between Wav2Vec2.0 and GPT-2 accounts for the majority of explained variance in brain activity, surpassing static semantic and lower-level acoustic-phonetic information. These results underscore the convergence of speech contextual representations in SSL models and their alignment with the neural network underlying speech perception, offering valuable insights into both SSL models and the neural basis of speech and language processing.
CLSep 16, 2025
PAC: Pronunciation-Aware Contextualized Large Language Model-based Automatic Speech RecognitionLi Fu, Yu Xin, Sunlu Zeng et al.
This paper presents a Pronunciation-Aware Contextualized (PAC) framework to address two key challenges in Large Language Model (LLM)-based Automatic Speech Recognition (ASR) systems: effective pronunciation modeling and robust homophone discrimination. Both are essential for raw or long-tail word recognition. The proposed approach adopts a two-stage learning paradigm. First, we introduce a pronunciation-guided context learning method. It employs an interleaved grapheme-phoneme context modeling strategy that incorporates grapheme-only distractors, encouraging the model to leverage phonemic cues for accurate recognition. Then, we propose a pronunciation-discriminative reinforcement learning method with perturbed label sampling to further enhance the modelś ability to distinguish contextualized homophones. Experimental results on the public English Librispeech and Mandarin AISHELL-1 datasets indicate that PAC: (1) reduces relative Word Error Rate (WER) by 30.2% and 53.8% compared to pre-trained LLM-based ASR models, and (2) achieves 31.8% and 60.5% relative reductions in biased WER for long-tail words compared to strong baselines, respectively.
ASOct 8, 2021
SCaLa: Supervised Contrastive Learning for End-to-End Speech RecognitionLi Fu, Xiaoxiao Li, Runyu Wang et al.
End-to-end Automatic Speech Recognition (ASR) models are usually trained to optimize the loss of the whole token sequence, while neglecting explicit phonemic-granularity supervision. This could result in recognition errors due to similar-phoneme confusion or phoneme reduction. To alleviate this problem, we propose a novel framework based on Supervised Contrastive Learning (SCaLa) to enhance phonemic representation learning for end-to-end ASR systems. Specifically, we extend the self-supervised Masked Contrastive Predictive Coding (MCPC) to a fully-supervised setting, where the supervision is applied in the following way. First, SCaLa masks variable-length encoder features according to phoneme boundaries given phoneme forced-alignment extracted from a pre-trained acoustic model; it then predicts the masked features via contrastive learning. The forced-alignment can provide phoneme labels to mitigate the noise introduced by positive-negative pairs in self-supervised MCPC. Experiments on reading and spontaneous speech datasets show that our proposed approach achieves 2.8 and 1.4 points Character Error Rate (CER) absolute reductions compared to the baseline, respectively.
ASMay 11, 2020
Incremental Learning for End-to-End Automatic Speech RecognitionLi Fu, Xiaoxiao Li, Libo Zi et al.
In this paper, we propose an incremental learning method for end-to-end Automatic Speech Recognition (ASR) which enables an ASR system to perform well on new tasks while maintaining the performance on its originally learned ones. To mitigate catastrophic forgetting during incremental learning, we design a novel explainability-based knowledge distillation for ASR models, which is combined with a response-based knowledge distillation to maintain the original model's predictions and the "reason" for the predictions. Our method works without access to the training data of original tasks, which addresses the cases where the previous data is no longer available or joint training is costly. Results on a multi-stage sequential training task show that our method outperforms existing ones in mitigating forgetting. Furthermore, in two practical scenarios, compared to the target-reference joint training method, the performance drop of our method is 0.02% Character Error Rate (CER), which is 97% smaller than the drops of the baseline methods.
ASApr 26, 2020
Research on Modeling Units of Transformer Transducer for Mandarin Speech RecognitionLi Fu, Xiaoxiao Li, Libo Zi
Modeling unit and model architecture are two key factors of Recurrent Neural Network Transducer (RNN-T) in end-to-end speech recognition. To improve the performance of RNN-T for Mandarin speech recognition task, a novel transformer transducer with the combination architecture of self-attention transformer and RNN is proposed. And then the choice of different modeling units for transformer transducer is explored. In addition, we present a new mix-bandwidth training method to obtain a general model that is able to accurately recognize Mandarin speech with different sampling rates simultaneously. All of our experiments are conducted on about 12,000 hours of Mandarin speech with sampling rate in 8kHz and 16kHz. Experimental results show that Mandarin transformer transducer using syllable with tone achieves the best performance. It yields an average of 14.4% and 44.1% relative Word Error Rate (WER) reduction when compared with the models using syllable initial/final with tone and Chinese character, respectively. Also, it outperforms the model based on syllable initial/final with tone with an average of 13.5% relative Character Error Rate (CER) reduction.
CVSep 11, 2019
Learning Enhanced Resolution-wise features for Human Pose EstimationKun Zhang, Peng He, Ping Yao et al.
Recently, multi-resolution networks (such as Hourglass, CPN, HRNet, etc.) have achieved significant performance on pose estimation by combining feature maps of various resolutions. In this paper, we propose a Resolution-wise Attention Module (RAM) and Gradual Pyramid Refinement (GPR), to learn enhanced resolution-wise feature maps for precise pose estimation. Specifically, RAM learns a group of weights to represent the different importance of feature maps across resolutions, and the GPR gradually merges every two feature maps from low to high resolutions to regress final human keypoint heatmaps. With the enhanced resolution-wise features learnt by CNN, we obtain more accurate human keypoint locations. The efficacies of our proposed methods are demonstrated on MS-COCO dataset, achieving state-of-the-art performance with average precision of 77.7 on COCO val2017 set and 77.0 on test-dev2017 set without using extra human keypoint training dataset.
LODec 30, 2013
Defining implication relation for classical logicLi Fu
In classical logic, "P implies Q" is equivalent to "not-P or Q". It is well known that the equivalence is problematic. Actually, from "P implies Q", "not-P or Q" can be inferred ("Implication-to-Disjunction" is valid), whereas from "not-P or Q", "P implies Q" cannot be inferred in general ("Disjunction-to-Implication" is not generally valid), so the equivalence between them is invalid in general. This work aims to remove the incorrect Disjunction-to-Implication from classical logic (CL). The logical system (the logic IRL) this paper proposes has the expected properties: (a) CL is obtained by adding Disjunction-to-Implication to IRL, and (b) Disjunction-to-Implication is not derivable in IRL; while (c) fundamental laws in classical logic, including law of excluded middle (LEM) and principle of double negation, law of non-contradiction (LNC) and ex contradictione quodlibet (ECQ), conjunction elimination and disjunction introduction, and hypothetical syllogism and disjunctive syllogism, are all retained in IRL.