CL Apr 8, 2022
Transducer-based language embedding for spoken language identification
Peng Shen, Xugang Lu, Hisashi Kawai
Acoustic and linguistic features are important cues for the spoken language identification (LID) task. Recent advanced LID systems mainly use acoustic features and lack explicit linguistic feature encoding. In this paper, we propose a novel transducer-based language embedding approach for LID tasks by integrating an RNN transducer model into a language embedding framework. Benefiting from the RNN transducer's linguistic representation capability, the proposed method can exploit both phonetically-aware acoustic features and explicit linguistic features for LID tasks. Experiments were carried out on the large-scale multilingual LibriSpeech and VoxLingua107 datasets. Experimental results showed that the proposed method significantly improves LID performance, with relative improvements of 12% to 59% on in-domain datasets and 16% to 24% on cross-domain datasets.
CV Apr 19
RS-HyRe-R1: A Hybrid Reward Mechanism to Overcome Perceptual Inertia for Remote Sensing Images Understanding
Gaozhi Zhou, Hu He, Peng Shen et al.
Reinforcement learning (RL) post-training substantially improves remote sensing vision-language models (RS-VLMs). However, when handling complex remote sensing imagery (RSI) requiring exhaustive visual scanning, models tend to rely on localized salient cues for rapid inference. We term this RL-induced bias "perceptual inertia". Driven by reward maximization, models favor quick outcome fitting, leading to two limitations: cognitively, overreliance on specific features impedes complete evidence construction; operationally, models struggle to flexibly shift visual focus across tasks. To address this bias and encourage comprehensive visual evidence mining, we propose RS-HyRe-R1, a hybrid reward framework for RSI understanding. It introduces: (1) a spatial reasoning activation reward that enforces structured visual reasoning; (2) a perception correctness reward that provides adaptive quality anchors across RS tasks, ensuring accurate geometric and semantic alignment; and (3) a visual-semantic path evolution reward that penalizes repetitive reasoning and promotes exploration of complementary cues to build richer evidence chains. Experiments show RS-HyRe-R1 effectively mitigates "perceptual inertia", encouraging deeper, more diverse reasoning. With only 3B parameters, it achieves state-of-the-art performance on REC, OVD, and VQA tasks, outperforming models up to 7B parameters. It also demonstrates strong zero-shot generalization, surpassing the second-best model by 3.16%, 3.97%, and 2.72% on VQA, OVD, and REC, respectively. Code and datasets are available at https://github.com/geox-lab/RS-HyRe-R1.
CL Jul 29, 2022
Pronunciation-aware unique character encoding for RNN Transducer-based Mandarin speech recognition
Peng Shen, Xugang Lu, Hisashi Kawai
For Mandarin end-to-end (E2E) automatic speech recognition (ASR) tasks, pronunciation-based modeling units can, compared to character-based modeling units, improve the sharing of modeling units in model training, but they suffer from homophone problems. In this study, we propose a novel pronunciation-aware unique character encoding for building E2E RNN-T-based Mandarin ASR systems. The proposed encoding is a combination of a pronunciation-based syllable and a character index (CI). By introducing the CI, the RNN-T model can overcome the homophone problem while utilizing the pronunciation information for extracting modeling units. With the proposed encoding, the model outputs can be converted into the final recognition result through a one-to-one mapping. We conducted experiments on the Aishell and MagicData datasets, and the experimental results showed the effectiveness of the proposed method.
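The idea of pairing a syllable with a character index can be illustrated with a minimal sketch. This is not the authors' implementation: the toy lexicon, the `syllable_index` unit format, and the function name `build_encoding` are assumptions made here for illustration, and characters with several pronunciations are ignored.

```python
from collections import defaultdict


def build_encoding(char_to_syllable):
    """Give every character a unique unit of the form '<syllable>_<index>'.

    The index counts characters that share the same syllable, so homophones
    get distinct units while still sharing the pronunciation part.
    """
    next_index = defaultdict(int)  # per-syllable counter
    char_to_unit = {}
    for char, syllable in char_to_syllable.items():
        char_to_unit[char] = f"{syllable}_{next_index[syllable]}"
        next_index[syllable] += 1
    # Units are unique, so the inverse mapping is one-to-one.
    unit_to_char = {unit: char for char, unit in char_to_unit.items()}
    return char_to_unit, unit_to_char


# Hypothetical toy lexicon: three homophones plus one other character.
toy_lexicon = {"他": "ta1", "她": "ta1", "它": "ta1", "们": "men5"}
c2u, u2c = build_encoding(toy_lexicon)

encoded = [c2u[ch] for ch in "她们"]      # units a model would emit
decoded = "".join(u2c[u] for u in encoded)  # one-to-one back-conversion
print(encoded, decoded)
```

In this sketch the homophones 他, 她, and 它 share the syllable part `ta1` but differ in the index, which is what lets the decoded text be recovered without a separate disambiguation step.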
SD Sep 3, 2024
Temporal Order Preserved Optimal Transport-based Cross-modal Knowledge Transfer Learning for ASR
Xugang Lu, Peng Shen, Yu Tsao et al.
Transferring linguistic knowledge from a pretrained language model (PLM) to an acoustic model has been shown to greatly improve the performance of automatic speech recognition (ASR). However, due to the heterogeneous feature distributions in cross-modalities, designing an effective model for feature alignment and knowledge transfer between linguistic and acoustic sequences remains a challenging task. Optimal transport (OT), which efficiently measures probability distribution discrepancies, holds great potential for aligning and transferring knowledge between acoustic and linguistic modalities. Nonetheless, the original OT treats acoustic and linguistic feature sequences as two unordered sets in alignment and neglects temporal order information during OT coupling estimation. Consequently, a time-consuming pretraining stage is required to learn a good alignment between the acoustic and linguistic representations. In this paper, we propose a Temporal Order Preserved OT (TOT)-based Cross-modal Alignment and Knowledge Transfer (CAKT) (TOT-CAKT) for ASR. In the TOT-CAKT, local neighboring frames of acoustic sequences are smoothly mapped to neighboring regions of linguistic sequences, preserving their temporal order relationship in feature alignment and matching. With the TOT-CAKT model framework, we conduct Mandarin ASR experiments with a pretrained Chinese PLM for linguistic knowledge transfer. Our results demonstrate that the proposed TOT-CAKT significantly improves ASR performance compared to several state-of-the-art models employing linguistic knowledge transfer, and addresses the weaknesses of the original OT-based method in sequential feature alignment for ASR.
AI Dec 4, 2024
STDCformer: A Transformer-Based Model with a Spatial-Temporal Causal De-Confounding Strategy for Crowd Flow Prediction
Silu He, Peng Shen, Pingzhen Xu et al.
Existing works typically treat spatial-temporal prediction as the task of learning a function $F$ to transform historical observations to future observations. We further decompose this cross-time transformation into three processes: (1) Encoding ($E$): learning the intrinsic representation of observations, (2) Cross-Time Mapping ($M$): transforming past representations into future representations, and (3) Decoding ($D$): reconstructing future observations from the future representations. From this perspective, spatial-temporal prediction can be viewed as learning $F = E \cdot M \cdot D$, which includes learning the space transformations $\{E, D\}$ between the observation space and the hidden representation space, as well as the spatial-temporal mapping $M$ from past states to future states within the representation space. This leads to two key questions. Q1: What kind of representation space allows for mapping the past to the future? Q2: How can the past be mapped to the future within the representation space? To address Q1, we propose a Spatial-Temporal Backdoor Adjustment strategy, which learns a Spatial-Temporal De-Confounded (STDC) representation space and estimates the de-confounded causal effect of historical data on future data. The causal relationship captured in this way serves as the foundation for subsequent spatial-temporal mapping. To address Q2, we design a Spatial-Temporal Embedding (STE) that fuses the information of temporal and spatial confounders, capturing the intrinsic spatial-temporal characteristics of the representations. Additionally, we introduce a Cross-Time Attention mechanism, which queries the attention between the future and the past to guide spatial-temporal mapping.
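As a rough illustration of attention queried between future and past steps, the sketch below applies generic scaled dot-product attention, with future-step vectors as queries and past-step vectors as keys and values. It is an assumption-laden stand-in, not the paper's Cross-Time Attention: learned projections, multiple heads, and the STE inputs are omitted, and the names `cross_time_attention` and `softmax` are introduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_time_attention(future_queries, past_keys, past_values):
    """Each future step attends over all past steps (no learned projections)."""
    dim = len(past_keys[0])
    outputs, weights = [], []
    for query in future_queries:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in past_keys
        ]
        attn = softmax(scores)  # one weight per past step
        weights.append(attn)
        # Weighted sum of past values, computed per feature dimension.
        outputs.append([
            sum(w * value[c] for w, value in zip(attn, past_values))
            for c in range(len(past_values[0]))
        ])
    return outputs, weights


# Toy example: 2 future steps query 3 past steps with 2-dimensional features.
past = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
future = [[1.0, 0.0], [0.0, 2.0]]
mapped, attn_weights = cross_time_attention(future, past, past)
```

Each row of `attn_weights` sums to one, so every future representation in `mapped` is a convex combination of past representations.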
SD Dec 18, 2023
Speaker Mask Transformer for Multi-talker Overlapped Speech Recognition
Peng Shen, Xugang Lu, Hisashi Kawai
Multi-talker overlapped speech recognition remains a significant challenge, requiring not only speech recognition but also speaker diarization tasks to be addressed. In this paper, to better address these tasks, we first introduce speaker labels into an autoregressive transformer-based speech recognition model to support multi-speaker overlapped speech recognition. Then, to improve speaker diarization, we propose a novel speaker mask branch to detect the speech segments of individual speakers. With the proposed model, we can perform both speech recognition and speaker diarization tasks simultaneously using a single model. Experimental results on the LibriSpeech-based overlapped dataset demonstrate the effectiveness of the proposed method in both speech recognition and speaker diarization tasks, particularly enhancing the accuracy of speaker diarization in relatively complex multi-talker scenarios.
CL Feb 21, 2025
Retrieval-Augmented Speech Recognition Approach for Domain Challenges
Peng Shen, Xugang Lu, Hisashi Kawai
Speech recognition systems often face challenges due to domain mismatch, particularly in real-world applications where domain-specific data is unavailable because of data accessibility and confidentiality constraints. Inspired by Retrieval-Augmented Generation (RAG) techniques for large language models (LLMs), this paper introduces an LLM-based retrieval-augmented speech recognition method that incorporates domain-specific textual data at the inference stage to enhance recognition performance. Rather than relying on domain-specific textual data during the training phase, our model is trained to learn how to utilize textual information provided in prompts for the LLM decoder to improve speech recognition performance. Benefiting from the advantages of the RAG retrieval mechanism, our approach efficiently accesses locally available domain-specific documents, ensuring a convenient and effective process for solving domain mismatch problems. Experiments conducted on the CSJ database demonstrate that the proposed method significantly improves speech recognition accuracy and achieves state-of-the-art results on the CSJ dataset, even without relying on the full training data.
CL Dec 18, 2023
Generative linguistic representation for spoken language identification
Peng Shen, Xuguang Lu, Hisashi Kawai
Effective extraction and application of linguistic features are central to the enhancement of spoken Language IDentification (LID) performance. With the success of recent large models, such as GPT and Whisper, the potential to leverage such pre-trained models for extracting linguistic features for LID tasks has become a promising area of research. In this paper, we explore the utilization of the decoder-based network from the Whisper model to extract linguistic features through its generative mechanism for improving the classification accuracy in LID tasks. We devised two strategies - one based on the language embedding method and the other focusing on direct optimization of LID outputs while simultaneously enhancing the speech recognition tasks. We conducted experiments on the large-scale multilingual datasets MLS, VoxLingua107, and CommonVoice to test our approach. The experimental results demonstrated the effectiveness of the proposed method on both in-domain and out-of-domain datasets for LID tasks.
CL Sep 6, 2025
New Insights into Optimal Alignment of Acoustic and Linguistic Representations for Knowledge Transfer in ASR
Xugang Lu, Peng Shen, Yu Tsao et al.
Aligning acoustic and linguistic representations is a central challenge in bridging pre-trained models for knowledge transfer in automatic speech recognition (ASR). This alignment is inherently structured and asymmetric: while multiple consecutive acoustic frames typically correspond to a single linguistic token (many-to-one), certain acoustic transition regions may relate to multiple adjacent tokens (one-to-many). Moreover, acoustic sequences often include frames with no linguistic counterpart, such as background noise or silence, which may lead to imbalanced matching conditions. In this work, we offer a new insight by regarding alignment and matching as a detection problem, where the goal is to identify meaningful correspondences with high precision and recall, ensuring full coverage of linguistic tokens while flexibly handling redundant or noisy acoustic frames when transferring linguistic knowledge for ASR. Based on this insight, we propose an unbalanced optimal transport-based alignment model that explicitly handles distributional mismatch and structural asymmetries with soft and partial matching between acoustic and linguistic modalities. Our method ensures that every linguistic token is grounded in at least one acoustic observation, while allowing for flexible, probabilistic mappings from acoustic to linguistic units. We evaluate the proposed model with experiments on a CTC-based ASR system with a pre-trained language model for knowledge transfer. Experimental results demonstrate the effectiveness of our approach in flexibly controlling the degree of matching and thereby improving ASR performance.
AS May 19, 2025
Cross-modal Knowledge Transfer Learning as Graph Matching Based on Optimal Transport for ASR
Xugang Lu, Peng Shen, Yu Tsao et al.
Transferring linguistic knowledge from a pretrained language model (PLM) to acoustic feature learning has proven effective in enhancing end-to-end automatic speech recognition (E2E-ASR). However, aligning representations between linguistic and acoustic modalities remains a challenge due to inherent modality gaps. Optimal transport (OT) has shown promise in mitigating these gaps by minimizing the Wasserstein distance (WD) between linguistic and acoustic feature distributions. However, previous OT-based methods overlook structural relationships, treating feature vectors as unordered sets. To address this, we propose Graph Matching Optimal Transport (GM-OT), which models linguistic and acoustic sequences as structured graphs. Nodes represent feature embeddings, while edges capture temporal and sequential relationships. GM-OT minimizes both WD (between nodes) and Gromov-Wasserstein distance (GWD) (between edges), leading to a fused Gromov-Wasserstein distance (FGWD) formulation. This enables structured alignment and more efficient knowledge transfer compared to existing OT-based approaches. Theoretical analysis further shows that prior OT-based methods in linguistic knowledge transfer can be viewed as a special case within our GM-OT framework. We evaluate GM-OT on Mandarin ASR using a CTC-based E2E-ASR system with a PLM for knowledge transfer. Experimental results demonstrate significant performance gains over state-of-the-art models, validating the effectiveness of our approach.
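For orientation, a fused Gromov-Wasserstein objective is commonly written in the form below. This is the standard textbook form and not necessarily the exact objective used in the paper; the symbols are notation introduced here: $\pi$ is a coupling between acoustic nodes $i, k$ and linguistic nodes $j, l$ with marginals $\mu$ and $\nu$, $C_{ij}$ is a node-to-node feature cost, $A$ and $B$ are the within-sequence edge (structure) matrices, and $\alpha \in [0, 1]$ balances the two terms.

$$
\mathrm{FGWD}_{\alpha}(\mu, \nu) = \min_{\pi \in \Pi(\mu, \nu)} \; (1 - \alpha) \sum_{i,j} C_{ij}\, \pi_{ij} \; + \; \alpha \sum_{i,j,k,l} \left| A_{ik} - B_{jl} \right|^{2} \pi_{ij}\, \pi_{kl}
$$

Under this form, setting $\alpha = 0$ leaves only the node-level Wasserstein term, which matches the abstract's remark that earlier OT-based transfer methods can be seen as a special case.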
LG Jan 25, 2025
Development and Validation of a Dynamic Kidney Failure Prediction Model based on Deep Learning: A Real-World Study with External Validation
Jingying Ma, Jinwei Wang, Lanlan Lu et al.
Background: Chronic kidney disease (CKD), a progressive disease with high morbidity and mortality, has become a significant global public health problem. Most existing models are static and fail to capture temporal trends in disease progression, limiting their ability to inform timely interventions. We address this gap by developing a dynamic model that leverages common longitudinal clinical indicators from real-world Electronic Health Records (EHRs) for real-time kidney failure prediction. Findings: A retrospective cohort of 4,587 patients from Yinzhou, China, was used for model development (2,752 patients for training, 917 patients for validation) and internal validation (918 patients), while external validation was conducted on a prospective PKUFH cohort (934 patients). The model demonstrated competitive performance across datasets, with an AUROC of 0.9311 (95%CI, 0.8873-0.9749) in the internal validation cohort and 0.8141 (95%CI, 0.7728-0.8554) in the external validation cohort, alongside progressively improving dynamic predictions, good calibration, and clinically consistent interpretability. KFDeep has been deployed on an open-access website and in primary care settings. Interpretation: The KFDeep model enables dynamic prediction of kidney failure without increasing clinical examination costs. It has been integrated into existing hospital systems, providing physicians with a continuously updated decision-support tool in routine care.
AS Mar 31, 2022
Partial Coupling of Optimal Transport for Spoken Language Identification
Xugang Lu, Peng Shen, Yu Tsao et al.
To reduce domain discrepancy and improve the performance of cross-domain spoken language identification (SLID) systems, we previously proposed a joint distribution alignment (JDA) model based on optimal transport (OT) as an unsupervised domain adaptation (UDA) method. A discrepancy measurement based on OT was adopted for JDA between training and test data sets. In our previous study, it was supposed that the training and test sets share the same label space. However, in real applications, the label space of the test set is only a subset of that of the training set. Fully matching training and test domains for distribution alignment may introduce negative domain transfer. In this paper, we propose a JDA model based on partial optimal transport (POT), i.e., only partial couplings of OT are allowed during JDA. Moreover, since the labels of the test data are unknown, a soft weighting on the coupling based on transport cost is adaptively set in the POT during domain alignment. Experiments were carried out on a cross-domain SLID task to evaluate the proposed UDA. Results showed that our proposed UDA significantly improved the performance due to the consideration of the partial couplings in OT.
AS Apr 7, 2021
Siamese Neural Network with Joint Bayesian Model Structure for Speaker Verification
Xugang Lu, Peng Shen, Yu Tsao et al.
Generative probability models are widely used for speaker verification (SV). However, generative models lack discriminative feature selection ability. As a hypothesis test, SV can be regarded as a binary classification task, which can be designed as a Siamese neural network (SiamNN) with discriminative training. However, in most discriminative training for SiamNN, only the distribution of pairwise sample distances is considered, and the additional discriminative information in the joint distribution of samples is ignored. In this paper, we propose a novel SiamNN that takes the joint distribution of samples into account. The joint distribution of samples is first formulated based on a joint Bayesian (JB) generative model, then a SiamNN is designed with dense layers to approximate the factorized affine transforms used in the JB model. By initializing the SiamNN with the learned model parameters of the JB model, we further train the model parameters with the pairwise samples as a binary discrimination task for SV. We carried out SV experiments on the Speakers in the Wild (SITW) and VoxCeleb corpora. Experimental results showed that our proposed model improved the performance by a large margin compared with state-of-the-art models for SV.
AS Jan 9, 2021
Coupling a generative model with a discriminative learning framework for speaker verification
Xugang Lu, Peng Shen, Yu Tsao et al.
The speaker verification (SV) task is to decide whether an utterance is spoken by a target or an imposter speaker. In most studies, a log-likelihood ratio (LLR) score is estimated based on a generative probability model on speaker features and compared with a threshold for making a decision. However, the generative model usually focuses on individual feature distributions, does not have discriminative feature selection ability, and is easily distracted by nuisance features. SV could instead be formulated as a binary discrimination task where neural network-based discriminative learning could be applied. In discriminative learning, the nuisance features could be removed with the help of label supervision. However, discriminative learning pays more attention to classification boundaries and is prone to overfitting to a training set, which may result in poor generalization on a test set. Thus, we propose a hybrid learning framework, i.e., coupling a joint Bayesian (JB) generative model structure and parameters with a neural discriminative learning framework for SV. A two-branch Siamese neural network is built with dense layers that are coupled with the factorized affine transforms used in the JB model. The LLR score estimation in the JB model is formulated according to the distance metric in the discriminative learning framework. By initializing the two-branch neural network with the generatively learned model parameters of the JB model, we train the model parameters with the pairwise samples as a binary discrimination task. Moreover, a direct evaluation metric in SV based on minimum empirical Bayes risk is designed and integrated as an objective function in discriminative learning. We carried out SV experiments on Speakers in the Wild (SITW) and VoxCeleb. Experimental results showed that our proposed model improved the performance by a large margin compared with state-of-the-art models for SV.
LG Dec 24, 2020
Unsupervised neural adaptation model based on optimal transport for spoken language identification
Xugang Lu, Peng Shen, Yu Tsao et al.
Due to the mismatch of statistical distributions of acoustic speech between training and testing sets, the performance of spoken language identification (SLID) could be drastically degraded. In this paper, we propose an unsupervised neural adaptation model to deal with the distribution mismatch problem for SLID. In our model, we explicitly formulate the adaptation as reducing the distribution discrepancy at both the feature and classifier levels between training and testing data sets. Moreover, inspired by the strong power of optimal transport (OT) to measure distribution discrepancy, a Wasserstein distance metric is designed in the adaptation loss. By minimizing the classification loss on the training data set together with the adaptation loss on both training and testing data sets, the statistical distribution difference between training and testing domains is reduced. We carried out SLID experiments on the oriental language recognition (OLR) challenge data corpus, where the training and testing data sets were collected under different conditions. Our results showed that significant improvements were achieved on the cross-domain test tasks.
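To make the notion of an OT-based discrepancy concrete, the sketch below estimates a transport cost between two small point sets with entropy-regularized OT (Sinkhorn iterations). This swaps in a standard approximation and is not the adaptation loss from the paper; the uniform sample weights, the squared Euclidean cost, the regularization value, and all function names are assumptions made for illustration.

```python
import math


def sq_dist_matrix(xs, ys):
    """Pairwise squared Euclidean distances between two lists of points."""
    return [[sum((p - q) ** 2 for p, q in zip(x, y)) for y in ys] for x in xs]


def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    """Entropy-regularized OT plan between uniform weights (Sinkhorn)."""
    n, m = len(cost), len(cost[0])
    a, b = [1.0 / n] * n, [1.0 / m] * m
    scale = max(max(row) for row in cost) or 1.0  # keep exp() well-conditioned
    kernel = [[math.exp(-c / (reg * scale)) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale so column sums match b, then row sums match a.
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def transport_cost(plan, cost):
    """Total cost of moving mass according to the plan."""
    return sum(p * c for prow, crow in zip(plan, cost) for p, c in zip(prow, crow))


# Toy "training" features and two "testing" sets: slightly and strongly shifted.
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
test_near = [(x + 0.1, y + 0.1) for x, y in train]
test_far = [(x + 3.0, y + 3.0) for x, y in train]

cost_near = sq_dist_matrix(train, test_near)
cost_far = sq_dist_matrix(train, test_far)
plan_near = sinkhorn_plan(cost_near)
near = transport_cost(plan_near, cost_near)
far = transport_cost(sinkhorn_plan(cost_far), cost_far)
print(near < far)
```

The strongly shifted set yields the larger transport cost, which is the property that makes such a quantity usable as a discrepancy term to minimize during adaptation.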
SD Dec 27, 2019
Cross-scale Attention Model for Acoustic Event Classification
Xugang Lu, Peng Shen, Sheng Li et al.
A major advantage of a deep convolutional neural network (CNN) is that the focused receptive field size is increased by stacking multiple convolutional layers. Accordingly, the model can explore the long-range dependency of features from the top layers. However, a potential limitation of the network is that the discriminative features from the bottom layers (which can model the short-range dependency) are smoothed out in the final representation. This limitation is especially evident in the acoustic event classification (AEC) task, where both short- and long-duration events are involved in an audio clip and need to be classified. In this paper, we propose a cross-scale attention (CSA) model, which explicitly integrates features from different scales to form the final representation. Moreover, we propose the adoption of the attention mechanism to specify the weights of local and global features based on the spatial and temporal characteristics of acoustic events. Using mathematical formulations, we further reveal that the proposed CSA model can be regarded as a weighted residual CNN (ResCNN) model when the ResCNN is used as a backbone model. We tested the proposed model on two AEC datasets: one is an urban AEC task, and the other is an AEC task in smart car environments. Experimental results show that the proposed CSA model can effectively improve the performance of current state-of-the-art deep learning algorithms.