Mohan Shi

LG
h-index35
5papers
35citations
Novelty44%
AI Score38

5 Papers

ASAug 30, 2024
Advancing Multi-talker ASR Performance with Large Language Models

Mohan Shi, Zengrui Jin, Yaoxun Xu et al.

Recognizing overlapping speech from multiple speakers in conversational scenarios is one of the most challenging problem for automatic speech recognition (ASR). Serialized output training (SOT) is a classic method to address multi-talker ASR, with the idea of concatenating transcriptions from multiple speakers according to the emission times of their speech for training. However, SOT-style transcriptions, derived from concatenating multiple related utterances in a conversation, depend significantly on modeling long contexts. Therefore, compared to traditional methods that primarily emphasize encoder performance in attention-based encoder-decoder (AED) architectures, a novel approach utilizing large language models (LLMs) that leverages the capabilities of pre-trained decoders may be better suited for such complex and challenging scenarios. In this paper, we propose an LLM-based SOT approach for multi-talker ASR, leveraging pre-trained speech encoder and LLM, fine-tuning them on multi-talker dataset using appropriate strategies. Experimental results demonstrate that our approach surpasses traditional AED-based methods on the simulated dataset LibriMix and achieves state-of-the-art performance on the evaluation set of the real-world dataset AMI, outperforming the AED model trained with 1000 times more supervised data in previous works.

SDSep 1, 2024
LibriheavyMix: A 20,000-Hour Dataset for Single-Channel Reverberant Multi-Talker Speech Separation, ASR and Speaker Diarization

Zengrui Jin, Yifan Yang, Mohan Shi et al.

The evolving speech processing landscape is increasingly focused on complex scenarios like meetings or cocktail parties with multiple simultaneous speakers and far-field conditions. Existing methodologies for addressing these challenges fall into two categories: multi-channel and single-channel solutions. Single-channel approaches, notable for their generality and convenience, do not require specific information about microphone arrays. This paper presents a large-scale far-field overlapping speech dataset, crafted to advance research in speech separation, recognition, and speaker diarization. This dataset is a critical resource for decoding ``Who said What and When'' in multi-talker, reverberant environments, a daunting challenge in the field. Additionally, we introduce a pipeline system encompassing speech separation, recognition, and diarization as a foundational benchmark. Evaluations on the WHAMR! dataset validate the broad applicability of the proposed data.

ASFeb 5
STACodec: Semantic Token Assignment for Balancing Acoustic Fidelity and Semantic Information in Audio Codecs

Kaiyuan Zhang, Mohan Shi, Eray Eren et al.

Neural audio codecs are widely used for audio compression and can be integrated into token-based language models. Traditional codecs preserve acoustic details well but lack semantic information. Recent hybrid codecs attempt to incorporate semantic information through distillation, but this often degrades reconstruction performance, making it difficult to achieve both. To address this limitation, we introduce STACodec, a unified codec that integrates semantic information from self-supervised learning (SSL) models into the first layer of residual vector quantization (RVQ-1) via semantic token assignment (STA). To further eliminate reliance on SSL-based semantic tokenizers and improve efficiency during inference, we propose a semantic pre-distillation (SPD) module, which predicts semantic tokens directly for assignment to the first RVQ layer during inference. Experimental results show that STACodec outperforms existing hybrid codecs in both audio reconstruction and downstream semantic tasks, demonstrating a better balance between acoustic fidelity and semantic capability.

LGAug 22, 2025
Benchmarking Training Paradigms, Dataset Composition, and Model Scaling for Child ASR in ESPnet

Anyu Ying, Natarajan Balaji Shankar, Chyi-Jiunn Lin et al.

Despite advancements in ASR, child speech recognition remains challenging due to acoustic variability and limited annotated data. While fine-tuning adult ASR models on child speech is common, comparisons with flat-start training remain underexplored. We compare flat-start training across multiple datasets, SSL representations (WavLM, XEUS), and decoder architectures. Our results show that SSL representations are biased toward adult speech, with flat-start training on child speech mitigating these biases. We also analyze model scaling, finding consistent improvements up to 1B parameters, beyond which performance plateaus. Additionally, age-related ASR and speaker verification analysis highlights the limitations of proprietary models like Whisper, emphasizing the need for open-data models for reliable child speech research. All investigations are conducted using ESPnet, and our publicly available benchmark provides insights into training strategies for robust child speech processing.

LGMar 19, 2019
Random Pairwise Shapelets Forest

Mohan Shi, Zhihai Wang, Jodong Yuan et al.

Shapelet is a discriminative subsequence of time series. An advanced shapelet-based method is to embed shapelet into accurate and fast random forest. However, it shows several limitations. First, random shapelet forest requires a large training cost for split threshold searching. Second, a single shapelet provides limited information for only one branch of the decision tree, resulting in insufficient accuracy and interpretability. Third, randomized ensemble causes interpretability declining. For that, this paper presents Random Pairwise Shapelets Forest (RPSF). RPSF combines a pair of shapelets from different classes to construct random forest. It omits threshold searching to be more efficient, includes more information for each node of the forest to be more effective. Moreover, a discriminability metric, Decomposed Mean Decrease Impurity (DMDI), is proposed to identify influential region for every class. Extensive experiments show RPSF improves the accuracy and training speed of shapelet-based forest. Case studies demonstrate the interpretability of our method.