Xinlu He

CL
h-index7
5papers
24citations
Novelty45%
AI Score45

5 Papers

5.2CLMay 27
Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio

Xinlu He, Jacob Whitehill

Monaural multi-speaker automatic speech recognition (ASR) remains challenging due to data scarcity and the intrinsic difficulty of recognizing and attributing words to individual speakers, particularly in overlapping speech. Recent advances have driven the shift from cascade systems to end-to-end (E2E) architectures, which reduce error propagation and better exploit the synergy between speech content and speaker identity. Despite rapid progress in E2E multi-speaker ASR, the field lacks a comprehensive review of recent developments. This survey provides a systematic taxonomy of E2E neural approaches for multi-speaker ASR, highlighting recent advances and comparative analysis. Specifically, we analyze: (1) architectural paradigms (SIMO vs.~SISO) for pre-segmented audio, analyzing their distinct characteristics and trade-offs; (2) recent architectural and algorithmic improvements based on these two paradigms; (3) extensions to long-form speech, including segmentation strategy and speaker-consistent hypothesis stitching. Further, we (4) evaluate and compare methods across standard benchmarks. We conclude with a discussion of open challenges and future research directions towards building robust and scalable multi-speaker ASR.

CLSep 22, 2025
Interactive Real-Time Speaker Diarization Correction with Human Feedback

Xinlu He, Yiwen Guan, Badrivishal Paurana et al.

Most automatic speech processing systems operate in "open loop" mode without user feedback about who said what; yet, human-in-the-loop workflows can potentially enable higher accuracy. We propose an LLM-assisted speaker diarization correction system that lets users fix speaker attribution errors in real time. The pipeline performs streaming ASR and diarization, uses an LLM to deliver concise summaries to the users, and accepts brief verbal feedback that is immediately incorporated without disrupting interactions. Moreover, we develop techniques to make the workflow more effective: First, a split-when-merged (SWM) technique detects and splits multi-speaker segments that the ASR erroneously attributes to just a single speaker. Second, online speaker enrollments are collected based on users' diarization corrections, thus helping to prevent speaker diarization errors from occurring in the future. LLM-driven simulations on the AMI test set indicate that our system substantially reduces DER by 9.92% and speaker confusion error by 44.23%. We further analyze correction efficacy under different settings, including summary vs full transcript display, the number of online enrollments limitation, and correction frequency.

CLJun 12, 2025
Improving Named Entity Transcription with Contextual LLM-based Revision

Viet Anh Trinh, Xinlu He, Jacob Whitehill

With recent advances in modeling and the increasing amount of supervised training data, automatic speech recognition (ASR) systems have achieved remarkable performance on general speech. However, the word error rate (WER) of state-of-the-art ASR remains high for named entities. Since named entities are often the most critical keywords, misrecognizing them can affect all downstream applications, especially when the ASR system functions as the front end of a complex system. In this paper, we introduce a large language model (LLM) revision mechanism to revise incorrect named entities in ASR predictions by leveraging the LLM's reasoning ability as well as local context (e.g., lecture notes) containing a set of correct named entities. Finally, we introduce the NER-MIT-OpenCourseWare dataset, containing 45 hours of data from MIT courses for development and testing. On this dataset, our proposed technique achieves up to 30\% relative WER reduction for named entities.

CLJun 4, 2024
Discrete Multimodal Transformers with a Pretrained Large Language Model for Mixed-Supervision Speech Processing

Viet Anh Trinh, Rosy Southwell, Yiwen Guan et al.

Recent work on discrete speech tokenization has paved the way for models that can seamlessly perform multiple tasks across modalities, e.g., speech recognition, text to speech, speech to speech translation. Moreover, large language models (LLMs) pretrained from vast text corpora contain rich linguistic information that can improve accuracy in a variety of tasks. In this paper, we present a decoder-only Discrete Multimodal Language Model (DMLM), which can be flexibly applied to multiple tasks (ASR, T2S, S2TT, etc.) and modalities (text, speech, vision). We explore several critical aspects of discrete multi-modal models, including the loss function, weight initialization, mixed training supervision, and codebook. Our results show that DMLM benefits significantly, across multiple tasks and datasets, from a combination of supervised and unsupervised training. Moreover, for ASR, it benefits from initializing DMLM from a pretrained LLM, and from a codebook derived from Whisper activations.

LGSep 9, 2021
Compositional Clustering: Applications to Multi-Label Object Recognition and Speaker Identification

Zeqian Li, Xinlu He, Jacob Whitehill

We consider a novel clustering task in which clusters can have compositional relationships, e.g., one cluster contains images of rectangles, one contains images of circles, and a third (compositional) cluster contains images with both objects. In contrast to hierarchical clustering in which a parent cluster represents the intersection of properties of the child clusters, our problem is about finding compositional clusters that represent the union of the properties of the constituent clusters. This task is motivated by recently developed few-shot learning and embedding models can distinguish the label sets, not just the individual labels, assigned to the examples. We propose three new algorithms -- Compositional Affinity Propagation (CAP), Compositional k-means (CKM), and Greedy Compositional Reassignment (GCR) -- that can partition examples into coherent groups and infer the compositional structure among them. We show promising results, compared to popular algorithms such as Gaussian mixtures, Fuzzy c-means, and Agglomerative Clustering, on the OmniGlot and LibriSpeech datasets. Our work has applications to open-world multi-label object recognition and speaker identification & diarization with simultaneous speech from multiple speakers.