Bi-directional Context-Enhanced Speech Large Language Models for Multilingual Conversational ASR
This work addresses the challenge of accurate speech recognition in multilingual conversational settings. Its contribution is incremental: it builds on existing ASR methods by adding contextual enhancements.
The paper tackles the problem of improving multilingual continuous conversational automatic speech recognition by integrating language-specific bi-directional context into a speech large language model, achieving an 18% relative improvement over a strong baseline on an 11-language corpus.
This paper integrates language-specific bi-directional context into a speech large language model (SLLM) to improve multilingual continuous conversational automatic speech recognition (ASR). We propose a character-level contextual masking strategy for training, which randomly removes portions of the context to improve robustness and to better emulate the flawed transcriptions that may occur during inference. Decoding uses a two-stage pipeline: each segment is first decoded in isolation, then re-decoded with the hypotheses of its neighboring segments as context. Evaluated on the 1500-hour Multilingual Conversational Speech and Language Model (MLC-SLM) corpus covering eleven languages, our method achieves an 18% relative improvement over a strong baseline and outperforms even the model trained on 6000 hours of data for the MLC-SLM competition. These results underscore the significant benefit of incorporating contextual information in multilingual continuous conversational ASR.
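The masking strategy and the two-stage decoding described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `mask_ratio` parameter, the independent per-character drop rule, the use of only the immediately preceding and following segments as context, and the `decode` callable signature are all assumptions made for illustration.

```python
import random
from typing import Callable, List, Optional


def mask_context(text: str, mask_ratio: float, rng: random.Random) -> str:
    """Drop each character of a context string independently with
    probability mask_ratio (an assumed masking rule, for illustration)."""
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    # rng.random() is in [0, 1), so ratio 0 keeps everything and ratio 1 drops everything.
    return "".join(ch for ch in text if rng.random() >= mask_ratio)


# Hypothetical decoder interface:
# (segment, previous_context, following_context) -> hypothesis text
Decoder = Callable[[str, Optional[str], Optional[str]], str]


def two_stage_decode(segments: List[str], decode: Decoder) -> List[str]:
    """Stage 1 decodes every segment without context; stage 2 re-decodes
    each segment using the stage-1 hypotheses of its neighbors."""
    # Stage 1: isolated decoding, no context on either side.
    first_pass = [decode(seg, None, None) for seg in segments]

    # Stage 2: context-aware re-decoding with neighboring hypotheses.
    second_pass = []
    for i, seg in enumerate(segments):
        prev_ctx = first_pass[i - 1] if i > 0 else None
        next_ctx = first_pass[i + 1] if i + 1 < len(segments) else None
        second_pass.append(decode(seg, prev_ctx, next_ctx))
    return second_pass
```

In training, `mask_context` would be applied to the reference transcripts of neighboring segments before they are passed to the model as context, so that the model does not learn to rely on context that is cleaner than the stage-1 hypotheses it sees at inference time.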