Index-MSR: A high-efficiency multimodal fusion framework for speech recognition
This work addresses speech recognition challenges in applications requiring strict audio-text synchronization, such as audio translation, by efficiently exploiting text-related video cues (e.g., subtitles and presentation slides), though its use of multimodal data is incremental.
The paper tackles the problem of degraded speech recognition performance for domain-specific terminology and short utterances by introducing Index-MSR, a multimodal fusion framework that incorporates text-related information from videos, achieving state-of-the-art accuracy with substitution errors reduced by 20-50%.
Driven by large-scale datasets and LLM-based architectures, automatic speech recognition (ASR) systems have achieved remarkable improvements in accuracy. However, challenges persist for domain-specific terminology and for short utterances lacking semantic coherence, where recognition performance often degrades significantly. In this work, we present Index-MSR, an efficient multimodal speech recognition framework. At its core is a novel Multimodal Fusion Decoder (MFD), which incorporates text-related information from videos (e.g., subtitles and presentation slides) into speech recognition. This cross-modal integration not only enhances overall ASR accuracy but also yields substantial reductions in substitution errors. Extensive evaluations on both an in-house subtitle dataset and a public AVSR dataset demonstrate that Index-MSR achieves state-of-the-art accuracy, with substitution errors reduced by 20-50%. These results show that our approach efficiently exploits text-related cues from video to improve speech recognition accuracy, with strong potential in applications requiring strict audio-text synchronization, such as audio translation.
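For readers unfamiliar with the metric, the substitution errors mentioned above are one of the three edit types counted when scoring a hypothesis against a reference transcript. The abstract does not say which scoring tool or tokenization was used, so the following is only a minimal sketch of the standard word-level edit-distance breakdown, assuming whitespace tokenization; the function name `wer_breakdown` is hypothetical and not taken from the paper.

```python
def wer_breakdown(reference: str, hypothesis: str) -> dict:
    """Count substitutions, deletions and insertions via word-level edit distance.

    Assumption: whitespace tokenization; the paper's own scoring setup is not stated.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    n, m = len(ref), len(hyp)

    # dp[i][j] = (total_edits, subs, dels, ins) aligning ref[:i] with hyp[:j]
    dp = [[(0, 0, 0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = (i, 0, i, 0)   # delete every reference word
    for j in range(1, m + 1):
        dp[0][j] = (j, 0, 0, j)   # insert every hypothesis word

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e, s, d, ins = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (e, s, d, ins)            # exact match, no cost
            else:
                diag = (e + 1, s + 1, d, ins)    # substitution
            e, s, d, ins = dp[i - 1][j]
            up = (e + 1, s, d + 1, ins)          # deletion
            e, s, d, ins = dp[i][j - 1]
            left = (e + 1, s, d, ins + 1)        # insertion
            # Pick the cheapest path; ties prefer the diagonal move.
            dp[i][j] = min(diag, up, left, key=lambda t: t[0])

    edits, subs, dels, ins = dp[n][m]
    return {
        "substitutions": subs,
        "deletions": dels,
        "insertions": ins,
        "wer": edits / n,
    }


if __name__ == "__main__":
    # A single misrecognized word shows up as one substitution.
    print(wer_breakdown("open the pod bay doors", "open the pot bay doors"))
```

Under this breakdown, a misrecognized domain term typically appears as a substitution rather than a deletion or insertion, which is why that count is a natural place to look for the effect of subtitle or slide text.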