OCR-Enhanced Multimodal ASR Can Read While Listening

Junli Chen, Changli Tang, Yixuan Li, Guangzhi Sun, Chao Zhang

arXiv:2601.18393v1

h-index: 16

Originality: Incremental advance

AI Analysis

This work addresses speech recognition for multilingual video content, such as movies with subtitles, but is incremental because it builds on existing audio-visual methods.

The paper improves automatic speech recognition (ASR) by leveraging visual information such as on-screen subtitles, reporting an absolute 5.75% WER reduction on the English set and an absolute 16.5% CER reduction on the Chinese set compared to the Whisper baseline.

Visual information, such as subtitles in a movie, often helps automatic speech recognition. In this paper, we propose Donut-Whisper, an audio-visual ASR model with a dual encoder that leverages visual information to improve speech recognition performance in both English and Chinese. Donut-Whisper combines the advantages of the linear and the Q-Former-based modality alignment structures via a cross-attention module, generating more powerful audio-visual features. We also propose a lightweight knowledge distillation scheme, showcasing the potential of using audio-visual models to teach audio-only models to achieve better performance. Moreover, we propose a new multilingual audio-visual speech recognition dataset based on movie clips, containing both Chinese and English partitions. As a result, Donut-Whisper achieved significantly better performance on both the English and Chinese partitions of the dataset compared to both the Donut and Whisper large V3 baselines. In particular, an absolute 5.75% WER reduction and an absolute 16.5% CER reduction were achieved on the English and Chinese sets, respectively, compared to the Whisper ASR baseline.
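The reported gains are absolute reductions in word error rate (WER, English) and character error rate (CER, Chinese). As a minimal sketch of what those metrics measure, the snippet below computes both with a standard Levenshtein edit distance. It is not the paper's evaluation code: the function names (`edit_distance`, `wer`, `cer`, `absolute_reduction`) are hypothetical, and any text normalization the authors may apply before scoring is not modeled here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate, ignoring whitespace (an assumption of this sketch)."""
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


def absolute_reduction(baseline_rate, new_rate):
    """Absolute reduction in percentage points between two error rates in [0, 1]."""
    return (baseline_rate - new_rate) * 100.0
```

For example, `wer("the cat sat on the mat", "the cat sat on mat")` counts one deleted word out of six reference words. An "absolute" reduction is the plain difference between two rates in percentage points, as opposed to a relative reduction, which divides that difference by the baseline rate. The example sentences here are illustrative and do not come from the paper's dataset.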
