SD CL AS Jan 26

OCR-Enhanced Multimodal ASR Can Read While Listening

arXiv:2601.18393v1
h-index: 16
Originality: Incremental advance
AI Analysis

This work addresses speech recognition challenges in multilingual contexts, such as movies, but is incremental as it builds on existing audio-visual methods.

The paper tackles automatic speech recognition (ASR) by exploiting on-screen visual information such as subtitles, reporting an absolute 5.75% WER reduction on the English set and an absolute 16.5% CER reduction on the Chinese set compared with the Whisper baseline.

Visual information, such as subtitles in a movie, often helps automatic speech recognition. In this paper, we propose Donut-Whisper, an audio-visual ASR model with a dual encoder that leverages visual information to improve speech recognition performance in both English and Chinese. Donut-Whisper combines the advantages of the linear and the Q-Former-based modality alignment structures via a cross-attention module, generating more powerful audio-visual features. Meanwhile, we propose a lightweight knowledge distillation scheme showcasing the potential of using audio-visual models to teach audio-only models to achieve better performance. Moreover, we propose a new multilingual audio-visual speech recognition dataset based on movie clips, containing both Chinese and English partitions. As a result, Donut-Whisper achieved significantly better performance on both the English and Chinese partitions of the dataset compared to both the Donut and Whisper large V3 baselines. In particular, an absolute 5.75% WER reduction and an absolute 16.5% CER reduction were achieved on the English and Chinese sets respectively, compared to the Whisper ASR baseline.
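For readers unfamiliar with the metrics quoted above, the following is a minimal sketch of how word error rate (WER), character error rate (CER), and an absolute reduction are commonly computed. It is not the paper's scoring script: the helper names (`edit_distance`, `wer`, `cer`, `absolute_reduction`) are hypothetical, and the sketch assumes whitespace tokenization for WER and space-stripped characters for CER, with no further text normalization. The abstract does not state the authors' normalization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = ref.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref, hyp):
    """Character error rate, ignoring spaces (an assumption of this sketch)."""
    ref_chars = list(ref.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, list(hyp.replace(" ", ""))) / len(ref_chars)


def absolute_reduction(baseline_rate, system_rate):
    """Absolute reduction: a plain difference of rates, not a ratio."""
    return baseline_rate - system_rate
```

Under this reading, an "absolute 5.75% WER reduction" means the system's WER is 5.75 percentage points lower than the baseline's. A relative reduction would instead divide that difference by the baseline WER.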
