SD · AS · Mar 7

Seeing the Context: Rich Visual Context-Aware Speech Recognition via Multimodal Reasoning

arXiv:2603.07263v1 · Has Code
Predicted impact: top 10% in SD · last 90 days
Originality: Incremental advance
AI Analysis

This work improves speech recognition accuracy by using visual context beyond lip motion, such as the speaking scene and on-screen text. It is an incremental improvement over existing AVSR methods.

This paper addresses CAVSR, that is, audio-visual speech recognition (AVSR) that draws on rich visual context beyond lip motion, by proposing VASR. The method constructs an Audio-Visual Chain-of-Thought (AV-CoT) that explicitly enforces intermediate cross-modal grounding between acoustic signals and visual evidence. The authors report that this mitigates single-modality dominance and achieves state-of-the-art performance in CAVSR.

Audio-visual speech recognition (AVSR) is an extension of ASR that incorporates visual signals. Current AVSR approaches primarily focus on lip motion, largely overlooking the rich context present in the video, such as the speaking scene and on-screen text. To tackle this setting, CAVSR (AVSR including rich visual context), we propose VASR, designed to "see" and reason over the visual context to improve speech recognition. Specifically, we construct an Audio-Visual Chain-of-Thought (AV-CoT) that explicitly enforces intermediate cross-modal grounding between acoustic signals and visual evidence. This evidence-driven reasoning mitigates the "single-modality dominance" problem, where models either over-rely on visual context or fail to utilize it. In addition, to address data scarcity, we construct and release a corresponding data pipeline and test set. Experiments show that AV-CoT effectively mitigates single-modality dominance, achieving state-of-the-art performance in CAVSR. The project is open-sourced.

