SDAIAS
Jun 8, 2025

Speech Recognition on TV Series with Video-guided Post-ASR Correction

arXiv:2506.07323v2
h-index: 4
Originality: Incremental advance
AI Analysis

The paper addresses transcription challenges in multimedia content such as TV series, but the advance is incremental: it builds on existing ASR methods by adding video context.

The paper tackles the problem of low transcription accuracy in TV series due to multiple speakers and overlapping speech by proposing a Video-Guided Post-ASR Correction framework, which improves accuracy on a TV-series benchmark.

Automatic Speech Recognition (ASR) has achieved remarkable success with deep learning, driving advancements in conversational artificial intelligence, media transcription, and assistive technologies. However, ASR systems still struggle in complex environments such as TV series, where multiple speakers, overlapping speech, domain-specific terminology, and long-range contextual dependencies pose significant challenges to transcription accuracy. Existing approaches fail to explicitly leverage the rich temporal and contextual information available in the video. To address this limitation, we propose a Video-Guided Post-ASR Correction (VPC) framework that uses a Video-Large Multimodal Model (VLMM) to capture video context and refine ASR outputs. Evaluations on a TV-series benchmark show that our method consistently improves transcription accuracy in complex multimedia environments.
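
The abstract describes the framework only at a high level: video context is captured by a VLMM and used to refine ASR outputs. As a rough illustration of that data flow, and not the authors' implementation, the following minimal sketch wires the two stages together through plain callables. Every name here (`Segment`, `clip_path`, `describe_video`, `correct_text`, `build_prompt`) is hypothetical, as is the prompt wording. The abstract does not name its accuracy metric, so the word error rate (WER) helper is an assumption based on common ASR practice.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Segment:
    """One utterance: a raw ASR hypothesis plus a pointer to its video clip."""
    asr_text: str
    clip_path: str  # hypothetical field; the abstract does not describe the data layout


def build_prompt(video_context: str, asr_text: str) -> str:
    """Combine video-derived context and the ASR hypothesis into one correction prompt.

    The wording is an assumption made for illustration only.
    """
    return (
        "Video context:\n" + video_context + "\n\n"
        "ASR transcript (may contain errors):\n" + asr_text + "\n\n"
        "Return only the corrected transcript."
    )


def correct_segments(
    segments: Iterable[Segment],
    describe_video: Callable[[str], str],  # stand-in for a VLMM call: clip path -> context text
    correct_text: Callable[[str], str],    # stand-in for the correction model: prompt -> transcript
) -> List[str]:
    """Run post-ASR correction segment by segment and return the corrected transcripts."""
    corrected = []
    for seg in segments:
        context = describe_video(seg.clip_path)
        corrected.append(correct_text(build_prompt(context, seg.asr_text)))
    return corrected


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)
```

In practice, `describe_video` and `correct_text` would wrap whatever video-language and text models are available. Keeping them as injected callables lets the pipeline be tested with simple stubs, and comparing `word_error_rate` before and after correction gives one way to check whether the video context actually helps.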
