CV, CL · Oct 29, 2025

More than a Moment: Towards Coherent Sequences of Audio Descriptions

Eshika Khandelwal, Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani, Andrew Zisserman, Gül Varol, Makarand Tapaswi

arXiv:2510.25440v1 · h-index: 24

Originality: Incremental advance

AI Analysis

This work addresses the need for better accessibility for visually impaired audiences by improving the coherence of audio descriptions for videos. The advance is incremental, as it builds on existing generation methods.

The paper tackles the problem of generating coherent sequences of audio descriptions (ADs) for videos; existing methods often produce repetitive, isolated descriptions. The authors propose CoherentAD, a training-free method that generates multiple candidate descriptions for each AD time interval and then selects among them auto-regressively across the sequence. They also introduce a sequence-level metric, StoryRecall. They report improved narrative understanding and better results than prior approaches that generate each AD independently.

Audio Descriptions (ADs) convey essential on-screen information, allowing visually impaired audiences to follow videos. To be effective, ADs must form a coherent sequence that helps listeners to visualise the unfolding scene, rather than describing isolated moments. However, most automatic methods generate each AD independently, often resulting in repetitive, incoherent descriptions. To address this, we propose a training-free method, CoherentAD, that first generates multiple candidate descriptions for each AD time interval, and then performs auto-regressive selection across the sequence to form a coherent and informative narrative. To evaluate AD sequences holistically, we introduce a sequence-level metric, StoryRecall, which measures how well the predicted ADs convey the ground truth narrative, alongside repetition metrics that capture the redundancy across consecutive AD outputs. Our method produces coherent AD sequences with enhanced narrative understanding, outperforming prior approaches that rely on independent generations.
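The two-stage idea in the abstract (several candidates per interval, then auto-regressive selection across the sequence) can be illustrated with a minimal sketch. The abstract does not say how candidates are scored during selection, so everything named here is an assumption for illustration only: `select_sequence`, `toy_score`, and `jaccard_overlap` are hypothetical helpers, and the word-overlap penalty is a stand-in rather than the selection procedure used by CoherentAD.

```python
from typing import Callable, List, Sequence


def jaccard_overlap(a: str, b: str) -> float:
    """Toy redundancy measure: word-set overlap between two descriptions."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def toy_score(candidate: str, history: Sequence[str]) -> float:
    """Hypothetical scorer (an assumption, not the paper's):
    penalise overlap with the two most recently selected descriptions."""
    if not history:
        return 0.0
    return -max(jaccard_overlap(candidate, prev) for prev in history[-2:])


def select_sequence(
    candidates_per_interval: Sequence[Sequence[str]],
    score: Callable[[str, Sequence[str]], float] = toy_score,
) -> List[str]:
    """Pick one candidate per interval, conditioning each choice on
    the descriptions already selected for earlier intervals."""
    selected: List[str] = []
    for candidates in candidates_per_interval:
        if not candidates:
            raise ValueError("each interval needs at least one candidate")
        # Auto-regressive step: the score depends on the history so far.
        best = max(candidates, key=lambda c: score(c, selected))
        selected.append(best)
    return selected


if __name__ == "__main__":
    candidates = [
        ["A man enters the kitchen.", "A man walks in."],
        ["A man enters the kitchen.", "He opens the fridge and takes out milk."],
    ]
    for line in select_sequence(candidates):
        print(line)
```

In this toy run, the second interval's first candidate repeats the earlier selection verbatim, so the overlap penalty steers the choice to the fridge description. Swapping `score` for a different scorer changes the selection criterion without touching the loop.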
