CV · Oct 10, 2023

AutoAD II: The Sequel -- Who, When, and What in Movie Audio Description

arXiv:2310.06838v1 · 56 citations · h-index: 50
Originality: Incremental advance
AI Analysis

This work addresses the need for accessible movie content for visually impaired audiences, though it appears incremental as it builds on prior methods.

The paper tackles automatic generation of audio descriptions for movies. It addresses three challenges: naming characters, timing descriptions to fall within dialogue pauses, and generating relevant content. It reports improvements over previous architectures in an apples-to-apples comparison.

Audio Description (AD) is the task of generating descriptions of visual content, at suitable time intervals, for the benefit of visually impaired audiences. For movies, this presents notable challenges -- AD must occur only during existing pauses in dialogue, should refer to characters by name, and ought to aid understanding of the storyline as a whole. To this end, we develop a new model for automatically generating movie AD, given CLIP visual features of the frames, the cast list, and the temporal locations of the speech; addressing all three of the 'who', 'when', and 'what' questions: (i) who -- we introduce a character bank consisting of the character's name, the actor that played the part, and a CLIP feature of their face, for the principal cast of each movie, and demonstrate how this can be used to improve naming in the generated AD; (ii) when -- we investigate several models for determining whether an AD should be generated for a time interval or not, based on the visual content of the interval and its neighbours; and (iii) what -- we implement a new vision-language model for this task, that can ingest the proposals from the character bank, whilst conditioning on the visual features using cross-attention, and demonstrate how this improves over previous architectures for AD text generation in an apples-to-apples comparison.
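To make the 'who' and 'when' inputs concrete, here is a minimal sketch of a character bank and of locating dialogue pauses from speech timestamps. It is illustrative only and not the authors' implementation: the names `CharacterEntry`, `propose_characters`, and `dialogue_gaps`, the cosine-similarity matching rule, the thresholds, and the toy 3-dimensional vectors standing in for CLIP features are all assumptions. The abstract describes learned models for both steps, which this sketch does not reproduce.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Sequence, Tuple


@dataclass
class CharacterEntry:
    """One character-bank entry: character name, actor, and a face feature."""
    character: str
    actor: str
    face_feature: List[float]  # toy stand-in for a CLIP face embedding


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def propose_characters(bank: Sequence[CharacterEntry],
                       frame_features: Sequence[Sequence[float]],
                       threshold: float = 0.5) -> List[str]:
    """Assumed matching rule: propose a character if its face feature is
    similar enough to at least one frame feature; best matches come first."""
    scored = []
    for entry in bank:
        best = max((cosine(entry.face_feature, f) for f in frame_features),
                   default=0.0)
        if best >= threshold:
            scored.append((entry.character, best))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in scored]


def dialogue_gaps(speech: Sequence[Tuple[float, float]],
                  clip_end: float,
                  min_gap: float = 1.0) -> List[Tuple[float, float]]:
    """Return (start, end) pauses of at least min_gap seconds between
    speech intervals, i.e. candidate slots where AD could be placed."""
    gaps = []
    cursor = 0.0  # end time of the latest speech seen so far
    for start, end in sorted(speech):
        if start - cursor >= min_gap:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if clip_end - cursor >= min_gap:
        gaps.append((cursor, clip_end))
    return gaps


# Hypothetical data for illustration only.
bank = [
    CharacterEntry("Alice", "Actor A", [1.0, 0.0, 0.0]),
    CharacterEntry("Bob", "Actor B", [0.0, 1.0, 0.0]),
]
frames = [[0.9, 0.1, 0.0], [0.8, 0.0, 0.2]]
names = propose_characters(bank, frames)
gaps = dialogue_gaps([(0.0, 2.0), (2.5, 6.0), (9.0, 12.0)], clip_end=15.0)
```

With this toy data, only "Alice" clears the similarity threshold, and the candidate AD slots are the pauses from 6 s to 9 s and from 12 s to 15 s. In the setup described above, the proposed names would be passed to the text-generation model, and a separate model would decide whether each candidate interval actually warrants a description.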

