CV | Jul 22, 2024

AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description

Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman

arXiv:2407.15850v2 | 15.825 citations | h-index: 50 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for accessible audio descriptions in media for visually impaired users, though the advance is incremental because it builds on existing models.

The paper tackles the problem of generating audio descriptions for movies and TV series without training. It uses off-the-shelf visual-language models and large language models with prompting strategies, achieving state-of-the-art CRITIC scores and performance competitive with some fine-tuned models.

Our objective is to generate Audio Descriptions (ADs) for both movies and TV series in a training-free manner. We use the power of off-the-shelf Visual-Language Models (VLMs) and Large Language Models (LLMs), and develop visual and text prompting strategies for this task. Our contributions are three-fold: (i) We demonstrate that a VLM can successfully name and refer to characters if directly prompted with character information through visual indications without requiring any fine-tuning; (ii) A two-stage process is developed to generate ADs, with the first stage asking the VLM to comprehensively describe the video, followed by a second stage utilising an LLM to summarise dense textual information into one succinct AD sentence; (iii) A new dataset for TV audio description is formulated. Our approach, named AutoAD-Zero, demonstrates outstanding performance (even competitive with some models fine-tuned on ground truth ADs) in AD generation for both movies and TV series, achieving state-of-the-art CRITIC scores.
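
The two-stage process described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Character` class, the prompt wording, the function names, and the `vlm` / `llm` callables are all hypothetical placeholders standing in for whichever off-the-shelf models are used.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical interfaces: any callable with these shapes can be plugged in.
VLM = Callable[[Sequence[str], str], str]  # (frame paths, prompt) -> dense description
LLM = Callable[[str], str]                 # prompt -> completion


@dataclass
class Character:
    name: str
    marker: str  # assumed visual indication drawn on the frames, e.g. "a red circle"


def build_stage1_prompt(characters: Sequence[Character]) -> str:
    """Stage 1 prompt: ask for a comprehensive description, naming marked characters."""
    legend = "; ".join(f"{c.name} is marked with {c.marker}" for c in characters)
    return (
        "Describe the video clip in detail, referring to characters by name. "
        f"Character legend: {legend}."
    )


def build_stage2_prompt(dense_description: str) -> str:
    """Stage 2 prompt: ask for one succinct AD sentence from the dense text."""
    return (
        "Summarise the following description into one succinct "
        f"audio description sentence:\n{dense_description}"
    )


def generate_ad(
    frames: Sequence[str],
    characters: Sequence[Character],
    vlm: VLM,
    llm: LLM,
) -> str:
    """Run the VLM for a dense description, then the LLM for a one-sentence AD."""
    dense = vlm(frames, build_stage1_prompt(characters))
    return llm(build_stage2_prompt(dense)).strip()
```

Passing the models in as plain callables keeps the sketch training-free and model-agnostic: swapping the VLM or LLM only means supplying a different function, with no change to the pipeline itself.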

