CV · Jul 22, 2024

AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description

arXiv:2407.15850v2 · 25 citations · h-index: 50
Originality: Incremental advance
AI Analysis

This work addresses the need for accessible audio descriptions of media for visually impaired users, though the advance is incremental because it builds on existing models.

The paper tackles training-free generation of audio descriptions for movies and TV series, using off-the-shelf visual-language models and large language models with visual and text prompting strategies. It reports state-of-the-art CRITIC scores and performance competitive with some models fine-tuned on ground-truth ADs.

Our objective is to generate Audio Descriptions (ADs) for both movies and TV series in a training-free manner. We use the power of off-the-shelf Visual-Language Models (VLMs) and Large Language Models (LLMs), and develop visual and text prompting strategies for this task. Our contributions are three-fold: (i) We demonstrate that a VLM can successfully name and refer to characters if directly prompted with character information through visual indications without requiring any fine-tuning; (ii) A two-stage process is developed to generate ADs, with the first stage asking the VLM to comprehensively describe the video, followed by a second stage utilising an LLM to summarise dense textual information into one succinct AD sentence; (iii) A new dataset for TV audio description is formulated. Our approach, named AutoAD-Zero, demonstrates outstanding performance (even competitive with some models fine-tuned on ground truth ADs) in AD generation for both movies and TV series, achieving state-of-the-art CRITIC scores.
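The two-stage process described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the `Character` type, the coloured-marker convention for visual indications, and the `vlm` / `llm` callables are all hypothetical placeholders standing in for whichever off-the-shelf models are used.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Character:
    """Hypothetical record pairing a character name with a visual marker."""
    name: str
    color: str  # assumed: colour of the marker drawn on this character in the frames


def build_vlm_prompt(characters: Sequence[Character]) -> str:
    """Stage 1 prompt: map visual markers to names, then ask for a dense description."""
    if characters:
        legend = "; ".join(f"{c.name} ({c.color} marker)" for c in characters)
        cast = f"Possible characters, labelled by coloured markers: {legend}. "
    else:
        cast = "No characters are labelled. "
    return cast + "Describe in detail who is present, what they do, and the setting."


def build_llm_prompt(dense_description: str) -> str:
    """Stage 2 prompt: ask for a single succinct AD sentence."""
    return (
        "Summarise the following video description into one succinct "
        "audio description sentence.\n\n"
        f"Description: {dense_description}"
    )


def generate_ad(
    frames: Sequence[object],
    characters: Sequence[Character],
    vlm: Callable[[Sequence[object], str], str],  # placeholder for any VLM wrapper
    llm: Callable[[str], str],                    # placeholder for any LLM wrapper
) -> str:
    """Run both stages with no training: dense VLM description, then LLM summary."""
    dense = vlm(frames, build_vlm_prompt(characters))
    return llm(build_llm_prompt(dense)).strip()
```

Passing the models in as plain callables keeps the sketch independent of any particular model API; in practice each callable would wrap a real VLM or LLM client.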

Code Implementations: 1 repo