CV · Mar 31, 2025

DANTE-AD: Dual-Vision Attention Network for Long-Term Audio Description

arXiv:2503.24096v1 · 1 citation · h-index: 1
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of providing accessible visual storytelling for vision-impaired audiences and represents an incremental improvement over prior methods.

The paper tackles the problem of generating coherent long-term audio descriptions for videos, which existing methods struggle with because they lack contextual information across scenes. It introduces DANTE-AD, a dual-vision Transformer-based model that outperforms existing methods on traditional NLP metrics and LLM-based evaluations.

Audio Description is a narrated commentary designed to aid vision-impaired audiences in perceiving key visual elements in a video. While short-form video understanding has advanced rapidly, a solution for maintaining coherent long-term visual storytelling remains unresolved. Existing methods rely solely on frame-level embeddings, effectively describing object-based content but lacking contextual information across scenes. We introduce DANTE-AD, an enhanced video description model leveraging a dual-vision Transformer-based architecture to address this gap. DANTE-AD sequentially fuses both frame and scene level embeddings to improve long-term contextual understanding. We propose a novel, state-of-the-art method for sequential cross-attention to achieve contextual grounding for fine-grained audio description generation. Evaluated on a broad range of key scenes from well-known movie clips, DANTE-AD outperforms existing methods across traditional NLP metrics and LLM-based evaluations.
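
The sequential fusion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the frame-then-scene ordering, the residual connections, and the omission of learned projection matrices are all assumptions made for illustration. A set of query vectors first attends to frame-level embeddings, and the result then attends to scene-level embeddings.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Scaled dot-product cross-attention.

    Assumption: keys and values are the raw context vectors; a real model
    would apply learned query/key/value projections first.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every context vector.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in context
        ]
        weights = softmax(scores)
        # Weighted average of the context vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, context)) for j in range(dim)]
        )
    return outputs


def add_residual(hidden, update):
    """Element-wise residual connection."""
    return [[h + u for h, u in zip(hv, uv)] for hv, uv in zip(hidden, update)]


def sequential_fusion(queries, frame_embeddings, scene_embeddings):
    """Hypothetical two-stage fusion: frame-level first, then scene-level."""
    hidden = add_residual(queries, cross_attention(queries, frame_embeddings))
    hidden = add_residual(hidden, cross_attention(hidden, scene_embeddings))
    return hidden


# Toy inputs: two queries, three frame embeddings, one scene embedding.
queries = [[1.0, 0.0], [0.0, 1.0]]
frames = [[0.2, 0.1], [0.4, 0.3], [0.1, 0.9]]
scenes = [[2.0, -1.0]]
fused = sequential_fusion(queries, frames, scenes)
```

Because the second stage operates on the output of the first, the scene-level context is applied to representations that already carry frame-level detail, which is the intuition behind fusing the two sequentially rather than concatenating them.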
