CV · AI · Jan 30, 2025

MAMS: Model-Agnostic Module Selection Framework for Video Captioning

arXiv:2501.18269v1 · 1 citation · h-index: 1 · AAAI
Originality: Incremental advance
AI Analysis

The paper addresses a critical challenge in video captioning, though the advance is incremental because it builds on existing multi-modal transformer methods.

The paper tackles the problem of choosing how many video frames to use for captioning: too few frames can miss important information, while too many introduce redundancy. It proposes a model-agnostic module selection framework that improves the performance of three recent captioning models on three benchmark datasets.

Multi-modal transformers are rapidly gaining attention in video captioning tasks. Existing multi-modal video captioning methods typically extract a fixed number of frames, which raises critical challenges. When too few frames are extracted, important frames carrying essential information for caption generation may be missed. Conversely, extracting too many frames pulls in consecutive frames, which can make the resulting visual tokens redundant. To extract an appropriate number of frames for each video, this paper proposes the first model-agnostic module selection framework in video captioning that has two main functions: (1) selecting a caption generation module with an appropriate size based on visual tokens extracted from video frames, and (2) constructing subsets of visual tokens for the selected caption generation module. Furthermore, we propose a new adaptive attention masking scheme that enhances attention on important visual tokens. Our experiments on three different benchmark datasets demonstrate that the proposed framework significantly improves the performance of three recent video captioning models.
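
The two functions named in the abstract can be pictured with a rough sketch. This is not the paper's algorithm: the cosine-similarity redundancy test, the `threshold` value, the candidate module sizes `(4, 8, 16)`, and all function names below are assumptions made purely for illustration, and the adaptive attention masking scheme is not covered.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_frames(frame_feats, threshold=0.95):
    """Hypothetical redundancy filter: keep a frame only if it is
    sufficiently different from the most recently kept frame."""
    kept = []
    for i, feat in enumerate(frame_feats):
        if not kept or cosine(frame_feats[kept[-1]], feat) < threshold:
            kept.append(i)
    return kept


def pick_module_size(num_kept, sizes=(4, 8, 16)):
    """Hypothetical step (1): choose the smallest caption module whose
    capacity covers the kept frames, falling back to the largest."""
    for size in sizes:
        if num_kept <= size:
            return size
    return sizes[-1]


def build_subset(kept, size):
    """Hypothetical step (2): if more frames were kept than the module
    accepts, subsample them uniformly down to `size` entries."""
    if len(kept) <= size:
        return list(kept)
    step = len(kept) / size
    return [kept[int(i * step)] for i in range(size)]


# Toy per-frame features (assumed 2-D vectors standing in for visual tokens).
frames = [[1, 0], [1, 0.01], [0, 1], [0, 1.0], [1, 1]]
kept = select_frames(frames)
module_size = pick_module_size(len(kept))
subset = build_subset(kept, module_size)
```

In this toy run, near-duplicate frames are dropped before a module size is chosen, which mirrors the motivation stated in the abstract; how the actual framework scores tokens and sizes its modules is described in the paper itself.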

