CV, CL, LG | Dec 11, 2022

MAViC: Multimodal Active Learning for Video Captioning

arXiv:2212.11109v1 | 3 citations | h-index: 16
Originality: Incremental advance
AI Analysis

This work addresses the problem of reducing annotation costs for video captioning. The advance is incremental: it adapts active learning to a multimodal, sequential output domain.

The paper tackles the high annotation cost of video captioning by proposing MAViC, a multimodal active learning method whose acquisition function (M-SASE) selects informative samples for annotation. The authors report that it improves on the baselines by a large margin.

A large number of annotated video-caption pairs are required for training video captioning models, resulting in high annotation costs. Active learning can be instrumental in reducing these annotation requirements. However, active learning for video captioning is challenging because multiple semantically similar captions are valid for a video, resulting in high-entropy outputs even for less-informative samples. Moreover, video captioning algorithms are multimodal in nature, with a visual encoder and a language decoder. Further, the sequential and combinatorial nature of the output makes the problem even more challenging. In this paper, we introduce MAViC, which leverages our proposed Multimodal Semantics Aware Sequential Entropy (M-SASE) based acquisition function to address the challenges of active learning approaches for video captioning. Our approach integrates semantic similarity and uncertainty of both visual and language dimensions in the acquisition function. Our detailed experiments empirically demonstrate the efficacy of M-SASE for active learning for video captioning and show that it improves on the baselines by a large margin.
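The abstract does not give the M-SASE formula, so the following is only a minimal sketch of the underlying intuition: plain entropy over sampled captions is high even when the captions are paraphrases, while entropy computed after merging semantically similar captions is not. Everything here is an assumption made for illustration: the `jaccard` token overlap stands in for a learned semantic similarity, the `threshold` value and the greedy clustering are hypothetical, and the visual modality and the token-by-token sequential structure are ignored entirely.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a normalized distribution."""
    return sum(-p * math.log(p) for p in probs if p > 0)

def jaccard(a, b):
    """Token-overlap similarity; an assumed stand-in for semantic similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def semantic_entropy(captions, probs, threshold=0.5):
    """Entropy over clusters of similar captions (hypothetical helper).

    Each caption joins the first cluster whose representative it resembles
    (similarity >= threshold); otherwise it starts a new cluster.
    """
    total = sum(probs)
    clusters = []  # each entry: [representative caption, probability mass]
    for cap, p in zip(captions, probs):
        for cluster in clusters:
            if jaccard(cap, cluster[0]) >= threshold:
                cluster[1] += p / total
                break
        else:
            clusters.append([cap, p / total])
    return entropy([mass for _, mass in clusters])

# Three paraphrases of one description vs. three unrelated descriptions.
paraphrases = [
    "a man is playing a guitar",
    "a man is playing the guitar",
    "a man plays a guitar",
]
distinct = [
    "a man is playing a guitar",
    "a dog runs on the beach",
    "two women are cooking food",
]
uniform = [1 / 3, 1 / 3, 1 / 3]

print("plain entropy (either set):", entropy(uniform))
print("semantic entropy, paraphrases:", semantic_entropy(paraphrases, uniform))
print("semantic entropy, distinct:", semantic_entropy(distinct, uniform))
```

Under these assumptions the paraphrase set collapses into a single cluster, so its semantics-aware entropy is near zero even though its plain entropy equals that of the genuinely ambiguous set. An acquisition function built on the clustered score would therefore rank the second video as more informative to annotate.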

