CV | CL | LG | Jul 27, 2020

Active Learning for Video Description With Cluster-Regularized Ensemble Ranking

arXiv:2007.13913v3
6 citations
Originality: Incremental advance
AI Analysis

This work addresses the slow and expensive annotation process for video captioning, offering an incremental improvement in active learning efficiency for this domain-specific task.

The paper tackles the problem of reducing manual annotation costs for video captioning by proposing a cluster-regularized ensemble active learning strategy, which achieves high performance using up to 60% less training data than strong baselines on the MSR-VTT and LSMDC datasets.

Automatic video captioning aims to train models to generate text descriptions for all segments in a video. However, the most effective approaches require large amounts of manual annotation, which is slow and expensive. Active learning is a promising way to efficiently build a training set for video captioning tasks while reducing the need to manually label uninformative examples. In this work we explore various active learning approaches for automatic video captioning and show that a cluster-regularized ensemble strategy provides the best active learning approach for efficiently gathering training sets for video captioning. We evaluate our approaches on the MSR-VTT and LSMDC datasets using both transformer and LSTM-based captioning models and show that our novel strategy can achieve high performance while using up to 60% less training data than strong state-of-the-art baselines.
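
The abstract names the strategy but does not spell out the selection rule, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes that ensemble uncertainty is measured as the variance of per-sample scores across ensemble members, and that "cluster regularization" means spreading the selected samples across precomputed clusters in round-robin fashion. The function name `select_batch`, its parameters, and the toy data are all hypothetical.

```python
import numpy as np


def select_batch(scores, cluster_ids, budget):
    """Pick up to `budget` unlabeled samples, spreading picks across clusters.

    scores:      (n_models, n_samples) array, one row per ensemble member
                 (hypothetical per-sample scores, e.g. caption log-likelihoods).
    cluster_ids: (n_samples,) integer cluster label per sample (assumed given).
    """
    scores = np.asarray(scores, dtype=float)
    cluster_ids = np.asarray(cluster_ids)

    # Assumed uncertainty measure: variance of the members' scores per sample.
    disagreement = scores.var(axis=0)

    # For each cluster, queue its samples from most to least uncertain.
    queues = []
    for c in np.unique(cluster_ids):
        members = np.flatnonzero(cluster_ids == c)
        order = members[np.argsort(-disagreement[members], kind="stable")]
        queues.append(list(order))

    # Round-robin over clusters so no single cluster dominates the batch.
    selected = []
    while len(selected) < budget and any(queues):
        # Visit clusters in order of their current most-uncertain sample;
        # exhausted clusters sort last.
        queues.sort(key=lambda q: -disagreement[q[0]] if q else np.inf)
        for q in queues:
            if q and len(selected) < budget:
                selected.append(int(q.pop(0)))
    return selected


# Toy data: two ensemble members, five samples, two clusters.
scores = [[0.0, 0.0, 0.0, 0.0, 0.0],
          [4.0, 2.0, 1.8, 1.0, 0.1]]
cluster_ids = [0, 0, 0, 1, 1]
print(select_batch(scores, cluster_ids, budget=3))  # [0, 3, 1]
```

In this toy case, ranking by uncertainty alone would choose samples 0, 1, and 2, all from cluster 0. The round-robin step swaps one of them for sample 3 from cluster 1, which is the kind of diversity a cluster-based regularizer is meant to encourage.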
