CV, CL, LG, MM
Jul 23, 2020

SBAT: Video Captioning with Sparse Boundary-Aware Transformer

arXiv:2007.11888v1
61 citations
Originality: Incremental advance
AI Analysis

This work addresses transformer-based video captioning and is assessed as an incremental improvement over existing methods.

The paper tackles the problem of applying transformers to video captioning, where features from neighbouring time steps are highly redundant. It proposes SBAT, which reduces that redundancy and improves multimodal interaction, and reports results that outperform prior state-of-the-art methods on most metrics across two benchmark datasets.

In this paper, we focus on the problem of applying the transformer structure to video captioning effectively. The vanilla transformer was proposed for uni-modal language generation tasks such as machine translation. However, video captioning is a multimodal learning problem, and the video features have much redundancy between different time steps. Based on these concerns, we propose a novel method called sparse boundary-aware transformer (SBAT) to reduce the redundancy in video representation. SBAT applies a boundary-aware pooling operation to the scores from multi-head attention and selects diverse features from different scenarios. SBAT also includes a local correlation scheme to compensate for the local information loss caused by the sparse operation. Based on SBAT, we further propose an aligned cross-modal encoding scheme to boost the multimodal interaction. Experimental results on two benchmark datasets show that SBAT outperforms the state-of-the-art methods on most of the metrics.
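The idea of pooling attention scores at boundaries and keeping only a sparse set of time steps can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the formulas, so the boundary measure (absolute change of a score relative to the previous time step), the top-k selection, the rule that the first time step always counts as a boundary, and the names `boundary_scores` and `sparse_boundary_attention` are all assumptions made for illustration.

```python
import math


def boundary_scores(row):
    """Absolute change of each attention score relative to the previous time step.

    Assumption: the first time step is always treated as a boundary (score = inf).
    """
    return [math.inf] + [abs(row[j] - row[j - 1]) for j in range(1, len(row))]


def sparse_boundary_attention(scores, k):
    """Keep the k time steps with the largest boundary scores in each row,
    then apply a softmax over the kept attention scores only.

    scores: list of rows (one per query), each a list of raw attention scores.
    Returns attention weights of the same shape; dropped time steps get 0.0.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    weights = []
    for row in scores:
        b = boundary_scores(row)
        # Indices of the k largest boundary scores; ties go to the earlier step.
        keep = set(sorted(range(len(row)), key=lambda j: (-b[j], j))[:k])
        # Numerically stable softmax restricted to the kept positions.
        m = max(row[j] for j in keep)
        exps = [math.exp(row[j] - m) if j in keep else 0.0 for j in range(len(row))]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Hypothetical scores for one query over six frames: two flat "scenes".
example = [[1.0, 1.0, 1.0, 5.0, 5.0, 5.0]]
w = sparse_boundary_attention(example, k=2)
print([j for j, x in enumerate(w[0]) if x > 0])  # [0, 3]
```

In this toy input the frames within each scene have identical scores, so only the first frame of each scene survives. That mirrors the motivation stated in the abstract: attention is spent on distinct scenarios instead of on near-duplicate neighbouring frames.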

