CV · May 8, 2019

Multimodal Semantic Attention Network for Video Captioning

arXiv:1905.02963v1 · 11 citations
Originality: Incremental advance
AI Analysis

This work addresses video captioning and presents an incremental improvement over existing methods.

The paper tackles video captioning by proposing a Multimodal Semantic Attention Network (MSAN) that integrates multimodal semantic attributes, achieving competitive results on MSVD and MSR-VTT benchmarks across six evaluation metrics.

Inspired by the fact that different modalities in videos carry complementary information, we propose a Multimodal Semantic Attention Network (MSAN), a new encoder-decoder framework that incorporates multimodal semantic attributes for video captioning. In the encoding phase, we detect and generate multimodal semantic attributes by formulating this task as a multi-label classification problem. Moreover, we add an auxiliary classification loss to our model to obtain more effective visual features and high-level multimodal semantic attribute distributions for sufficient video encoding. In the decoding phase, we extend each weight matrix of the conventional LSTM to an ensemble of attribute-dependent weight matrices, and employ an attention mechanism to attend to different attributes at each time step of the captioning process. We evaluate our algorithm on two popular public benchmarks, MSVD and MSR-VTT, achieving results competitive with the current state of the art across six evaluation metrics.
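
The two ideas in the abstract, multi-label attribute detection and attribute-dependent decoder weights selected by attention, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the shapes, the dot-product attention score, and the choice to mix the ensemble as a simple weighted sum are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract max for numerical stability
    return e / e.sum()

def attribute_probs(video_feat, W_attr, b_attr):
    """Multi-label detection: one independent sigmoid per attribute.
    video_feat: (D,), W_attr: (K, D), b_attr: (K,) -> probabilities (K,)."""
    return sigmoid(W_attr @ video_feat + b_attr)

def multilabel_bce(probs, targets, eps=1e-8):
    """Binary cross-entropy averaged over attributes.
    Stands in for the auxiliary classification loss (assumed form)."""
    loss = targets * np.log(probs + eps) + (1 - targets) * np.log(1 - probs + eps)
    return float(-np.mean(loss))

def attend_attributes(h_prev, probs, W_query, attr_emb):
    """Attention over K attributes at one decoding step (assumed scoring).
    h_prev: (H,), W_query: (E, H), attr_emb: (K, E) -> weights (K,)."""
    scores = attr_emb @ (W_query @ h_prev)
    scores = scores + np.log(probs + 1e-8)  # favor attributes detected in the video
    return softmax(scores)

def attribute_dependent_weight(W_ensemble, attn):
    """Mix an ensemble of K weight matrices into one step-specific matrix.
    W_ensemble: (K, H, X), attn: (K,) -> (H, X)."""
    return np.tensordot(attn, W_ensemble, axes=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    D, K, H, E, X = 8, 5, 4, 6, 7  # hypothetical sizes
    probs = attribute_probs(rng.normal(size=D), rng.normal(size=(K, D)), np.zeros(K))
    attn = attend_attributes(rng.normal(size=H), probs,
                             rng.normal(size=(E, H)), rng.normal(size=(K, E)))
    W_t = attribute_dependent_weight(rng.normal(size=(K, H, X)), attn)
    print(probs.shape, attn.sum(), W_t.shape)
```

In a full decoder, a matrix such as `W_t` would replace each fixed LSTM weight matrix at the current time step, so the attended attributes change how the input and hidden state are transformed.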

