CV · Nov 17, 2016

Multimodal Memory Modelling for Video Captioning

arXiv:1611.05592v1 · 150 citations
Originality: Incremental advance
AI Analysis

This work addresses video captioning for computer vision applications, presenting an incremental improvement over existing methods.

The paper tackles the challenge of mapping visual sequences to language in video captioning by proposing a Multimodal Memory Model (M3) that uses a shared memory to model long-term visual-textual dependencies, resulting in improved performance on benchmark datasets like MSVD and MSR-VTT with higher BLEU and METEOR scores.

Video captioning, which automatically translates video clips into natural language sentences, is a very important task in computer vision. By virtue of recent deep learning technologies, e.g., convolutional neural networks (CNNs) and recurrent neural networks (RNNs), video captioning has made great progress. However, learning an effective mapping from visual sequence space to language space is still a challenging problem. In this paper, we propose a Multimodal Memory Model (M3) to describe videos, which builds a visual and textual shared memory to model the long-term visual-textual dependency and further guide global visual attention on described targets. Specifically, the proposed M3 attaches an external memory to store and retrieve both visual and textual contents by interacting with the video and the sentence through multiple read and write operations. First, the text representation in the Long Short-Term Memory (LSTM) based text decoder is written into the memory, and the memory contents are read out to guide an attention mechanism that selects related visual targets. Then, the selected visual information is written into the memory, which is further read out to the text decoder. To evaluate the proposed model, we perform experiments on two public benchmark datasets: MSVD and MSR-VTT. The experimental results demonstrate that our method outperforms the state-of-the-art methods in terms of BLEU and METEOR.
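The write/read cycle described in the abstract can be illustrated with a small NumPy sketch of one decoding step. This is not the authors' implementation. It assumes content-based addressing by cosine similarity, a simple convex-blend write rule, and that text states, frame features, and memory slots all share one dimension so that no learned projections are needed. The names `address`, `read`, `write`, and `m3_step` are hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def address(memory, key):
    # Assumed content-based addressing: softmax over the cosine
    # similarity between the key and every memory slot.
    mem_norm = np.linalg.norm(memory, axis=1) + 1e-8
    key_norm = np.linalg.norm(key) + 1e-8
    return softmax(memory @ key / (mem_norm * key_norm))

def read(memory, key):
    # Weighted sum of memory slots under the addressing weights.
    return address(memory, key) @ memory

def write(memory, key, content):
    # Assumed write rule: blend each slot toward the new content
    # in proportion to its addressing weight.
    w = address(memory, key)[:, None]
    return (1.0 - w) * memory + w * content[None, :]

def m3_step(memory, h_text, frames):
    # 1. Write the text decoder state into the shared memory.
    memory = write(memory, h_text, h_text)
    # 2. Read the memory and use the result to attend over frames.
    guide = read(memory, h_text)
    alpha = softmax(frames @ guide)
    visual = alpha @ frames
    # 3. Write the attended visual information into the memory.
    memory = write(memory, visual, visual)
    # 4. Read the memory again; this vector would feed the text decoder.
    to_decoder = read(memory, h_text)
    return memory, alpha, to_decoder

# Toy inputs: 8 memory slots, 5 frames, feature dimension 16.
rng = np.random.default_rng(0)
memory = rng.normal(size=(8, 16))
h_text = rng.normal(size=16)
frames = rng.normal(size=(5, 16))
memory, alpha, to_decoder = m3_step(memory, h_text, frames)
```

In a trained model the keys, write contents, and attention scores would come from learned projections rather than the raw vectors used here. The sketch only shows the order of operations: text write, memory-guided attention, visual write, and a final read for the decoder.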

