CL, CV, LG | May 11, 2020

MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning

arXiv:2005.05402v1 | 1033 citations | Has Code
AI Analysis

This work addresses the challenge of producing discourse-coherent video captions for applications such as accessibility and content indexing, though the contribution is incremental because it builds on existing transformer architectures.

The paper tackles the problem of generating coherent multi-sentence video descriptions by proposing MART, a memory-augmented transformer that improves coherence and reduces repetition in paragraph captions on the ActivityNet Captions and YouCookII datasets.

Generating multi-sentence descriptions for videos is one of the most challenging captioning tasks due to its high requirements for not only visual relevance but also discourse-based coherence across the sentences in the paragraph. Towards this goal, we propose a new approach called Memory-Augmented Recurrent Transformer (MART), which uses a memory module to augment the transformer architecture. The memory module generates a highly summarized memory state from the video segments and the sentence history so as to help better prediction of the next sentence (w.r.t. coreference and repetition aspects), thus encouraging coherent paragraph generation. Extensive experiments, human evaluations, and qualitative analyses on two popular datasets ActivityNet Captions and YouCookII show that MART generates more coherent and less repetitive paragraph captions than baseline methods, while maintaining relevance to the input video events. All code is available open-source at: https://github.com/jayleicn/recurrent-transformer
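The abstract describes a memory state that is carried from one video segment and sentence to the next, but it does not give the update equations. The sketch below illustrates the general idea only, assuming a GRU-style gated update driven by single-head attention over the current segment's hidden states. It is not the authors' implementation: the function names (`attend`, `update_memory`, `run_paragraph`) are hypothetical, and the learned projection matrices a real model would use are omitted for brevity.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys):
    """Single-head scaled dot-product attention; keys double as values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(keys[0])
    return [sum(w * key[d] for w, key in zip(weights, keys)) for d in range(dim)]

def update_memory(memory, hidden, gate_bias=0.0):
    """Gated update of each memory slot from one segment's hidden states.

    memory: list of slots, each a list of floats
    hidden: list of hidden-state vectors for the current segment
    """
    new_memory = []
    for slot in memory:
        # Summarize the current segment from this slot's point of view.
        summary = attend(slot, hidden)
        # Assumption: parameter-free candidate and gate (no learned weights).
        candidate = [math.tanh(m + s) for m, s in zip(slot, summary)]
        gate = [1.0 / (1.0 + math.exp(-(m * s + gate_bias)))
                for m, s in zip(slot, summary)]
        # Keep part of the old memory and blend in the new candidate.
        new_memory.append([g * m + (1.0 - g) * c
                           for g, m, c in zip(gate, slot, candidate)])
    return new_memory

def run_paragraph(segments, num_slots=2, dim=4):
    """Carry the memory across all segments of one video, in order."""
    memory = [[0.0] * dim for _ in range(num_slots)]
    for hidden in segments:
        memory = update_memory(memory, hidden)
    return memory
```

In this sketch, `run_paragraph` would be called with one list of hidden-state vectors per video segment, and the returned memory would condition the generation of the next sentence. The gate controls how much of the earlier history is retained, which is the mechanism the abstract credits for better coreference and less repetition.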

Code Implementations: 1 repo
