CL CV Oct 12, 2020

VMSMO: Learning to Generate Multimodal Summary for Video-based News Articles

arXiv:2010.05406v11002 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for automated cover frame selection and textual summarization to save time for editors and improve decision-making for readers in multimedia news, representing a domain-specific incremental advancement.

The paper tackles the problem of generating multimodal summaries for video-based news articles by proposing the VMSMO task and a Dual-Interaction-based Multimodal Summarizer (DIMS), which achieves state-of-the-art performance on a large-scale real-world dataset.

A popular multimedia news format today pairs a lively video with a corresponding news article; it is used by influential news media such as CNN and BBC and by social media platforms such as Twitter and Weibo. In such cases, automatically choosing a proper cover frame for the video and generating an appropriate textual summary of the article can help editors save time and help readers make decisions more effectively. Hence, in this paper, we propose the task of Video-based Multimodal Summarization with Multimodal Output (VMSMO) to tackle this problem. The main challenge in this task is to jointly model the temporal dependency of the video and the semantic meaning of the article. To this end, we propose a Dual-Interaction-based Multimodal Summarizer (DIMS), consisting of a dual interaction module and a multimodal generator. In the dual interaction module, we propose a conditional self-attention mechanism that captures local semantic information within the video and a global-attention mechanism that handles the semantic relationship between the news text and the video at a high level. Extensive experiments conducted on a large-scale real-world VMSMO dataset show that DIMS achieves state-of-the-art performance in terms of both automatic metrics and human evaluations.
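The two attention mechanisms named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' DIMS implementation: the abstract gives no equations, so the gating formula, the mean-pooled article vector, and all function names (`conditional_self_attention`, `global_attention`, `pick_cover_frame`) are assumptions chosen only to show the general idea of frame-level self-attention conditioned on the article, followed by a text-to-video relevance score used to pick a cover frame.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def conditional_self_attention(frames, article_vec):
    """Self-attention over frames, gated by an article vector (assumed form).

    frames: (T, d) frame features; article_vec: (d,) article summary vector.
    Returns (T, d) article-conditioned frame features.
    """
    d = frames.shape[1]
    # Scaled dot-product self-attention among the video frames.
    weights = softmax(frames @ frames.T / np.sqrt(d), axis=-1)  # (T, T)
    attended = weights @ frames                                  # (T, d)
    # Hypothetical conditioning: a sigmoid gate on frame-article similarity.
    gate = 1.0 / (1.0 + np.exp(-(attended @ article_vec) / np.sqrt(d)))
    return attended * gate[:, None]


def global_attention(frames, article_tokens):
    """Score each frame against the whole article (assumed form).

    frames: (T, d); article_tokens: (N, d). Returns (T,) scores summing to 1.
    """
    d = frames.shape[1]
    sim = frames @ article_tokens.T / np.sqrt(d)  # (T, N) frame-token similarity
    return softmax(sim.mean(axis=1))              # average over tokens, normalize


def pick_cover_frame(frames, article_tokens):
    # Mean pooling stands in for a learned article encoder (assumption).
    article_vec = article_tokens.mean(axis=0)
    local = conditional_self_attention(frames, article_vec)
    scores = global_attention(local, article_tokens)
    return int(np.argmax(scores)), scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(5, 8))   # 5 frames, 8-dim features (toy data)
    tokens = rng.normal(size=(7, 8))   # 7 article tokens (toy data)
    index, scores = pick_cover_frame(frames, tokens)
    print("cover frame index:", index)
    print("frame scores:", np.round(scores, 3))
```

In a trained model the projections, gate, and article encoder would be learned parameters; here they are fixed so the data flow stays visible.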

Code Implementations: 1 repo
