CL · IR · Sep 17, 2020

Multi-modal Summarization for Video-containing Documents

arXiv:2009.08018v1 · 27 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for better summarization in multimedia applications like question answering and web search, though it is incremental as it builds on existing multi-modal techniques.

The authors tackled the problem of summarizing documents with associated videos, which existing methods had not addressed, and proposed a baseline model that outperformed existing methods in multi-modal summarization.

Summarization of multimedia data is becoming increasingly significant, as it is the basis for many real-world applications, such as question answering, Web search, and so forth. Most existing multi-modal summarization work, however, uses complementary visual features extracted from images rather than videos, thereby losing abundant information. Hence, we propose a novel multi-modal summarization task: summarizing a document together with its associated video. In this work, we also build a general baseline model with effective strategies, i.e., bi-hop attention and improved late fusion mechanisms to bridge the gap between different modalities, and a bi-stream summarization strategy to perform text and video summarization simultaneously. Comprehensive experiments show that the proposed model is beneficial for multi-modal summarization and superior to existing methods. Moreover, we collect a novel dataset of documents and videos, which provides a new resource for future study.

Code Implementations: 1 repo
