CL, CV, LG, MM | Jun 19, 2019

Multimodal Abstractive Summarization for How2 Videos

arXiv:1906.07901v1 | 134 citations
Originality: Incremental advance
AI Analysis

This paper addresses the problem of generating fluent summaries from multimodal data, which is relevant to video content creators and viewers. It appears incremental, as it builds on existing sequence-to-sequence and attention methods.

The paper tackles abstractive summarization of open-domain instructional videos by integrating video and audio transcripts in a multi-source sequence-to-sequence model with hierarchical attention. It reports pilot experiments on the How2 corpus and proposes a new metric, Content F1, that measures semantic adequacy.

In this paper, we study abstractive summarization for open-domain videos. Unlike traditional text news summarization, the goal is less to "compress" text information than to provide a fluent textual summary of information that has been collected and fused from different source modalities, in our case video and audio transcripts (or text). We show how a multi-source sequence-to-sequence model with hierarchical attention can integrate information from different modalities into a coherent output, compare various models trained with different modalities, and present pilot experiments on the How2 corpus of instructional videos. We also propose a new evaluation metric (Content F1) for the abstractive summarization task that measures semantic adequacy rather than the fluency of the summaries, which is covered by metrics like ROUGE and BLEU.

