CL · CV · LG · MM · Oct 15, 2020

MAST: Multimodal Abstractive Summarization with Trimodal Hierarchical Attention

arXiv:2010.08021v1999 citations
AI Analysis

This work addresses the challenge of generating summaries from multimodal videos for language understanding applications, representing an incremental advance by adding audio to existing text-video methods.

The paper tackles the problem of multimodal abstractive text summarization by incorporating audio, text, and video modalities, achieving a 2.51-point improvement in Content F1 score and a 1.00-point improvement in Rouge-L score over the state-of-the-art on the How2 dataset.

This paper presents MAST, a new model for Multimodal Abstractive Text Summarization that utilizes information from all three modalities (text, audio, and video) in a multimodal video. Prior work on multimodal abstractive text summarization utilized information from only the text and video modalities. We examine the usefulness and challenges of deriving information from the audio modality and present a sequence-to-sequence trimodal hierarchical attention-based model that overcomes these challenges by letting the model pay more attention to the text modality. MAST outperforms the current state-of-the-art model (video-text) by 2.51 points in terms of Content F1 score and 1.00 points in terms of Rouge-L score on the How2 dataset for multimodal language understanding.
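To make the idea of hierarchical attention over three modalities concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the MAST implementation: the abstract does not specify the scoring function or how the text modality is favoured, so the dot-product scoring, the function names (`attend`, `hierarchical_attention`), and the additive `modality_bias` are all hypothetical choices made for this example. The first level attends over the encoder states of each modality separately; the second level attends over the three resulting context vectors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, states, bias=None):
    """Dot-product attention (an assumed scoring function).

    Returns the weighted-average context vector and the attention weights.
    """
    scores = [dot(query, s) for s in states]
    if bias is not None:
        scores = [sc + b for sc, b in zip(scores, bias)]
    weights = softmax(scores)
    dim = len(states[0])
    context = [sum(w * s[i] for w, s in zip(weights, states)) for i in range(dim)]
    return context, weights


def hierarchical_attention(query, modalities, modality_bias=None):
    """Two-level attention over a dict of modality name -> encoder states.

    Level 1: one context vector per modality.
    Level 2: attention over those context vectors.
    modality_bias is a hypothetical knob (not from the paper) that adds a
    constant to a modality's level-2 score, e.g. to favour text.
    """
    names = list(modalities)
    contexts = [attend(query, modalities[n])[0] for n in names]
    bias = [modality_bias.get(n, 0.0) for n in names] if modality_bias else None
    fused, weights = attend(query, contexts, bias)
    return fused, dict(zip(names, weights))


# Toy decoder state and encoder states (made-up numbers, 2-dimensional).
query = [1.0, 0.0]
modalities = {
    "text": [[1.0, 0.0], [0.0, 1.0]],
    "audio": [[0.5, 0.5]],
    "video": [[0.0, 1.0], [0.2, 0.8]],
}
fused, weights = hierarchical_attention(query, modalities, {"text": 1.0})
print(weights)
```

In a trained model the queries, encoder states, and any modality preference would be learned rather than set by hand; the bias term here only shows one simple way a second-level attention could be tilted toward text.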

Code Implementations: 1 repo
