MAST: Multimodal Abstractive Summarization with Trimodal Hierarchical Attention
This work addresses the challenge of generating summaries from multimodal videos for language understanding applications. It is an incremental advance that adds the audio modality to existing text-video methods.
The paper tackles multimodal abstractive text summarization by combining the audio, text, and video modalities, achieving a 2.51-point improvement in Content F1 score and a 1.00-point improvement in ROUGE-L score over the state of the art on the How2 dataset.
This paper presents MAST, a new model for Multimodal Abstractive Text Summarization that utilizes information from all three modalities (text, audio, and video) in a multimodal video. Prior work on multimodal abstractive text summarization utilized information only from the text and video modalities. We examine the usefulness and challenges of deriving information from the audio modality and present a sequence-to-sequence trimodal hierarchical attention-based model that overcomes these challenges by letting the model pay more attention to the text modality. MAST outperforms the current state-of-the-art video-text model by 2.51 points in Content F1 score and 1.00 points in ROUGE-L score on the How2 dataset for multimodal language understanding.
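The abstract does not spell out how the trimodal hierarchical attention is computed, so the following is only a minimal sketch of the general idea under stated assumptions, not the MAST implementation. It assumes dot-product scoring, encoder states that already share one dimension across modalities, and a hypothetical additive `bias` on the text score as one simple way to let the model weight text more heavily. The names `attend`, `hierarchical_attend`, and `bias` are illustrative and do not come from the paper.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, states):
    # First level: dot-product attention over one modality's encoder states.
    weights = softmax([dot(query, s) for s in states])
    dim = len(states[0])
    context = [sum(w * s[i] for w, s in zip(weights, states)) for i in range(dim)]
    return context, weights

def hierarchical_attend(query, modalities, bias=None):
    # Second level: attention over the per-modality context vectors.
    # `bias` (an assumption of this sketch) adds a constant to a modality's score.
    bias = bias or {}
    names = list(modalities)
    contexts = [attend(query, modalities[n])[0] for n in names]
    scores = [dot(query, c) + bias.get(n, 0.0) for n, c in zip(names, contexts)]
    weights = softmax(scores)
    dim = len(contexts[0])
    fused = [sum(w * c[i] for w, c in zip(weights, contexts)) for i in range(dim)]
    return fused, dict(zip(names, weights))

# Toy decoder query and toy encoder states (made-up numbers, for illustration only).
query = [1.0, 0.0]
modalities = {
    "text": [[1.0, 0.0], [0.0, 1.0]],
    "audio": [[0.5, 0.5]],
    "video": [[0.0, 1.0], [0.2, 0.8]],
}
fused, weights = hierarchical_attend(query, modalities, bias={"text": 1.0})
_, unbiased = hierarchical_attend(query, modalities)
```

In this sketch the decoder would consume `fused` at each step. Comparing `weights` with `unbiased` shows how the extra text score shifts attention mass toward the text modality. A trained model would learn its scoring functions and projections rather than use fixed dot products.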