CV CL MM

Apr 7, 2022

MHMS: Multimodal Hierarchical Multimedia Summarization

Jielin Qiu, Jiacheng Zhu, Mengdi Xu, Franck Dernoncourt, Trung Bui, Zhaowen Wang, Bo Li, Ding Zhao, Hailin Jin

arXiv:2204.03734v1

11.214 citations

h-index: 70

Originality: Incremental advance

AI Analysis

This work addresses the need for automated multimodal summarization in applications like news articles and online videos, but it appears incremental as it builds on existing methods with a hybrid approach.

The authors tackled the problem of generating both video and textual summaries from multimedia content by proposing the MHMS framework, which uses cross-domain alignment based on optimal transport to relate the visual and language domains, and demonstrated its effectiveness on three multimodal datasets.

Multimedia summarization with multimodal output can play an essential role in real-world applications, e.g., automatically generating cover images and titles for news articles or providing introductions to online videos. In this work, we propose a multimodal hierarchical multimedia summarization (MHMS) framework in which the visual and language domains interact to generate both video and textual summaries. Our MHMS method contains segmentation and summarization modules for video and for text. It formulates a cross-domain alignment objective with an optimal transport distance, which leverages cross-domain interaction to generate the representative keyframe and textual summary. We evaluated MHMS on three recent multimodal datasets and demonstrated the effectiveness of our method in producing high-quality multimodal summaries.
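
The abstract does not spell out the alignment objective, so the following is only a minimal sketch of what an optimal transport distance between video and text features could look like. It assumes a cosine cost, uniform marginals, and entropy-regularized Sinkhorn iterations; none of these choices is confirmed by the source, and the function names (`cosine_cost`, `sinkhorn`, `best_frame_per_sentence`) are hypothetical.

```python
import math

def cosine_cost(frames, sentences):
    """Cost matrix: 1 - cosine similarity for every (frame, sentence) pair."""
    def norm(v):
        # Fall back to 1.0 for all-zero vectors to avoid division by zero.
        return math.sqrt(sum(x * x for x in v)) or 1.0
    return [[1.0 - sum(a * b for a, b in zip(f, s)) / (norm(f) * norm(s))
             for s in sentences] for f in frames]

def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropy-regularized OT with uniform marginals (assumed, not from the paper).

    Returns the transport plan and the transport cost sum(plan * cost).
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n                      # mass on each frame
    b = [1.0 / m] * m                      # mass on each sentence
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns toward the target marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    dist = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return plan, dist

def best_frame_per_sentence(plan):
    """For each sentence (column), index of the frame receiving the most mass."""
    n, m = len(plan), len(plan[0])
    return [max(range(n), key=lambda i: plan[i][j]) for j in range(m)]

# Toy 2-D features standing in for real frame and sentence embeddings.
frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
sentences = [[1.0, 0.1], [0.1, 1.0]]
plan, dist = sinkhorn(cosine_cost(frames, sentences))
keyframes = best_frame_per_sentence(plan)
```

In this sketch the transport cost `dist` plays the role of an alignment score (lower means the two modalities match better), and the plan gives a soft frame-to-sentence assignment from which a representative keyframe could be read off. How MHMS actually uses the distance during training or selection is not stated in the abstract.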
