CV | Aug 28, 2024

SITransformer: Shared Information-Guided Transformer for Extreme Multimodal Summarization

arXiv:2408.15829v3 | 2 citations | h-index: 13 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the challenge of producing accurate, concise summaries from multimodal data for applications like content analysis, but it appears incremental as it builds on existing transformer-based methods with novel filtering components.

The paper tackles the problem of extreme multimodal summarization with multimodal output (XMSMO), where existing methods are misled by topic-irrelevant information in multimodal data, leading to inaccurate summaries. The proposed SITransformer uses a shared information-guided pipeline to extract salient cross-modal content and improves summarization quality, with experiments showing significant enhancements for video and text summaries.

Extreme Multimodal Summarization with Multimodal Output (XMSMO) has become an attractive summarization approach: it integrates various types of information to create extremely concise yet informative summaries for individual modalities. Existing methods overlook the fact that multimodal data often contains topic-irrelevant information, which can mislead the model into producing inaccurate summaries, especially extremely short ones. In this paper, we propose SITransformer, a Shared Information-guided Transformer for extreme multimodal summarization. It has a shared-information-guided pipeline consisting of a cross-modal shared information extractor and a cross-modal interaction module. The extractor distills semantically shared salient information from the different modalities through a novel filtering process that combines a differentiable top-k selector with a shared-information-guided gating unit. As a result, the common, salient, and relevant content across modalities is identified. Next, a transformer with cross-modal attention performs intra- and inter-modality learning under the guidance of the shared information to produce the extreme summary. Comprehensive experiments demonstrate that SITransformer significantly enhances the summarization quality of both video and text summaries for XMSMO. Our code will be publicly available at https://github.com/SichengLeoLiu/MMAsia24-XMSMO.
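To make the filtering idea in the abstract concrete, the following NumPy sketch pairs a relaxed top-k selector with a similarity-based gate. It is an illustration under stated assumptions, not the authors' implementation: the successive-softmax relaxation is one common way to make top-k smooth, and the function names (`soft_top_k`, `shared_info_gate`), the temperature, the toy scores, and the way the shared vector is pooled are all hypothetical choices. The paper's selector and gating unit may differ.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()


def soft_top_k(scores, k, temperature=0.1):
    """Relaxed k-hot weights from k successive softmaxes (assumed relaxation)."""
    s = np.asarray(scores, dtype=float).copy()
    weights = np.zeros_like(s)
    for _ in range(k):
        p = softmax(s / temperature)
        weights += p
        # Push down the items just picked so the next pass prefers others.
        s += np.log(np.clip(1.0 - p, 1e-12, None))
    # The weights sum to k; for well-separated scores and a low temperature
    # they are close to a k-hot vector.
    return weights


def shared_info_gate(features, shared):
    """Scale each feature row by a sigmoid of its similarity to `shared`."""
    d = features.shape[-1]
    gate = 1.0 / (1.0 + np.exp(-(features @ shared) / np.sqrt(d)))
    return gate[:, None] * features


# Toy data: 5 items (e.g. frames or sentences) with 4-dimensional features.
rng = np.random.default_rng(0)
features = rng.normal(size=(5, 4))
scores = np.array([0.1, 2.0, -1.0, 1.5, 0.3])  # hypothetical salience scores

weights = soft_top_k(scores, k=2)
shared = weights @ features / weights.sum()  # soft average of selected items
gated = shared_info_gate(features, shared)   # same shape as `features`
```

Because every step is a smooth function of the scores, the same computation written in an autodiff framework would let gradients reach the scoring network, which is the practical reason for relaxing a hard top-k in the first place.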

Code Implementations: 1 repo
