AI · May 12

Towards Visually Grounded Multimodal Summarization via Cross-Modal Transformer and Gated Attention

Abid Ali, Diego Molla-Aliod, Usman Naseem

arXiv:2605.11753 · 39.0

Predicted impact: top 82% in AI · last 90 days

Originality: Incremental advance

AI Analysis

For researchers in multimodal summarization, this work addresses the problem of weak cross-modal grounding with a depth-aware fusion approach.

SPeCTrA-Sum improves multimodal summarization by aligning visual and textual features at multiple depths and selecting representative images via DPP distillation, outperforming prior methods in summary accuracy and image relevance.

Multimodal summarization requires models to jointly understand textual and visual inputs to generate concise, semantically coherent summaries. Existing methods often inject shallow visual features into deep language models, leading to representational mismatches and weak cross-modal grounding. We propose a unified framework that jointly performs text summarization and representative image selection. Our system, SPeCTrA-Sum (Sampler Perceiver with Cross-modal Transformer and gated Attention for Summarization), introduces two key innovations. First, a Deep Visual Processor (DVP) aligns the visual encoder with the language model at corresponding depths, enabling hierarchical, layer-wise fusion that preserves semantic consistency. Second, a lightweight Visual Relevance Predictor (VRP) selects salient and diverse images by distilling soft labels from a Determinantal Point Process (DPP) teacher. SPeCTrA-Sum is trained using a multi-objective loss that combines autoregressive summarization, cross-modal alignment, and DPP-based distillation. Experiments show that our system produces more accurate, visually grounded summaries and selects more representative images, demonstrating the benefits of depth-aware fusion and principled image selection for multimodal summarization.
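To make the idea of a DPP teacher concrete, the following is a minimal sketch of greedy DPP-style selection of salient and diverse images. It is not the authors' implementation: the abstract does not specify the kernel, the selection procedure, or how soft labels are derived, so the quality-times-cosine-similarity kernel, the greedy log-free determinant search, and every name here (`dpp_kernel`, `greedy_dpp_select`, `features`, `quality`) are illustrative assumptions. Feature vectors are assumed to be nonzero.

```python
import math


def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    d = 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d


def cosine(u, v):
    """Cosine similarity of two nonzero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def dpp_kernel(features, quality):
    """Assumed kernel: L[i][j] = q_i * q_j * cos(f_i, f_j)."""
    n = len(features)
    return [
        [quality[i] * quality[j] * cosine(features[i], features[j]) for j in range(n)]
        for i in range(n)
    ]


def greedy_dpp_select(kernel, k):
    """Greedily add the item that maximizes det of the selected submatrix."""
    selected = []
    for _ in range(k):
        best, best_det = None, 0.0
        for i in range(len(kernel)):
            if i in selected:
                continue
            idx = selected + [i]
            sub = [[kernel[r][c] for c in idx] for r in idx]
            d = det(sub)
            if d > best_det:
                best, best_det = i, d
        if best is None:  # remaining items add no volume (near-duplicates)
            break
        selected.append(best)
    return selected


# Hypothetical toy data: image 1 is a near-duplicate of image 0.
features = [[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]]
quality = [1.0, 0.9, 0.8]  # assumed per-image salience scores
picked = greedy_dpp_select(dpp_kernel(features, quality), k=2)
print(picked)  # [0, 2]
```

In this toy case the near-duplicate image is skipped even though its salience score is higher than that of the third image, which is the salience-plus-diversity trade-off the abstract attributes to DPP-based selection. A student such as the VRP would then be trained against targets derived from the teacher; that step is omitted here because the abstract does not describe how the soft labels are constructed.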
