CV · AI · Mar 6

Cut to the Chase: Training-free Multimodal Summarization via Chain-of-Events

arXiv:2603.06213v1 · 2 citations · h-index: 1 · Has Code
Predicted impact: top 7% in CV · last 90 days · Originality: Highly original
AI Analysis

This work addresses multimodal summarization for videos, transcripts, and images, offering a novel training-free approach with strong cross-domain generalization.

The paper introduces CoE, a training-free framework for multimodal summarization that reasons through a Chain-of-Events guided by a Hierarchical Event Graph. It targets three weaknesses of existing approaches: reliance on domain-specific supervision, weak cross-modal grounding, and flat temporal modeling. Across eight datasets it reports average gains of +3.04 ROUGE, +9.51 CIDEr, and +1.88 BERTScore over state-of-the-art video CoT baselines.

Multimodal Summarization (MMS) aims to generate concise textual summaries by understanding and integrating information across videos, transcripts, and images. However, existing approaches still suffer from three main challenges: (1) reliance on domain-specific supervision, (2) implicit fusion with weak cross-modal grounding, and (3) flat temporal modeling without event transitions. To address these issues, we introduce **CoE**, a training-free MMS framework that performs structured reasoning through a **Chain-of-Events** guided by a Hierarchical Event Graph (HEG). The HEG encodes textual semantics into an explicit event hierarchy that scaffolds cross-modal grounding and temporal reasoning. Guided by this structure, **CoE** localizes key visual cues, models event evolution and causal transitions, and refines outputs via lightweight style adaptation for domain alignment. Extensive experiments on eight diverse datasets demonstrate that **CoE** consistently outperforms state-of-the-art video CoT baselines, achieving average gains of **+3.04 ROUGE**, **+9.51 CIDEr**, and **+1.88 BERTScore**, highlighting its robustness, interpretability, and cross-domain generalization. Our code is available at https://github.com/youxiaoxing/CoE.
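
As a rough illustration of what an explicit event hierarchy with event-to-event transitions could look like, the following is a minimal Python sketch. It is not taken from the CoE codebase, and the abstract does not specify how the HEG is built or traversed; `EventNode`, `leaf_events`, `event_chain`, the `start` timestamp field, and the cooking example are all hypothetical names and data chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class EventNode:
    """One node in a toy event hierarchy (hypothetical, not the paper's structure)."""
    label: str
    start: float = 0.0  # assumed start time in seconds, used only for ordering
    children: List["EventNode"] = field(default_factory=list)


def leaf_events(node: EventNode) -> List[EventNode]:
    """Collect the leaf events under `node` and sort them by start time."""
    if not node.children:
        return [node]
    leaves: List[EventNode] = []
    for child in node.children:
        leaves.extend(leaf_events(child))
    return sorted(leaves, key=lambda e: e.start)


def event_chain(root: EventNode) -> List[Tuple[str, str]]:
    """Pair consecutive leaf events so each transition is explicit."""
    leaves = leaf_events(root)
    return [(a.label, b.label) for a, b in zip(leaves, leaves[1:])]


# Illustrative data: a two-level hierarchy for an imagined cooking video.
root = EventNode("cooking video", children=[
    EventNode("preparation", children=[
        EventNode("chop vegetables", start=5.0),
        EventNode("heat pan", start=20.0),
    ]),
    EventNode("cooking", children=[
        EventNode("stir-fry", start=35.0),
    ]),
])
chain = event_chain(root)
```

Here `chain` holds the ordered transitions between leaf events, which is the kind of structure a summarizer could walk instead of treating the video as a flat sequence of frames. How CoE itself grounds visual cues or models causal transitions is described only at a high level in the abstract, so this sketch covers the data structure alone.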
