CV | Oct 12, 2024

GEM-VPC: A dual Graph-Enhanced Multimodal integration for Video Paragraph Captioning

arXiv:2410.09377v12.01 citations | h-index: 26

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of generating coherent paragraph captions from videos, with applications such as video indexing and accessibility. It represents an incremental improvement over existing methods.

The paper tackles video paragraph captioning by developing a dual graph-enhanced multimodal integration framework that constructs temporal and theme graphs to capture events and word correlations, achieving superior performance on benchmark datasets.

Video Paragraph Captioning (VPC) aims to generate paragraph captions that summarise the key events within a video. Despite recent advancements, challenges persist, notably in effectively utilising the multimodal signals inherent in videos and in addressing the long-tail distribution of words. The paper introduces a novel multimodal integrated caption generation framework for VPC that leverages information from various modalities and external knowledge bases. The framework constructs two graphs: a 'video-specific' temporal graph capturing major events and the interactions between multimodal information and commonsense knowledge, and a 'theme graph' representing correlations between words of a specific theme. These graphs serve as input to a transformer network with a shared encoder-decoder architecture. The authors also introduce a node selection module that enhances decoding efficiency by selecting the most relevant nodes from the graphs. The reported results demonstrate superior performance across benchmark datasets.
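The abstract does not say how the node selection module scores relevance, so the following is only a minimal sketch of the general idea: rank graph nodes against a query vector and keep the top k. The cosine-similarity scoring, the function names (cosine, select_nodes), and the toy embeddings are all assumptions made for illustration, not the paper's method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_nodes(node_embeddings, query, k):
    """Return the ids of the k nodes most similar to the query, best first.

    Hypothetical stand-in for a node selection step: the scoring rule
    (cosine similarity) is an assumption, not taken from the paper.
    """
    ranked = sorted(
        node_embeddings.items(),
        key=lambda item: cosine(item[1], query),
        reverse=True,
    )
    return [node_id for node_id, _ in ranked[:k]]


# Toy graph nodes with made-up 3-d embeddings (illustrative only).
nodes = {
    "person": [1.0, 0.0, 0.0],
    "guitar": [0.9, 0.1, 0.0],
    "ocean": [0.0, 0.0, 1.0],
}

# Keep only the two nodes closest to the query before decoding.
selected = select_nodes(nodes, query=[1.0, 0.0, 0.0], k=2)
print(selected)
```

Passing fewer nodes to the decoder is what would reduce decoding cost in a setup like this. A learned scorer could replace the fixed similarity function without changing the surrounding interface.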
