CV · Nov 27, 2024

HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation

arXiv:2411.18042v2 · 18 citations · h-index: 9 · CVPR
Originality: Highly original
AI Analysis

This paper addresses video scene understanding for AI systems. It offers a novel approach to handling complex multi-object interactions, though its advance over existing multimodal LLMs for video tasks is incremental.

The paper tackles the problem of understanding complex multi-object interactions in video scenes by proposing HyperGLM, a method that integrates entity scene graphs and a procedural graph into a unified HyperGraph for reasoning with multimodal LLMs. It outperforms state-of-the-art methods across five tasks on a new dataset with 1.9M frames.

Multimodal LLMs have advanced vision-language tasks but still struggle with understanding video scenes. To bridge this gap, Video Scene Graph Generation (VidSGG) has emerged to capture multi-object relationships across video frames. However, prior methods rely on pairwise connections, limiting their ability to handle complex multi-object interactions and reasoning. To this end, we propose Multimodal LLMs on a Scene HyperGraph (HyperGLM), promoting reasoning about multi-way interactions and higher-order relationships. Our approach uniquely integrates entity scene graphs, which capture spatial relationships between objects, with a procedural graph that models their causal transitions, forming a unified HyperGraph. Significantly, HyperGLM enables reasoning by injecting this unified HyperGraph into LLMs. Additionally, we introduce a new Video Scene Graph Reasoning (VSGR) dataset that features 1.9M frames from third-person, egocentric, and drone views and supports five tasks: Scene Graph Generation, Scene Graph Anticipation, Video Question Answering, Video Captioning, and Relation Reasoning. Empirically, HyperGLM consistently outperforms state-of-the-art methods across five tasks, effectively modeling and reasoning about complex relationships in diverse video scenes.
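
To make the contrast between pairwise edges and multi-way hyperedges concrete, here is a minimal Python sketch. It is not the authors' implementation: the `Triplet` and `SceneHyperGraph` names, the rule that groups entities sharing a predicate in one frame into a hyperedge, and the counting of predicate changes as "procedural" transitions are all illustrative assumptions.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Triplet:
    """One pairwise scene-graph edge: (subject, predicate, object) in a frame."""
    frame: int
    subject: str
    predicate: str
    obj: str


@dataclass
class SceneHyperGraph:
    """Hypothetical container; the grouping rules below are assumptions."""
    triplets: list = field(default_factory=list)

    def add(self, frame, subject, predicate, obj):
        self.triplets.append(Triplet(frame, subject, predicate, obj))

    def hyperedges(self):
        # Assumption: entities sharing a predicate within one frame form
        # a single multi-way edge. Keep only groups with more than two
        # entities, since two entities are already covered by a pairwise edge.
        groups = defaultdict(set)
        for t in self.triplets:
            groups[(t.frame, t.predicate)].update((t.subject, t.obj))
        return {k: frozenset(v) for k, v in groups.items() if len(v) > 2}

    def transitions(self):
        # Assumption: a "procedural" edge is a change of predicate for the
        # same (subject, object) pair between consecutive observations.
        by_pair = defaultdict(list)
        for t in sorted(self.triplets, key=lambda t: t.frame):
            by_pair[(t.subject, t.obj)].append(t.predicate)
        counts = Counter()
        for preds in by_pair.values():
            for prev, nxt in zip(preds, preds[1:]):
                if prev != nxt:
                    counts[(prev, nxt)] += 1
        return counts


# Two people holding, then carrying, the same box.
g = SceneHyperGraph()
g.add(0, "person_1", "holding", "box")
g.add(0, "person_2", "holding", "box")
g.add(1, "person_1", "carrying", "box")
g.add(1, "person_2", "carrying", "box")
```

In this toy example, `g.hyperedges()` groups `person_1`, `person_2`, and `box` into one three-way edge per frame, which a purely pairwise graph would represent only as two separate edges, and `g.transitions()` records the `holding` to `carrying` change for both people.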

