GHR-VQA: Graph-guided Hierarchical Relational Reasoning for Video Question Answering
This work addresses video understanding for AI systems, offering an incremental but interpretable approach to modeling human actions in videos.
The paper tackles video question answering by proposing GHR-VQA, a human-centric framework that uses scene graphs to capture human-object interactions, achieving a 7.3% improvement in object-relation reasoning over state-of-the-art methods on the AGQA dataset.
We propose GHR-VQA, Graph-guided Hierarchical Relational Reasoning for Video Question Answering (Video QA), a novel human-centric framework that incorporates scene graphs to capture intricate human-object interactions within video sequences. Unlike traditional pixel-based methods, our approach represents each frame as a scene graph and links human nodes across frames to a global root, forming a video-level graph and enabling cross-frame reasoning centered on human actors. The video-level graphs are then processed by Graph Neural Networks (GNNs), which transform them into rich, context-aware embeddings for efficient processing. Finally, these embeddings are integrated with question features in a hierarchical network operating across different abstraction levels, enhancing both local and global understanding of video content. This explicit human-rooted structure enhances interpretability by decomposing actions into human-object interactions and enables a deeper understanding of spatiotemporal dynamics. We validate our approach on the Action Genome Question Answering (AGQA) dataset, achieving significant performance improvements, including a 7.3% improvement in object-relation reasoning over the state of the art.
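The human-rooted graph construction described above can be illustrated with a minimal sketch. It assumes each frame's scene graph is given as a list of (subject, relation, object) triples, and that human nodes are identified by a label such as "person". The names `VideoGraph`, `build_video_graph`, and the `"has_human"` edge type are hypothetical and chosen for illustration only; the abstract does not specify the actual data structures, edge directions, or node features used in GHR-VQA.

```python
from dataclasses import dataclass, field


@dataclass
class VideoGraph:
    """Toy video-level graph (hypothetical structure, not the paper's)."""
    nodes: list = field(default_factory=list)  # (frame_index or None, label)
    edges: list = field(default_factory=list)  # (src_id, dst_id, relation)

    def add_node(self, frame_index, label):
        self.nodes.append((frame_index, label))
        return len(self.nodes) - 1


def build_video_graph(frames, human_label="person"):
    """Merge per-frame scene graphs and link every human node to one root.

    `frames` is a list of frames; each frame is a list of
    (subject, relation, object) triples. This input format is an assumption.
    """
    graph = VideoGraph()
    root = graph.add_node(None, "ROOT")  # global root shared by all frames
    for t, triples in enumerate(frames):
        local = {}  # label -> node id, scoped to the current frame
        for subj, rel, obj in triples:
            for label in (subj, obj):
                if label not in local:
                    local[label] = graph.add_node(t, label)
                    if label == human_label:
                        # Cross-frame link: root -> human node of frame t
                        graph.edges.append((root, local[label], "has_human"))
            # Within-frame relation edge from the scene graph
            graph.edges.append((local[subj], local[obj], rel))
    return graph, root


# Two frames of a made-up clip
frames = [
    [("person", "holding", "cup")],
    [("person", "drinking_from", "cup"), ("person", "sitting_on", "chair")],
]
graph, root = build_video_graph(frames)
root_edges = [e for e in graph.edges if e[0] == root]
```

In this sketch, objects are kept as separate nodes per frame and only human nodes connect to the root, so any path between frames passes through a human actor. A GNN would then operate on the node and edge lists; that step is omitted here because the abstract does not describe the specific architecture.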