Exploiting Long-Term Dependencies for Generating Dynamic Scene Graphs
This work addresses the problem of understanding complex video scenes for applications such as robotics and autonomous driving, and represents an incremental improvement over existing methods.
The paper tackles dynamic scene graph generation from videos by capturing long-term temporal dependencies, and the proposed DSG-DETR method significantly outperforms state-of-the-art methods on the Action Genome benchmark dataset.
Dynamic scene graph generation from a video is challenging due to the temporal dynamics of the scene and the inherent temporal fluctuations of predictions. We hypothesize that capturing long-term temporal dependencies is the key to effective generation of dynamic scene graphs. We propose to learn the long-term dependencies in a video by capturing the object-level consistency and inter-object relationship dynamics over object-level long-term tracklets using transformers. Experimental results demonstrate that our Dynamic Scene Graph Detection Transformer (DSG-DETR) outperforms state-of-the-art methods by a significant margin on the benchmark dataset Action Genome. Our ablation studies validate the effectiveness of each component of the proposed approach. The source code is available at https://github.com/Shengyu-Feng/DSG-DETR.
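The idea of object-level long-term tracklets can be illustrated with a minimal sketch. The abstract does not say how tracklets are constructed, so the greedy IoU matching below is an assumed stand-in rather than the DSG-DETR procedure, and the names `iou`, `build_tracklets`, and `iou_threshold` are hypothetical. The sketch only groups per-frame detections of the same class into tracklets; the transformer that would then model consistency and relationship dynamics along each tracklet is not shown.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Detection = Tuple[str, Box]              # (class label, box)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def build_tracklets(frames: List[List[Detection]],
                    iou_threshold: float = 0.3) -> List[Dict]:
    """Group per-frame detections into tracklets (assumed greedy matching).

    A detection joins the same-class tracklet whose most recent box overlaps
    it best, even if that box is several frames old, so a tracklet can
    survive frames in which the object was missed.
    """
    tracklets: List[Dict] = []
    for t, detections in enumerate(frames):
        # Score every (tracklet, detection) pair with matching class labels.
        candidates = []
        for d_idx, (label, box) in enumerate(detections):
            for k, track in enumerate(tracklets):
                if track["label"] != label:
                    continue
                score = iou(track["boxes"][-1], box)
                if score >= iou_threshold:
                    candidates.append((score, k, d_idx))

        # Accept the best-overlapping pairs first, one match per side.
        candidates.sort(reverse=True)
        used_tracks, used_dets = set(), set()
        for score, k, d_idx in candidates:
            if k in used_tracks or d_idx in used_dets:
                continue
            tracklets[k]["frames"].append(t)
            tracklets[k]["boxes"].append(detections[d_idx][1])
            used_tracks.add(k)
            used_dets.add(d_idx)

        # Unmatched detections start new tracklets.
        for d_idx, (label, box) in enumerate(detections):
            if d_idx not in used_dets:
                tracklets.append({"label": label, "frames": [t], "boxes": [box]})
    return tracklets


# Toy video: the cup is not detected in the middle frame.
frames = [
    [("person", (0, 0, 10, 10)), ("cup", (20, 20, 30, 30))],
    [("person", (1, 0, 11, 10))],
    [("person", (2, 0, 12, 10)), ("cup", (21, 20, 31, 30))],
]
tracklets = build_tracklets(frames)
for track in tracklets:
    print(track["label"], track["frames"])
```

In this toy input the cup tracklet spans the frame where the detection is missing, which is the kind of long-range link a sequence model over tracklets could exploit. A globally optimal assignment (for example, Hungarian matching) could replace the greedy step without changing the interface.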