CVLGSep 17, 2020

Discovering Dynamic Salient Regions for Spatio-Temporal Graph Neural Networks

arXiv:2009.08427v36 citations
Originality Incremental advance
AI Analysis

This addresses the challenge of node selection in graph-based video analysis for researchers in computer vision, offering an incremental improvement over existing methods.

The paper tackles the problem of defining nodes for spatio-temporal graph neural networks in videos without explicit structure, by learning dynamic salient regions without object-level supervision, and shows superior performance on video classification tasks in experiments on two datasets.

Graph Neural Networks are perfectly suited to capture latent interactions between various entities in the spatio-temporal domain (e.g. videos). However, when an explicit structure is not available, it is not obvious what atomic elements should be represented as nodes. Current works generally use pre-trained object detectors or fixed, predefined regions to extract graph nodes. Improving upon this, our proposed model learns nodes that dynamically attach to well-delimited salient regions, which are relevant for a higher-level task, without using any object-level supervision. Constructing these localized, adaptive nodes gives our model inductive bias towards object-centric representations and we show that it discovers regions that are well correlated with objects in the video. In extensive ablation studies and experiments on two challenging datasets, we show superior performance to previous graph neural networks models for video classification.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes