CV · Mar 7, 2022

End-to-end video instance segmentation via spatial-temporal graph neural networks

arXiv:2203.03145v1 · 30 citations · h-index: 42 · Has Code
Originality: Highly original
AI Analysis

This work addresses the challenge of integrating spatial-temporal information for video instance segmentation, offering a single framework in place of pipelines that rely on single-frame cues or treat tracking as a separate post-processing step.

The paper tackles video instance segmentation by proposing a unified graph neural network framework that jointly optimizes detection, segmentation, and tracking, achieving 35.2% AP on the YouTube-VIS validation set with a ResNet-50 backbone at 22 FPS.

Video instance segmentation is a challenging task that extends image instance segmentation to the video domain. Existing methods either rely only on single-frame information for the detection and segmentation subproblems or handle tracking as a separate post-processing step, which limits their ability to fully leverage and share useful spatial-temporal information across all the subproblems. In this paper, we propose a novel graph-neural-network (GNN) based method to address this limitation. Specifically, graph nodes representing instance features are used for detection and segmentation, while graph edges representing instance relations are used for tracking. Both inter- and intra-frame information is propagated and shared via graph updates, and all the subproblems (i.e., detection, segmentation and tracking) are jointly optimized in a unified framework. Our method substantially improves on existing methods on the YouTube-VIS validation set, achieving 35.2% AP with a ResNet-50 backbone while operating at 22 FPS. Code is available at http://github.com/lucaswithai/visgraph.git.
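To make the node/edge division of labour concrete, here is a minimal, hypothetical sketch in plain Python. It is not the authors' implementation: the paper learns its node and edge updates with neural networks and also attaches detection and segmentation heads to the nodes, whereas this sketch assumes fixed cosine-similarity attention for one round of message passing over the instance nodes of two frames (covering both intra- and inter-frame edges), then scores the inter-frame edges and links instances with greedy matching. The function names `cosine`, `fully_connected`, `update_nodes` and `associate`, and the `alpha` and `threshold` parameters, are all illustrative assumptions.

```python
import math
from typing import Dict, List, Optional, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def fully_connected(n: int) -> List[Tuple[int, int]]:
    """Directed edges between every pair of distinct nodes (intra- and inter-frame)."""
    return [(i, j) for i in range(n) for j in range(n) if i != j]


def update_nodes(nodes: List[Vector], edges: List[Tuple[int, int]],
                 alpha: float = 0.5) -> List[Vector]:
    """One message-passing round: mix each node with a weighted mean of its neighbours."""
    incoming: Dict[int, List[int]] = {i: [] for i in range(len(nodes))}
    for src, dst in edges:
        incoming[dst].append(src)
    updated: List[Vector] = []
    for i, h in enumerate(nodes):
        nbrs = incoming[i]
        if not nbrs:
            updated.append(list(h))
            continue
        # Assumption: softmax over cosine similarities stands in for learned edge weights.
        logits = [cosine(h, nodes[j]) for j in nbrs]
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        msg = [sum(w / z * nodes[j][k] for w, j in zip(exps, nbrs)) for k in range(len(h))]
        updated.append([alpha * h[k] + (1 - alpha) * msg[k] for k in range(len(h))])
    return updated


def associate(prev: List[Vector], curr: List[Vector], rounds: int = 1,
              threshold: float = 0.5) -> List[Optional[int]]:
    """Link each current-frame instance to a previous-frame instance, or None for a new track."""
    nodes = prev + curr
    edges = fully_connected(len(nodes))
    for _ in range(rounds):
        nodes = update_nodes(nodes, edges)
    p = len(prev)
    # Score every inter-frame edge (current c -> previous q) on the updated node features.
    pairs = sorted(
        ((cosine(nodes[p + c], nodes[q]), c, q) for c in range(len(curr)) for q in range(p)),
        reverse=True,
    )
    assignment: List[Optional[int]] = [None] * len(curr)
    used = set()
    # Greedy one-to-one matching, highest-scoring edges first.
    for score, c, q in pairs:
        if score < threshold:
            break
        if assignment[c] is None and q not in used:
            assignment[c] = q
            used.add(q)
    return assignment
```

For example, with `prev = [[1.0, 0.0], [0.0, 1.0]]` and `curr = [[0.9, 0.1], [0.1, 0.9]]`, `associate(prev, curr)` returns `[0, 1]`, linking each current instance to its most similar predecessor. Greedy matching is used here only for brevity; an optimal assignment solver such as the Hungarian algorithm is a common alternative when scoring edges for tracking.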

Code implementations: 1 repo