CV | AI | May 7, 2024

Space-time Reinforcement Network for Video Object Segmentation

arXiv:2405.04042v1 | h-index: 9 | ICME
Originality: Incremental advance
AI Analysis

This work addresses video object segmentation for computer vision applications, offering incremental improvements over existing memory-based methods.

The paper tackles the problem of maintaining space-time coherence and avoiding mismatching in video object segmentation by generating auxiliary frames and using prototype-level matching, achieving a J&F score of 86.4% on DAVIS 2017 and 85.0% on YouTube VOS 2018 with 32+ FPS inference speed.
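For readers unfamiliar with the metric, the J&F score reported above is, in the DAVIS benchmark, the mean of region similarity J (the Jaccard index, or intersection over union, of predicted and ground-truth masks) and contour accuracy F. The sketch below is illustrative only: the function names are hypothetical, masks are flat 0/1 lists, and F is passed in as a precomputed number because its boundary-matching computation is not reproduced here.

```python
def region_similarity(pred, gt):
    """Jaccard index J between two binary masks given as flat 0/1 lists."""
    if len(pred) != len(gt):
        raise ValueError("masks must have the same number of pixels")
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    # Two empty masks are treated as a perfect match (an assumption of this sketch).
    return 1.0 if union == 0 else inter / union


def j_and_f(j, f):
    """J&F is the arithmetic mean of region similarity J and contour accuracy F."""
    return (j + f) / 2.0


# Toy example: one of the two predicted pixels overlaps the single true pixel.
pred_mask = [1, 1, 0, 0]
gt_mask = [1, 0, 0, 0]
j = region_similarity(pred_mask, gt_mask)

# Hypothetical contour accuracy, supplied by hand for illustration.
f = 0.7
score = j_and_f(j, f)
```

In benchmark practice both J and F are averaged over objects and frames before being combined; the toy above uses a single mask pair.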

Recent video object segmentation (VOS) networks typically use memory-based methods: for each query frame, the mask is predicted by space-time matching against memory frames. Although these methods perform well, they suffer from two issues: 1) challenging data can destroy the space-time coherence between adjacent video frames; 2) pixel-level matching leads to undesired mismatches caused by noise or distractors. To address these issues, we first generate an auxiliary frame between adjacent frames, which serves as an implicit short-term temporal reference for the query frame. Next, we learn a prototype for each video object, so that matching between the query and memory can be performed at the prototype level. Experiments show that our network outperforms the state-of-the-art method on DAVIS 2017, achieving a J&F score of 86.4%, and attains a competitive 85.0% on YouTube VOS 2018. In addition, our network runs at a high inference speed of 32+ FPS.
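The abstract does not say how prototypes are built or compared, so the following is only a minimal sketch of the general idea of prototype-level matching, under stated assumptions: each object's prototype is the masked average of memory-frame pixel features, and each query pixel is assigned to the prototype with the highest cosine similarity. Both choices, and all function names, are assumptions made for illustration and are not taken from the paper.

```python
import math


def masked_average_prototype(features, mask):
    """Average the feature vectors of the pixels selected by a 0/1 mask."""
    selected = [f for f, m in zip(features, mask) if m]
    if not selected:
        raise ValueError("mask selects no pixels")
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_to_prototypes(query_features, prototypes):
    """Label each query pixel with the index of its most similar prototype."""
    labels = []
    for q in query_features:
        scores = [cosine(q, p) for p in prototypes]
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return labels


# Toy memory frame: four pixels with 2-D features and one mask per object.
memory_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
object_masks = [[1, 1, 0, 0], [0, 0, 1, 1]]
prototypes = [masked_average_prototype(memory_features, m) for m in object_masks]

# Toy query frame: three pixels to be assigned to one of the two objects.
query_features = [[0.8, 0.2], [0.2, 0.8], [1.0, 0.1]]
labels = match_to_prototypes(query_features, prototypes)
```

Comparing each query pixel with one vector per object, rather than with every memory pixel, is what makes this kind of matching less sensitive to isolated noisy or distractor pixels in memory: a single outlier shifts the averaged prototype only slightly.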

