Investigation of Frame Differences as Motion Cues for Video Object Segmentation
This work addresses the need for real-time video object segmentation on resource-constrained edge devices. Its contribution is incremental: it adapts an existing method to a new input type.
The study addresses the computational cost of optical flow as a motion cue in automatic video object segmentation by proposing frame differences as an alternative. The resulting model performs comparably to optical flow-based models, especially on videos from stationary cameras.
Automatic Video Object Segmentation (AVOS) refers to the task of autonomously segmenting target objects in video sequences without relying on human-provided annotations in the first frames. Motion information is crucial in AVOS, and optical flow is commonly used to capture motion cues. However, computing optical flow is resource-intensive, which makes it unsuitable for real-time applications, especially on edge devices with limited computational resources. In this study, we propose using frame differences as an alternative to optical flow for motion cue extraction. We developed an extended U-Net-like AVOS model that takes the frame to be segmented and a frame difference as inputs and outputs an estimated segmentation map. Our experimental results demonstrate that the proposed model achieves performance comparable to that of a model taking optical flow as input, particularly on videos captured by stationary cameras. These results suggest that frame differences are useful motion cues when computational resources are limited.
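
To make the input construction concrete, the following is a minimal sketch of how a frame difference could be computed and paired with the frame to be segmented. It is an illustration under stated assumptions, not the implementation used in this study: the use of the absolute per-channel difference between consecutive frames, the scaling to [0, 1], the channel-wise concatenation, and the function names `frame_difference` and `build_model_input` are all hypothetical choices. The abstract does not specify how the two inputs are preprocessed or fused inside the network.

```python
import numpy as np


def frame_difference(prev_frame: np.ndarray, curr_frame: np.ndarray) -> np.ndarray:
    """Absolute per-pixel difference between two uint8 frames, scaled to [0, 1].

    Assumption: an absolute difference of consecutive frames is used as the
    motion cue. Frames are cast to float32 before subtracting so that uint8
    arithmetic cannot wrap around.
    """
    if prev_frame.shape != curr_frame.shape:
        raise ValueError("frames must have the same shape")
    prev = prev_frame.astype(np.float32) / 255.0
    curr = curr_frame.astype(np.float32) / 255.0
    return np.abs(curr - prev)


def build_model_input(prev_frame: np.ndarray, curr_frame: np.ndarray) -> np.ndarray:
    """Stack the current frame and the frame difference along the channel axis.

    Assumption: the two inputs are simply concatenated (H x W x 3 each,
    giving H x W x 6). A two-branch encoder would be an equally plausible
    design; the source does not say which is used.
    """
    curr = curr_frame.astype(np.float32) / 255.0
    diff = frame_difference(prev_frame, curr_frame)
    return np.concatenate([curr, diff], axis=-1)
```

Unlike optical flow, this cue needs only one subtraction per pixel, which reflects the computational motivation given above. It also responds to any change in pixel intensity, including changes caused by camera motion, which is consistent with the reported advantage on stationary-camera videos.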