CV · Sep 21, 2023

Efficient Long-Short Temporal Attention Network for Unsupervised Video Object Segmentation

arXiv:2309.11707v1 · 22 citations · h-index: 51
Originality: Incremental advance
AI Analysis

This work addresses real-time unsupervised video object segmentation for video analysis applications. It is an incremental improvement over prior methods in efficiency and in its use of spatial-temporal context.

The paper tackles unsupervised video object segmentation by proposing an efficient Long-Short Temporal Attention network (LSTA) that captures long-term and short-term pixel relations to model appearance and motion patterns, achieving promising performance with nearly linear time complexity for real-time inference.

Unsupervised Video Object Segmentation (VOS) aims at identifying the contours of primary foreground objects in videos without any prior knowledge. However, previous methods do not fully use spatial-temporal context and fail to tackle this challenging task in real time. This motivates us to develop an efficient Long-Short Temporal Attention network (termed LSTA) for the unsupervised VOS task from a holistic view. Specifically, LSTA consists of two dominant modules, i.e., Long Temporal Memory and Short Temporal Attention. The former captures the long-term global pixel relations between the past frames and the current frame, which models constantly present objects by encoding appearance patterns. Meanwhile, the latter reveals the short-term local pixel relations between one nearby frame and the current frame, which models moving objects by encoding motion patterns. To speed up inference, an efficient projection and a locality-based sliding window are adopted to achieve nearly linear time complexity for the two lightweight modules, respectively. Extensive empirical studies on several benchmarks have demonstrated the promising performance and high efficiency of the proposed method.
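
To make the locality-based sliding window idea concrete, here is a minimal, dependency-free sketch of attention restricted to a small window around each pixel. It is an illustration under stated assumptions, not the paper's implementation: the function name `short_local_attention`, the `radius` parameter, the scaled dot-product scoring, and the nested-list feature layout are all hypothetical choices made for readability. The efficient projection used by the Long Temporal Memory module is not described in enough detail here to sketch, so it is left out.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def short_local_attention(query, key, value, radius=1):
    """Attend from each current-frame pixel to a small window in a nearby frame.

    query: H x W x C features of the current frame (nested lists).
    key, value: H x W x C features of one nearby frame.
    radius: assumed window half-size; the window is at most (2*radius+1)^2.
    """
    height, width = len(query), len(query[0])
    channels = len(query[0][0])
    scale = 1.0 / math.sqrt(channels)
    output = []
    for i in range(height):
        row = []
        for j in range(width):
            # Collect the in-bounds neighbours of (i, j) in the nearby frame.
            window = [
                (y, x)
                for y in range(max(0, i - radius), min(height, i + radius + 1))
                for x in range(max(0, j - radius), min(width, j + radius + 1))
            ]
            # Scaled dot-product similarity against each neighbour's key.
            scores = [
                scale * sum(q * k for q, k in zip(query[i][j], key[y][x]))
                for y, x in window
            ]
            weights = softmax(scores)
            # Weighted sum of the window's value vectors.
            row.append([
                sum(w * value[y][x][c] for w, (y, x) in zip(weights, window))
                for c in range(len(value[0][0]))
            ])
        output.append(row)
    return output
```

Because each pixel looks at no more than (2*radius+1)^2 neighbours, the cost grows linearly with the number of pixels for a fixed window size, whereas full attention between two frames grows quadratically. With `radius=0` each pixel attends only to the same location in the nearby frame, so the output is simply `value`.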

