CV · Apr 8

Insights from Visual Cognition: Understanding Human Action Dynamics with Overall Glance and Refined Gaze Transformer

arXiv:2604.06783 · 52.7 · h-index: 5 · Has Code
Predicted impact: top 66% in CV · last 90 days · Originality: Highly original
AI Analysis

This work addresses a bottleneck in video understanding for computer vision applications, offering a novel method to improve efficiency and performance, though it is incremental relative to existing Transformer-based approaches.

The paper tackles the problem of capturing motion and long-range dependencies in video tasks. It proposes a dual-path Transformer network (OG-ReG) that mimics human visual attention through glance and gaze behavior, and it reports state-of-the-art results on the Kinetics-400, Something-Something v2, and Diving-48 datasets.

Recently, Transformers have made significant progress in various vision tasks. To balance computation and efficiency in video tasks, recent works rely heavily on factorized or window-based self-attention. However, these approaches split spatiotemporal correlations between regions of interest in videos, limiting the models' ability to capture motion and long-range dependencies. In this paper, we argue that, as in the human visual system, the importance of temporal and spatial information varies across time scales, and attention is allocated sparsely over time through glance and gaze behavior. Is equal consideration of time and space crucial for success in video tasks? Motivated by this question, we propose a dual-path network called the Overall Glance and Refined Gaze (OG-ReG) Transformer. The Glance path extracts coarse-grained overall spatiotemporal information, while the Gaze path supplements the Glance path by providing local details. Our model achieves state-of-the-art results on Kinetics-400, Something-Something v2, and Diving-48. The code will be available at https://github.com/linuxsino/OG-ReG.
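The abstract does not spell out the layer design, so the following is only a minimal NumPy sketch of the general dual-path idea, not the authors' implementation. It assumes that the coarse path average-pools the video tokens before global attention, that the fine path uses non-overlapping local windows, and that the two paths are fused by summation with a residual connection. The function names (`glance_path`, `gaze_path`, `dual_path_block`), the pooling and window sizes, and the absence of learned projections are all illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Plain scaled dot-product attention; no learned projections (assumption).
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def glance_path(x, pool=(2, 2, 2)):
    """Hypothetical coarse path: pool tokens, then attend over all of them."""
    T, H, W, C = x.shape
    pt, ph, pw = pool
    # Average-pool over non-overlapping (pt, ph, pw) space-time cells.
    coarse = x.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C).mean(axis=(1, 3, 5))
    tokens = coarse.reshape(-1, C)
    out = attention(tokens, tokens, tokens).reshape(coarse.shape)
    # Nearest-neighbour upsampling back to the input resolution.
    return out.repeat(pt, axis=0).repeat(ph, axis=1).repeat(pw, axis=2)


def gaze_path(x, window=(2, 2, 2)):
    """Hypothetical fine path: attention inside each local window only."""
    T, H, W, C = x.shape
    wt, wh, ww = window
    w = x.reshape(T // wt, wt, H // wh, wh, W // ww, ww, C)
    # Group the tokens of each window together: (num_windows, window_len, C).
    w = w.transpose(0, 2, 4, 1, 3, 5, 6).reshape(-1, wt * wh * ww, C)
    out = attention(w, w, w)
    # Undo the window grouping to restore the (T, H, W, C) layout.
    out = out.reshape(T // wt, H // wh, W // ww, wt, wh, ww, C)
    return out.transpose(0, 3, 1, 4, 2, 5, 6).reshape(T, H, W, C)


def dual_path_block(x):
    # Fusion by summation with a residual connection (assumption).
    return x + glance_path(x) + gaze_path(x)


if __name__ == "__main__":
    video_tokens = np.random.default_rng(0).normal(size=(4, 8, 8, 16))
    print(dual_path_block(video_tokens).shape)
```

In this sketch the coarse path attends over 1/8 of the tokens (with 2×2×2 pooling), so its attention matrix has 1/64 as many entries as full-resolution global attention, while the fine path keeps full resolution but limits each token to its own window. Input dimensions must be divisible by the pooling and window sizes.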

