CV · Oct 1, 2023

Sharingan: A Transformer-based Architecture for Gaze Following

arXiv:2310.00816v1 · 3 citations · h-index: 48
Originality: Incremental advance
AI Analysis

This work addresses gaze following for applications in robotics and sociology, representing an incremental improvement over prior CNN-based methods.

The paper tackles the problem of predicting where a person in an image is looking by introducing a transformer-based architecture, achieving state-of-the-art results on GazeFollow and VideoAttentionTarget datasets.

Gaze is a powerful form of non-verbal communication and social interaction that humans develop from an early age. As such, modeling this behavior is an important task that can benefit a broad set of application domains ranging from robotics to sociology. In particular, Gaze Following is defined as the prediction of the pixel-wise 2D location where a person in the image is looking. Prior efforts in this direction have focused primarily on CNN-based architectures to perform the task. In this paper, we introduce a novel transformer-based architecture for 2D gaze prediction. We experiment with 2 variants: the first one retains the same task formulation of predicting a gaze heatmap for one person at a time, while the second one casts the problem as a 2D point regression and allows us to perform multi-person gaze prediction with a single forward pass. This new architecture achieves state-of-the-art results on the GazeFollow and VideoAttentionTarget datasets. The code for this paper will be made publicly available.
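The two output formulations mentioned in the abstract (a per-person gaze heatmap versus a directly regressed 2D point) can be related with a small sketch. This is only an illustration of the output representations, not the paper's model: the helper names, the 64x64 heatmap resolution, and the Gaussian width are all assumptions chosen for the example.

```python
import math

# Hypothetical helpers; sizes and sigma are illustrative, not taken from the paper.

def gaussian_heatmap(point, size=64, sigma=3.0):
    """Render a normalized (x, y) gaze point in [0, 1] as a size x size heatmap."""
    cx, cy = point[0] * (size - 1), point[1] * (size - 1)
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
         for x in range(size)]
        for y in range(size)
    ]

def heatmap_to_point(heatmap):
    """Decode a heatmap back to a normalized (x, y) point via its argmax."""
    size = len(heatmap)
    best_y, best_x = max(
        ((y, x) for y in range(size) for x in range(size)),
        key=lambda yx: heatmap[yx[0]][yx[1]],
    )
    return best_x / (size - 1), best_y / (size - 1)

def l2_distance(p, q):
    """Euclidean distance between two normalized 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Heatmap formulation: the target is a dense map for one person.
target = (0.25, 0.75)
decoded = heatmap_to_point(gaussian_heatmap(target))

# Point-regression formulation: the target is the (x, y) pair itself,
# so a prediction can be compared against it directly.
error = l2_distance(decoded, target)
```

Decoding through the argmax quantizes the point to the heatmap grid, so `error` is small but nonzero here, whereas a regressed point is not tied to a grid resolution.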

