CV | Oct 1, 2023

Sharingan: A Transformer-based Architecture for Gaze Following

Samy Tafasca, Anshul Gupta, Jean-Marc Odobez

arXiv:2310.00816v1 | 2.83 citations | h-index: 48

Originality: Incremental advance

AI Analysis

This work addresses gaze following, with applications ranging from robotics to sociology, and represents an incremental improvement over prior CNN-based methods.

The paper tackles the problem of predicting where a person in an image is looking by introducing a transformer-based architecture, achieving state-of-the-art results on GazeFollow and VideoAttentionTarget datasets.

Gaze is a powerful form of non-verbal communication and social interaction that humans develop from an early age. As such, modeling this behavior is an important task that can benefit a broad set of application domains ranging from robotics to sociology. In particular, Gaze Following is defined as the prediction of the pixel-wise 2D location where a person in the image is looking. Prior efforts in this direction have focused primarily on CNN-based architectures to perform the task. In this paper, we introduce a novel transformer-based architecture for 2D gaze prediction. We experiment with 2 variants: the first one retains the same task formulation of predicting a gaze heatmap for one person at a time, while the second one casts the problem as a 2D point regression and allows us to perform multi-person gaze prediction with a single forward pass. This new architecture achieves state-of-the-art results on the GazeFollow and VideoAttentionTarget datasets. The code for this paper will be made publicly available.
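
The two task formulations mentioned in the abstract (a gaze heatmap versus a directly regressed 2D point) can be related with a small sketch. This is only an illustration, not the authors' implementation: the abstract does not say how heatmaps are built or decoded, so the Gaussian target, the 64x64 resolution, the sigma value, and all function names below are assumptions chosen for the example.

```python
import math

def gaze_point_to_heatmap(x, y, size=64, sigma=3.0):
    """Render a normalized gaze point (x, y in [0, 1]) as a Gaussian heatmap.

    Assumption: a square size x size grid with a Gaussian centered on the
    gaze point. The paper's actual target construction may differ.
    """
    cx, cy = x * (size - 1), y * (size - 1)
    return [
        [math.exp(-((c - cx) ** 2 + (r - cy) ** 2) / (2 * sigma ** 2))
         for c in range(size)]
        for r in range(size)
    ]

def heatmap_to_gaze_point(heatmap):
    """Decode a square heatmap to a normalized (x, y) point via its argmax."""
    size = len(heatmap)
    best_r, best_c = max(
        ((r, c) for r in range(size) for c in range(size)),
        key=lambda rc: heatmap[rc[0]][rc[1]],
    )
    return best_c / (size - 1), best_r / (size - 1)

def l2_distance(pred, target):
    """Euclidean distance between two normalized 2D points."""
    return math.hypot(pred[0] - target[0], pred[1] - target[1])

# Heatmap formulation: the point is recovered only up to grid resolution.
target = (0.25, 0.60)
heatmap = gaze_point_to_heatmap(*target)
decoded = heatmap_to_gaze_point(heatmap)

# Point-regression formulation: a model would output (x, y) directly, so the
# same distance can be computed without any decoding step.
hypothetical_regressed = (0.26, 0.58)  # made-up model output for illustration
```

In this sketch the heatmap route quantizes the prediction to the grid, while the regression route produces continuous coordinates, one per person, which is what makes a single forward pass for several people natural.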
