CV | AI | Oct 7, 2025

RGBD Gaze Tracking Using Transformer for Feature Fusion

arXiv:2510.06298v1
Originality: Synthesis-oriented
AI Analysis

This work addresses gaze estimation for human-computer interaction by combining RGBD data with Transformers, but it is incremental as it builds on existing architectures and shows mixed results.

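This thesis tackles gaze tracking using RGBD images and a Transformer for feature fusion. On ShanghaiTechGaze+, the Transformer model without a pre-trained GAN module achieves a mean Euclidean error of 30.1 mm, and replacing the Transformer with an MLP lowers this to 26.9 mm. On ETH-XGaze, the model reaches a mean angular error of 3.59° with the Transformer module and 3.26° without it, behind the 2.04° reported by the dataset authors.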

The subject of this thesis is the implementation of an AI-based Gaze Tracking system using RGBD images, which contain both color (RGB) and depth (D) information. To fuse the features extracted from the images, a module based on the Transformer architecture is used. The combination of RGBD input images and Transformers was chosen because it has not yet been investigated. Furthermore, a new dataset is created for training the AI models, as existing datasets either do not contain depth information or only contain labels for Gaze Point Estimation, which are not suitable for the task of Gaze Angle Estimation. Various model configurations are trained, validated and evaluated on a total of three different datasets. The trained models are then used in a real-time pipeline to estimate the gaze direction, and thus the gaze point, of a person in front of a computer screen. The AI model architecture used in this thesis is based on an earlier work by Lian et al. It uses a Generative Adversarial Network (GAN) to simultaneously remove depth map artifacts and extract head pose features. Lian et al. achieve a mean Euclidean error of 38.7 mm on their own dataset, ShanghaiTechGaze+. In this thesis, a model architecture with a Transformer module for feature fusion achieves a mean Euclidean error of 55.3 mm on the same dataset, but we show that using no pre-trained GAN module leads to a mean Euclidean error of 30.1 mm. Replacing the Transformer module with a Multilayer Perceptron (MLP) improves the error to 26.9 mm. These results are consistent with those on the other two datasets. On the ETH-XGaze dataset, the model achieves a mean angular error of 3.59° with the Transformer module and 3.26° without it, whereas the fundamentally different model architecture used by the dataset authors, Zhang et al., achieves a mean angular error of 2.04°. On the OTH-Gaze-Estimation dataset created for...
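The abstract reports two kinds of metric: a mean Euclidean error in millimetres for gaze points and a mean angular error in degrees for gaze directions. The following is a minimal sketch of how such metrics are commonly computed, and of how a gaze direction can be turned into a gaze point by intersecting the gaze ray with a screen plane. It is not the thesis's evaluation code. The function names, the coordinate convention, and the assumption that the screen lies in the plane z = 0 are all hypothetical choices made for illustration.

```python
import math

def mean_euclidean_error(pred_points, true_points):
    """Mean Euclidean distance between predicted and true gaze points.

    The result has the same unit as the inputs (e.g. mm on the screen).
    """
    dists = [math.dist(p, t) for p, t in zip(pred_points, true_points)]
    return sum(dists) / len(dists)

def angular_error_deg(pred_dir, true_dir):
    """Angle in degrees between two 3D gaze direction vectors."""
    dot = sum(a * b for a, b in zip(pred_dir, true_dir))
    norm = math.hypot(*pred_dir) * math.hypot(*true_dir)
    # Clamp to [-1, 1] so rounding errors cannot break acos.
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))

def mean_angular_error(pred_dirs, true_dirs):
    """Mean angular error in degrees over paired direction vectors."""
    errors = [angular_error_deg(p, t) for p, t in zip(pred_dirs, true_dirs)]
    return sum(errors) / len(errors)

def gaze_point_on_plane(origin, direction, plane_z=0.0):
    """Intersect the gaze ray origin + t * direction with the plane z = plane_z.

    Assumption for illustration: the screen lies in the plane z = plane_z.
    Returns (x, y) on that plane, or None if the ray is parallel to the
    plane or points away from it.
    """
    dz = direction[2]
    if abs(dz) < 1e-12:
        return None
    t = (plane_z - origin[2]) / dz
    if t < 0:
        return None
    return (origin[0] + t * direction[0], origin[1] + t * direction[1])
```

For example, an eye located 600 mm in front of the assumed screen plane and looking along the direction (1, 0, -2) would yield a gaze point 300 mm to the side of the point directly ahead. A small angular error therefore translates into a distance error that grows with the viewing distance.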

