Unsupervised Representation Learning for Gaze Estimation
This addresses the data scarcity issue in gaze estimation for applications like human-computer interaction, though it is an incremental improvement over existing supervised methods.
The paper tackles the problem of training accurate gaze estimation models without expensive 3D gaze annotations by proposing an unsupervised method to learn a low-dimensional gaze representation, achieving competitive results with as few as 100 calibration samples.
Although automatic gaze estimation is very important to a large variety of application areas, it is difficult to train accurate and robust gaze models, in great part due to the difficulty in collecting large and diverse data (annotating 3D gaze is expensive and existing datasets use different setups). To address this issue, our main contribution in this paper is to propose an effective approach to learn a low dimensional gaze representation without gaze annotations, which to the best of our best knowledge, is the first work to do so. The main idea is to rely on a gaze redirection network and use the gaze representation difference of the input and target images (of the redirection network) as the redirection variable. A redirection loss in image domain allows the joint training of both the redirection network and the gaze representation network. In addition, we propose a warping field regularization which not only provides an explicit physical meaning to the gaze representations but also avoids redirection distortions. Promising results on few-shot gaze estimation (competitive results can be achieved with as few as <= 100 calibration samples), cross-dataset gaze estimation, gaze network pretraining, and another task (head pose estimation) demonstrate the validity of our framework.