MTGLS: Multi-Task Gaze Estimation with Limited Supervision
This work reduces the reliance of gaze estimation on costly labeled data, which matters for computer vision applications. The contribution is incremental, since it builds on existing multi-task and unsupervised learning approaches.
The paper tackles robust gaze estimation with limited labeled data by proposing MTGLS, a multi-task framework that leverages non-annotated facial images and three auxiliary signals. It reports improvements of 6.43% on CAVE over the unsupervised state of the art and 6.59% on Gaze360 over supervised state-of-the-art methods.
Robust gaze estimation is a challenging task, even for deep CNNs, due to the lack of large-scale labeled data. Moreover, gaze annotation is a time-consuming process and requires specialized hardware setups. We propose MTGLS: a Multi-Task Gaze estimation framework with Limited Supervision, which leverages abundantly available non-annotated facial image data. MTGLS distills knowledge from off-the-shelf facial image analysis models and learns strong feature representations of human eyes, guided by three complementary auxiliary signals: (a) the line of sight of the pupil (i.e., pseudo-gaze) defined by the localized facial landmarks, (b) the head pose given by Euler angles, and (c) the orientation of the eye patch (left/right eye). To overcome inherent noise in the supervisory signals, MTGLS further incorporates a noise distribution modelling approach. Our experimental results show that MTGLS learns highly generalized representations which consistently perform well on a range of datasets. Our proposed framework outperforms the unsupervised state of the art on CAVE (by 6.43%) and even supervised state-of-the-art methods on Gaze360 (by 6.59%).
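To make the role of the three auxiliary signals concrete, the following is a minimal sketch of how a multi-task objective over them could be combined. It is not the formulation used by MTGLS: the abstract does not specify the loss functions, the task weights, or the noise model, so the L1 and cross-entropy terms, the equal default weights, and all names (`l1_loss`, `binary_cross_entropy`, `multitask_loss`, and the dictionary keys) are assumptions chosen for illustration. Noise modelling is omitted.

```python
import math


def l1_loss(pred, target):
    """Mean absolute error between two equal-length sequences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def binary_cross_entropy(prob, label, eps=1e-7):
    """Cross-entropy for one binary label; prob is P(label == 1)."""
    # Clip the probability so that log() never receives 0.
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def multitask_loss(pred, target, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of three auxiliary losses (hypothetical formulation).

    Expected keys in both dictionaries (all names are illustrative):
      "pseudo_gaze": 2 values, e.g. yaw and pitch derived from landmarks
      "head_pose":   3 Euler angles
      "is_left_eye": predicted probability (pred) or 0/1 label (target)
    """
    gaze = l1_loss(pred["pseudo_gaze"], target["pseudo_gaze"])
    pose = l1_loss(pred["head_pose"], target["head_pose"])
    side = binary_cross_entropy(pred["is_left_eye"], target["is_left_eye"])
    w_gaze, w_pose, w_side = weights
    return w_gaze * gaze + w_pose * pose + w_side * side


# Usage with made-up predictions and pseudo-labels for one eye patch.
prediction = {"pseudo_gaze": [0.1, 0.2], "head_pose": [0.0, 0.0, 0.0], "is_left_eye": 0.5}
pseudo_label = {"pseudo_gaze": [0.0, 0.0], "head_pose": [0.0, 0.0, 0.0], "is_left_eye": 1}
total = multitask_loss(prediction, pseudo_label)
```

In this sketch the pseudo-labels would come from off-the-shelf face analysis models rather than from human annotation, which is why a real system would need some mechanism for handling label noise, as the abstract notes.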