CV, May 8, 2018

Recurrent CNN for 3D Gaze Estimation using Appearance and Shape Cues

arXiv:1805.03064v3, 96 citations
Originality: Incremental advance
AI Analysis

This paper addresses gaze estimation for social signal processing and human-computer interaction, offering an incremental improvement in accuracy.

The paper tackles person- and head pose-independent 3D gaze estimation from remote cameras. It proposes a multi-modal recurrent CNN that combines the face, the eyes region, and face landmarks as separate streams to estimate gaze in still images, then feeds the per-frame features of a sequence into a recurrent module to predict gaze dynamically. The static model achieves a 14.6% improvement over the state of the art on the EYEDIAP dataset, and adding the temporal modality yields a further 4% gain.

Gaze behavior is an important non-verbal cue in social signal processing and human-computer interaction. In this paper, we tackle the problem of person- and head pose-independent 3D gaze estimation from remote cameras, using a multi-modal recurrent convolutional neural network (CNN). We propose to combine face, eyes region, and face landmarks as individual streams in a CNN to estimate gaze in still images. Then, we exploit the dynamic nature of gaze by feeding the learned features of all the frames in a sequence to a many-to-one recurrent module that predicts the 3D gaze vector of the last frame. Our multi-modal static solution is evaluated on a wide range of head poses and gaze directions, achieving a significant improvement of 14.6% over the state of the art on the EYEDIAP dataset, further improved by 4% when the temporal modality is included.
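
The data flow described in the abstract (three per-frame streams fused, then a many-to-one recurrence that outputs one 3D gaze vector for the last frame) can be illustrated with a small sketch. This is not the authors' implementation: the CNN streams are replaced by precomputed feature lists, fusion is assumed to be plain concatenation, the recurrent cell is a simple tanh recurrence standing in for whatever cell the paper uses, the weights are random and untrained, and normalizing the output to unit length is also an assumption. All names (`fuse_streams`, `many_to_one_gaze`, `rand_matrix`) are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def fuse_streams(face_feat, eyes_feat, landmark_feat):
    """Assumed fusion: concatenate the three per-frame stream features."""
    return list(face_feat) + list(eyes_feat) + list(landmark_feat)


def many_to_one_gaze(frames, W_in, W_rec, W_out):
    """Run a tanh recurrence over all frames (oldest first) and read out
    a single 3D gaze vector from the hidden state after the last frame."""
    h = [0.0] * len(W_rec)
    for face, eyes, landmarks in frames:
        x = fuse_streams(face, eyes, landmarks)
        pre = [a + b for a, b in zip(matvec(W_in, x), matvec(W_rec, h))]
        h = [math.tanh(p) for p in pre]
    g = matvec(W_out, h)
    # Assumption: report the gaze direction as a unit vector.
    norm = math.sqrt(sum(c * c for c in g)) or 1.0
    return [c / norm for c in g]


# Hypothetical feature sizes and a 4-frame sequence of random features.
rng = random.Random(0)
FACE, EYES, LANDMARKS, HIDDEN = 4, 4, 6, 8
W_in = rand_matrix(HIDDEN, FACE + EYES + LANDMARKS, rng)
W_rec = rand_matrix(HIDDEN, HIDDEN, rng)
W_out = rand_matrix(3, HIDDEN, rng)
frames = [
    (
        [rng.random() for _ in range(FACE)],
        [rng.random() for _ in range(EYES)],
        [rng.random() for _ in range(LANDMARKS)],
    )
    for _ in range(4)
]
gaze = many_to_one_gaze(frames, W_in, W_rec, W_out)
```

Passing a single-frame list to `many_to_one_gaze` reduces the sketch to a still-image estimate, which loosely mirrors the static versus temporal comparison in the abstract.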

Code Implementations: 3 repos
