Following Gaze Across Views
This work addresses the challenge of understanding human actions and intentions in videos by improving gaze following across views. The contribution is incremental: it builds on existing gaze estimation methods and adds a new dataset and model.
The paper tackles the problem of following a person's gaze across video views: given one view containing a person, it predicts where that person is looking in a second view of the scene. A new dataset, VideoGaze, is used for training and evaluation. The results suggest that the end-to-end model outperforms standard baselines and produces plausible outcomes in everyday scenarios.
Following the gaze of people inside videos is an important signal for understanding people and their actions. In this paper, we present an approach for following gaze across views by predicting where a particular person is looking throughout a scene. We collect VideoGaze, a new dataset which we use as a benchmark to both train and evaluate models. Given one view with a person in it and a second view of the scene, our model estimates a density for gaze location in the second view. A key aspect of our approach is an end-to-end model that solves the following subproblems: saliency, gaze pose, and geometric relationships between views. Although our model is supervised only with gaze, we show that it learns to solve these subproblems automatically, without direct supervision for any of them. Experiments suggest that our approach follows gaze better than standard baselines and produces plausible results for everyday situations.
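
To make the idea of a gaze density over the second view concrete, the following is a minimal sketch and not the model described in the paper. It assumes that the gaze origin and direction have already been mapped into the second view's pixel coordinates (standing in for the geometric relationship between views), and that a saliency map for the second view is available. The function names gaze_cone and gaze_density, the sharpness parameter, and the choice to combine the two signals by an element-wise product are all illustrative assumptions.

```python
import math


def gaze_cone(height, width, origin, direction, sharpness=5.0):
    """Score each pixel by its alignment with a 2D gaze ray (hypothetical helper).

    origin and direction are (x, y) pairs, assumed to be expressed in the
    second view's pixel coordinates already. direction must be non-zero.
    """
    ox, oy = origin
    dx, dy = direction
    norm = math.hypot(dx, dy)
    dx, dy = dx / norm, dy / norm
    cone = []
    for y in range(height):
        row = []
        for x in range(width):
            vx, vy = x - ox, y - oy
            dist = math.hypot(vx, vy)
            if dist == 0:
                # The gaze origin itself is never a gaze target.
                row.append(0.0)
                continue
            cos = (vx * dx + vy * dy) / dist
            # Pixels behind the person get zero; higher sharpness narrows the cone.
            row.append(max(cos, 0.0) ** sharpness)
        cone.append(row)
    return cone


def gaze_density(saliency, cone):
    """Combine saliency and cone by element-wise product, then normalize to sum to 1."""
    product = [[s * c for s, c in zip(s_row, c_row)]
               for s_row, c_row in zip(saliency, cone)]
    total = sum(sum(row) for row in product)
    if total == 0:
        # No salient pixel inside the cone: fall back to a uniform density.
        n = len(product) * len(product[0])
        return [[1.0 / n for _ in row] for row in product]
    return [[v / total for v in row] for row in product]


# Usage: a 5x5 second view with uniform saliency, person at the left edge looking right.
uniform = [[1.0] * 5 for _ in range(5)]
cone = gaze_cone(5, 5, origin=(0, 2), direction=(1, 0))
density = gaze_density(uniform, cone)
```

With uniform saliency the density is governed by the cone alone, so it concentrates along the row the person is looking down. With a saliency map that highlights a single object inside the cone, the density collapses onto that object, which is the intuition behind combining the two signals.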