CV · May 31, 2020

In the Eye of the Beholder: Gaze and Actions in First Person Video

arXiv:2006.00626v2 · 104 citations
AI Analysis

This work addresses the problem of understanding human behavior in first-person vision for applications like assistive technology or robotics, though it is incremental as it builds on existing datasets and methods.

The paper tackles the joint task of determining a person's actions and gaze from headworn camera video. It introduces the EGTEA Gaze+ dataset and proposes a deep model that uses stochastic units to model the gaze distribution and guide action recognition. The model exceeds the prior state of the art by a significant margin on EGTEA Gaze+ and also sets new state-of-the-art results on the larger EPIC-Kitchens dataset without using gaze data.

We address the task of jointly determining what a person is doing and where they are looking based on the analysis of video captured by a headworn camera. To facilitate our research, we first introduce the EGTEA Gaze+ dataset. Our dataset comes with videos, gaze tracking data, hand masks and action annotations, thereby providing the most comprehensive benchmark for First Person Vision (FPV). Moving beyond the dataset, we propose a novel deep model for joint gaze estimation and action recognition in FPV. Our method describes the participant's gaze as a probabilistic variable and models its distribution using stochastic units in a deep network. We further sample from these stochastic units, generating an attention map to guide the aggregation of visual features for action recognition. Our method is evaluated on our EGTEA Gaze+ dataset and achieves a performance level that exceeds the state-of-the-art by a significant margin. More importantly, we demonstrate that our model can be applied to a larger-scale FPV dataset, EPIC-Kitchens, even without using gaze, offering new state-of-the-art results on FPV action recognition.
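
The idea of sampling an attention map from a gaze distribution and using it to aggregate visual features can be illustrated with a small NumPy sketch. This is not the authors' implementation: the Gumbel-softmax relaxation, the function names `sample_attention_map` and `attention_pool`, the temperature value, and the tensor shapes are all assumptions chosen for illustration, since the abstract does not specify how the stochastic units are realised.

```python
import numpy as np


def sample_attention_map(logits, rng, temperature=1.0):
    """Draw one relaxed sample over spatial locations.

    Assumption: the stochastic units are approximated here with the
    Gumbel-softmax trick; the abstract does not name the mechanism.
    logits: (H, W) unnormalised gaze scores. Returns (H, W) summing to 1.
    """
    # Gumbel(0, 1) noise; the lower bound keeps log() finite.
    u = rng.uniform(low=1e-9, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    z = (logits + gumbel) / temperature
    z = z - z.max()  # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()


def attention_pool(features, attention):
    """Aggregate a (C, H, W) feature map with an (H, W) attention map.

    Returns a (C,) vector: the attention-weighted sum over locations.
    """
    return (features * attention[None, :, :]).sum(axis=(1, 2))


# Hypothetical shapes: 8 feature channels on a 7x7 spatial grid.
rng = np.random.default_rng(0)
features = rng.normal(size=(8, 7, 7))
gaze_logits = rng.normal(size=(7, 7))

attention = sample_attention_map(gaze_logits, rng, temperature=0.5)
pooled = attention_pool(features, attention)
```

In this sketch, a uniform attention map reduces `attention_pool` to ordinary average pooling, so the sampled map can be read as a gaze-driven reweighting of which locations contribute to the action features. A lower temperature concentrates each sample on fewer locations.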
