StillFast: An End-to-End Approach for Short-Term Object Interaction Anticipation
This work addresses the anticipation of object interactions in egocentric videos, with applications such as robotics and AR, and represents an incremental improvement over existing methods.
The paper tackles short-term object interaction anticipation from an egocentric viewpoint by proposing StillFast, an end-to-end architecture that processes a still image and a video to detect next-active objects, predict the interaction verb, and determine when the interaction will start. It achieves state-of-the-art results on the EGO4D dataset and ranks first on the public leaderboard of the EGO4D short-term object interaction anticipation challenge 2022.
The anticipation problem has been studied from several angles, such as predicting humans' locations, predicting hand and object trajectories, and forecasting actions and human-object interactions. In this paper, we study the short-term object interaction anticipation problem from the egocentric point of view and propose a new end-to-end architecture named StillFast. Our approach simultaneously processes a still image and a video to detect and localize next-active objects, predict the verb that describes the future interaction, and determine when the interaction will start. Experiments on the large-scale egocentric dataset EGO4D show that our method outperforms state-of-the-art approaches on the considered task. Our method is ranked first on the public leaderboard of the EGO4D short-term object interaction anticipation challenge 2022. Please see the project web page for code and additional details: https://iplab.dmi.unict.it/stillfast/.
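To make the task output concrete, the following minimal Python sketch shows one way a single prediction could be represented: a localized next-active object, the verb of the future interaction, and the time at which the interaction is expected to start. It is an illustration only and not the StillFast code. The names `StaPrediction` and `top_predictions`, the field layout, the object label field, the confidence score, and the example values are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class StaPrediction:
    """One hypothetical short-term anticipation prediction (illustrative only)."""
    box: Tuple[float, float, float, float]  # next-active object box (x1, y1, x2, y2), pixels
    noun: str                # assumed object category label
    verb: str                # verb describing the future interaction
    time_to_contact: float   # seconds until the interaction is expected to start
    score: float             # assumed confidence in [0, 1]


def top_predictions(preds: List[StaPrediction], k: int = 1,
                    min_score: float = 0.0) -> List[StaPrediction]:
    """Drop predictions below min_score and return the k most confident ones."""
    kept = [p for p in preds if p.score >= min_score]
    return sorted(kept, key=lambda p: p.score, reverse=True)[:k]


# Made-up example values, not results from the paper.
example_preds = [
    StaPrediction((120.0, 80.0, 260.0, 210.0), "cup", "take", 0.75, 0.62),
    StaPrediction((300.0, 150.0, 420.0, 330.0), "knife", "take", 1.40, 0.81),
    StaPrediction((10.0, 20.0, 90.0, 140.0), "drawer", "open", 0.30, 0.15),
]

best = top_predictions(example_preds, k=2, min_score=0.5)
for p in best:
    print(f"{p.verb} {p.noun} in {p.time_to_contact:.2f}s (score {p.score:.2f})")
```

Grouping the box, verb, and start time in one record mirrors the abstract's description of predicting all three jointly for each next-active object; how StillFast actually encodes its outputs is documented on the project web page.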