Long-Term Anticipation of Activities with Cycle Consistency
This work addresses long-term activity anticipation in video: predicting the sequence of future activities, including their durations, up to several minutes ahead. It represents an incremental improvement over existing methods.
The paper tackles the problem of long-term activity anticipation in videos by proposing an end-to-end framework that directly predicts future activities from observed frame features and uses a cycle consistency loss. It achieves state-of-the-art results on the Breakfast and 50Salads datasets.
With the success of deep learning methods in analyzing activities in videos, more attention has recently been focused on anticipating future activities. However, most of the work on anticipation either analyzes a partially observed activity or predicts the next action class. Recently, new approaches have been proposed that extend the prediction horizon up to several minutes into the future and anticipate a sequence of future activities, including their durations. While these works decouple the semantic interpretation of the observed sequence from the anticipation task, we propose a framework that anticipates future activities directly from the features of the observed frames and is trained in an end-to-end fashion. Furthermore, we introduce a cycle consistency loss over time by predicting the past activities given the predicted future. Our framework achieves state-of-the-art results on two datasets: Breakfast and 50Salads.
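
The cycle consistency idea in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact form of the loss, so everything below is an assumption made for illustration: a per-step cross-entropy for both terms, a weighted sum to combine them, and the hypothetical names `cross_entropy`, `total_loss`, and `cycle_weight`. Here `future_logits` stands for the class scores the model predicts for future steps, and `past_logits` for the scores obtained when the past activities are predicted back from that predicted future.

```python
import math


def softmax(logits):
    """Turn a list of class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits_per_step, labels):
    """Mean negative log-likelihood of the true label at each step."""
    losses = []
    for logits, label in zip(logits_per_step, labels):
        probs = softmax(logits)
        losses.append(-math.log(probs[label]))
    return sum(losses) / len(losses)


def total_loss(future_logits, future_labels, past_logits, past_labels,
               cycle_weight=1.0):
    """Anticipation loss plus a weighted cycle consistency term.

    Assumption: both terms are cross-entropies and are combined by a
    weighted sum; the abstract does not specify either choice.
    """
    # How well the predicted future matches the true future activities.
    anticipation = cross_entropy(future_logits, future_labels)
    # How well the past, predicted back from the predicted future,
    # matches the activities that were actually observed.
    cycle = cross_entropy(past_logits, past_labels)
    return anticipation + cycle_weight * cycle
```

In this sketch, a future prediction that cannot be mapped back to the observed past is penalized through the second term, which is the intuition behind consistency over time. In an end-to-end setup, both terms would be minimized jointly with respect to the model parameters.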