Human Action Sequence Classification
This work addresses the problem of accurately classifying and localizing human actions in videos, with applications such as video captioning and action segmentation. Its contribution is a strong task-specific gain rather than a foundational advance.
The paper tackles human action sequence classification from videos by using a machine translation model to output actions in chronological order. For video captioning on the Charades dataset, it improves BLEU-4 from 18.8 to 34.8 and METEOR from 19.5 to 33.6 over the previous state of the art. For action localization, it reaches 22.2 mAP without explicit start/end action annotations.
This paper classifies human action sequences from videos using a machine translation model. In contrast to classical human action classification, which outputs a set of actions, our method outputs a sequence of actions in the chronological order in which the human performs them. Our method is therefore evaluated using sequential performance measures such as Bilingual Evaluation Understudy (BLEU) scores. Action sequence classification has many applications, such as learning from demonstration, action segmentation, detection, localization, and video captioning. Furthermore, we use our model, which is trained to output action sequences, to solve downstream tasks such as video captioning and action localization. We obtain state-of-the-art results for video captioning on the challenging Charades dataset, with a BLEU-4 score of 34.8 and a METEOR score of 33.6, outperforming the previous state of the art of 18.8 and 19.5, respectively. Similarly, on ActivityNet captioning, we obtain excellent results in terms of ROUGE (20.24) and CIDEr (37.58) scores. For action localization, without using any explicit start/end action annotations, our method obtains a localization performance of 22.2 mAP, outperforming prior fully supervised methods.
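To illustrate why a sequential measure such as BLEU suits action sequences better than a set-based one, the following minimal sketch computes an unsmoothed, single-reference, sentence-level BLEU over action labels. It is not the evaluation code used in the paper. The function names `ngrams` and `bleu` and the action labels are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def ngrams(seq, n):
    """Count the n-grams (as tuples) in a sequence of action labels."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


def bleu(reference, candidate, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform weights and one reference.

    Illustrative sketch only; returns 0.0 if any n-gram precision is zero.
    """
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty discourages candidates shorter than the reference.
    if len(candidate) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)


# Hypothetical action labels; the two predictions contain the same set of actions.
reference = ["open_door", "walk", "sit", "drink", "stand"]
in_order = ["open_door", "walk", "sit", "drink", "stand"]
swapped = ["walk", "open_door", "sit", "drink", "stand"]

print(bleu(reference, in_order, max_n=2))  # perfect order
print(bleu(reference, swapped, max_n=2))   # same set, wrong order, lower score
```

Both predictions would be scored identically by a set-based measure, since they contain the same actions. BLEU penalizes the swapped prediction because two of its four bigrams do not occur in the reference.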