ReLER@ZJU Submission to the Ego4D Moment Queries Challenge 2022
This work addresses fine-grained action localization in long, multi-instance egocentric videos, a problem of interest to computer vision researchers. The contribution is incremental: it builds on existing transformer methods and adds a recurrence modification.
The authors tackled temporal action localization in long egocentric videos from the Ego4D dataset using a multi-scale transformer with a segment-level recurrence mechanism to capture long-term dependencies. The submission achieved a Recall@1, tIoU=0.5 score of 37.24 and an average mAP of 17.67, placing 3rd in the Ego4D Moment Queries Challenge.
In this report, we present the ReLER@ZJU submission to the Ego4D Moment Queries Challenge at ECCV 2022. The goal of this task is to retrieve and localize all instances of possible activities in egocentric videos. The Ego4D dataset is challenging for temporal action localization because the videos are long and each video contains multiple action instances with fine-grained action classes. To address these problems, we utilize a multi-scale transformer to classify the action categories and predict the boundaries of each instance. Moreover, to better capture long-term temporal dependencies in long videos, we propose a segment-level recurrence mechanism. Compared with directly feeding all video features to the transformer encoder, the proposed segment-level recurrence mechanism alleviates the optimization difficulties and achieves better performance. The final submission achieved a Recall@1, tIoU=0.5 score of 37.24 and an average mAP score of 17.67, and took 3rd place on the leaderboard.
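
For readers unfamiliar with the reported metric, the following is a minimal sketch of how temporal IoU (tIoU) and a simplified Recall@1 at a tIoU threshold of 0.5 can be computed. It is not the official challenge evaluation code: the function names `temporal_iou` and `recall_at_1` are hypothetical, segments are assumed to be (start, end) pairs in seconds, and each ground-truth instance is assumed to be already paired with its single top-ranked prediction. The official protocol may differ in how predictions are matched to ground truths.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments, in seconds."""
    # Overlap length; zero if the segments do not intersect.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # Union = sum of both lengths minus the overlap.
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_predictions, ground_truths, tiou_threshold=0.5):
    """Fraction of ground-truth segments whose paired top-1 prediction
    reaches the tIoU threshold (simplified: one prediction per ground truth)."""
    if not ground_truths:
        return 0.0
    hits = sum(
        temporal_iou(pred, gt) >= tiou_threshold
        for pred, gt in zip(top1_predictions, ground_truths)
    )
    return hits / len(ground_truths)


# Toy example with made-up segments (not data from the report).
ground_truths = [(10.0, 20.0), (30.0, 50.0)]
top1_predictions = [(12.0, 22.0), (45.0, 60.0)]
print(recall_at_1(top1_predictions, ground_truths))  # prints 0.5
```

In the toy example, the first prediction overlaps its ground truth with tIoU 8/12 and counts as a hit, while the second reaches only 5/30 and does not, so the recall is 0.5.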