Cross-view Action Recognition Understanding From Exocentric to Egocentric Perspective
This work addresses the problem of limited egocentric data for action recognition. Its contribution is incremental: it improves transfer learning methods across different camera views.
The paper tackles the challenge of transferring knowledge from large-scale exocentric data to egocentric action recognition by introducing a cross-view learning approach with geometric constraints and attention losses, achieving state-of-the-art performance on benchmarks like Charades-Ego and EPIC-Kitchens.
Understanding action recognition in egocentric videos has emerged as a vital research topic with numerous practical applications. Because egocentric data collection is limited in scale, learning robust deep learning-based action recognition models remains difficult. Transferring knowledge learned from large-scale exocentric data to egocentric data is challenging due to the differences between videos across views. Our work introduces a novel cross-view learning approach to action recognition (CVAR) that effectively transfers knowledge from the exocentric to the egocentric view. First, we introduce a novel geometric-based constraint into the self-attention mechanism of the Transformer, based on an analysis of the camera positions of the two views. Then, we propose a new cross-view self-attention loss, learned on unpaired cross-view data, that drives the self-attention mechanism to transfer knowledge across views. Finally, to further improve the performance of our cross-view learning approach, we present metrics that effectively measure the correlations in videos and attention maps. Experimental results on standard egocentric action recognition benchmarks, i.e., Charades-Ego, EPIC-Kitchens-55, and EPIC-Kitchens-100, show the effectiveness and state-of-the-art performance of our approach.
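The idea of a cross-view self-attention loss can be illustrated with a small sketch. The abstract does not give the loss in closed form, so everything below is an assumption made for illustration: the function names, the mean-squared distances used as correlation metrics, the scale factor `alpha`, and the specific form of the loss (attention discrepancy tied proportionally to video discrepancy) are hypothetical and are not the formulation used by CVAR.

```python
import numpy as np


def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention_map(tokens, w_q, w_k):
    # Standard scaled dot-product self-attention weights.
    # tokens: (num_tokens, dim); w_q, w_k: (dim, head_dim).
    q = tokens @ w_q
    k = tokens @ w_k
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1)


def cross_view_attention_loss(exo_tokens, ego_tokens, w_q, w_k, alpha=1.0):
    # Hypothetical loss (an assumption, not the paper's definition):
    # the gap between the two attention maps should track the gap
    # between the two videos, scaled by alpha.
    a_exo = self_attention_map(exo_tokens, w_q, w_k)
    a_ego = self_attention_map(ego_tokens, w_q, w_k)
    d_attn = np.mean((a_exo - a_ego) ** 2)             # attention-map distance
    d_video = np.mean((exo_tokens - ego_tokens) ** 2)  # video (token) distance
    return float(abs(d_attn - alpha * d_video))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_q = rng.normal(size=(8, 4))
    w_k = rng.normal(size=(8, 4))
    exo = rng.normal(size=(5, 8))  # tokens of an exocentric clip
    ego = rng.normal(size=(5, 8))  # tokens of an unpaired egocentric clip
    print(cross_view_attention_loss(exo, ego, w_q, w_k))
```

In this sketch the two clips are unpaired samples that only share a token shape, and the same projection weights are applied to both views, so the loss acts on the shared attention mechanism rather than on view-specific parameters.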