Action Recognition with Coarse-to-Fine Deep Feature Integration and Asynchronous Fusion
This work addresses action recognition in videos, a core computer vision task, and represents an incremental improvement over existing methods.
The paper tackles action recognition in computer vision by proposing a deep learning framework that improves accuracy through coarse-to-fine feature integration and asynchronous fusion, achieving state-of-the-art performance on benchmarks.
Action recognition is an important yet challenging task in computer vision. In this paper, we propose a novel deep learning-based framework for action recognition, which improves recognition accuracy by: 1) deriving more precise features for representing actions, and 2) reducing the asynchrony between different information streams. We first introduce a coarse-to-fine network, which extracts shared deep features at different action class granularities and progressively integrates them to obtain a more accurate feature representation for input actions. We further introduce an asynchronous fusion network, which fuses information from different streams by asynchronously integrating stream-wise features at different time points, thereby better leveraging the complementary information in different streams. Experimental results on action recognition benchmarks demonstrate that our approach achieves state-of-the-art performance.
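The idea of asynchronous fusion can be illustrated with a minimal sketch. It is not the network described in the paper: the function names, the appearance/motion stream naming, the offset window, the boundary clamping, and the average pooling are all assumptions made for illustration, and the pooling stands in for whatever learned integration the actual model uses. The sketch pairs a feature from one stream at time t with features from a second stream at several neighbouring time points, rather than only at the same time point.

```python
from typing import List, Sequence

Vector = List[float]


def asynchronous_fuse(
    appearance: Sequence[Sequence[float]],
    motion: Sequence[Sequence[float]],
    t: int,
    offsets: Sequence[int] = (-2, -1, 0, 1, 2),
) -> List[Vector]:
    """Pair the appearance feature at time t with motion features at t + offset.

    Both inputs are hypothetical per-time-step feature lists (one vector per step).
    Returns one concatenated vector per offset.
    """
    last = len(motion) - 1
    fused: List[Vector] = []
    for delta in offsets:
        # Clamp the index so offsets near the clip boundary stay valid (an assumption).
        s = min(max(t + delta, 0), last)
        fused.append(list(appearance[t]) + list(motion[s]))
    return fused


def average_pool(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean over vectors; a stand-in for a learned integration layer."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


if __name__ == "__main__":
    appearance = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    motion = [[10.0], [20.0], [30.0]]
    fused = asynchronous_fuse(appearance, motion, t=1, offsets=(-1, 0, 1))
    print(fused)
    print(average_pool(fused))
```

Setting `offsets=(0,)` reduces the sketch to ordinary synchronous fusion, which makes the contrast with the asynchronous case easy to inspect.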