CV LG · Jul 10, 2019

Video Action Recognition Via Neural Architecture Searching

arXiv:1907.04632v115.939 citations

Originality: Incremental advance

AI Analysis

This work addresses the need for efficient, automated neural architecture design in video analysis. It offers a method that reduces both computational burden and parameter count, though its improvement over existing search techniques is incremental.

The paper tackles the problem of automatically designing neural networks for video action recognition. It introduces a differentiable spatio-temporal network search and cuts the search cost in two ways: a temporal segment approach on the input side and pseudo 3D operators in the search space. The resulting architecture, trained from scratch, outperforms popular manual designs on UCF101 with only about 1% of their parameters.
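
To make the parameter argument concrete, here is a minimal sketch of why pseudo 3D operators are cheaper than full 3D convolutions. It assumes the common factorization of a k_t x k x k kernel into a 1 x k x k spatial convolution followed by a k_t x 1 x 1 temporal convolution. The function names and channel sizes are illustrative and are not taken from the paper, and the saving shown is for one layer only, not the overall 1% figure.

```python
def conv3d_params(c_in, c_out, k_t, k_s):
    """Weights in a full k_t x k_s x k_s 3D convolution (bias ignored)."""
    return c_in * c_out * k_t * k_s * k_s


def pseudo3d_params(c_in, c_out, k_t, k_s):
    """Weights in a 1 x k_s x k_s spatial conv followed by a k_t x 1 x 1 temporal conv.

    Assumption: the intermediate channel count equals c_out.
    """
    spatial = c_in * c_out * k_s * k_s
    temporal = c_out * c_out * k_t
    return spatial + temporal


# Hypothetical layer: 64 input channels, 64 output channels, 3 x 3 x 3 kernel.
full = conv3d_params(64, 64, 3, 3)
factored = pseudo3d_params(64, 64, 3, 3)
print(full, factored, round(factored / full, 3))
```

For this layer the factored form needs 12/27 of the weights of the full 3D kernel, and the gap widens as the kernel grows.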

Deep neural networks have achieved great success in video analysis and understanding. However, designing a high-performance neural architecture requires substantial effort and expertise. In this paper, we make the first attempt to let an algorithm automatically design neural networks for video action recognition. Specifically, a spatio-temporal network is developed in a differentiable space modeled by a directed acyclic graph, so that a gradient-based strategy can be used to search for an optimal architecture. The search is nonetheless computationally expensive, because evaluating each architecture candidate remains costly. To alleviate this issue on the input side, we introduce a temporal segment approach that reduces the computational cost without losing global video information. On the architecture side, we explore an efficient search space built from pseudo 3D operators. Experiments show that, under the training-from-scratch protocol, our architecture outperforms popular neural architectures on the challenging UCF101 dataset, surprisingly with only around one percent of the parameters of its manually designed counterparts.
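
The two ideas in the abstract can be illustrated with a small, framework-free sketch. It assumes a temporal-segment scheme that draws one frame from each of several equal segments, and a DARTS-style continuous relaxation in which each edge of the graph outputs a softmax-weighted sum of candidate operators. Both are assumptions about the general technique rather than the paper's exact formulation, and the scalar "operators" below are toy stand-ins for real spatio-temporal layers.

```python
import math
import random


def sample_segment_indices(num_frames, num_segments, rng=None):
    """Split [0, num_frames) into equal segments and draw one frame index from each."""
    if num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    rng = rng or random.Random(0)
    bounds = [i * num_frames // num_segments for i in range(num_segments + 1)]
    return [rng.randrange(lo, hi) for lo, hi in zip(bounds, bounds[1:])]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_op(x, alphas, candidate_ops):
    """Continuous relaxation of one edge: softmax-weighted sum of all candidate ops."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, candidate_ops))


def derive_op(alphas, names):
    """After the search, keep the candidate with the largest architecture weight."""
    best = max(range(len(alphas)), key=lambda i: alphas[i])
    return names[best]


# Toy candidates standing in for operators such as identity, zero, or a convolution.
names = ["identity", "zero", "double"]
ops = [lambda x: x, lambda x: 0.0, lambda x: 2.0 * x]

indices = sample_segment_indices(num_frames=30, num_segments=3, rng=random.Random(0))
uniform_out = mixed_op(3.0, [0.0, 0.0, 0.0], ops)
chosen = derive_op([0.1, -1.0, 2.5], names)
print(indices, uniform_out, chosen)
```

Because the mixture weights are differentiable in the architecture parameters, a gradient-based optimizer can update them alongside the network weights. Sampling a few frames per video keeps each candidate evaluation cheap while still covering the whole clip.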
