Long-short Term Motion Feature for Action Classification and Retrieval
This work improves video action classification and retrieval by enhancing local descriptors to better handle varying motion speeds, though it is incremental as it builds upon existing local descriptor methods.
The paper tackles the problem of representing motion in videos for classification and retrieval by addressing the limitation of fixed-size video blocks in local descriptor methods, which fail to cover actions with varying speeds. The proposed long-short term motion feature uses multiple block lengths to handle speed variance, achieving state-of-the-art results on several benchmark datasets.
We propose a method for representing motion information for video classification and retrieval. We improve upon local-descriptor-based methods, which have been among the most popular and successful models for representing videos. The desired local descriptors need to satisfy two requirements: 1) to be representative, 2) to be discriminative. Therefore, they need to occur frequently enough in the videos and to be able to tell the difference among different types of motions. To generate such local descriptors, the video blocks they are based on must contain just the right amount of motion information. However, current state-of-the-art local descriptor methods use video blocks with a single fixed size, which is insufficient for covering actions with varying speeds. In this paper, we introduce a long-short term motion feature that generates descriptors from video blocks with multiple lengths, thus covering motions with large speed variance. Experimental results show that, albeit simple, our model achieves state-of-the-art results on several benchmark datasets.
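The idea of drawing descriptors from video blocks of several temporal lengths can be illustrated with a small sketch. This is not the paper's implementation: the function names, the block lengths (4, 8 and 16 frames), the 8x8 spatial patch, the half-length temporal stride, and the toy frame-difference histogram used as a descriptor are all assumptions chosen to keep the example short. The video is assumed to be a grayscale array of shape (frames, height, width) with values in [0, 1].

```python
import numpy as np


def block_descriptor(block, bins=8):
    """Toy stand-in descriptor (an assumption, not the paper's feature):
    a normalized histogram of absolute frame-to-frame differences."""
    diffs = np.abs(np.diff(block.astype(np.float64), axis=0))
    hist, _ = np.histogram(diffs, bins=bins, range=(0.0, 1.0))
    total = hist.sum()
    return hist / total if total > 0 else hist.astype(np.float64)


def multi_length_descriptors(video, block_lengths=(4, 8, 16), patch=8):
    """Collect descriptors from blocks of several temporal lengths.

    Short blocks are meant to capture fast motion and long blocks slow
    motion. Every descriptor has the same dimension regardless of block
    length, so all of them could feed one shared codebook.
    """
    n_frames, height, width = video.shape
    results = []
    for length in block_lengths:
        if length > n_frames:
            continue  # this block length does not fit in the clip
        step = max(1, length // 2)  # hypothetical half-length temporal stride
        for t0 in range(0, n_frames - length + 1, step):
            for y in range(0, height - patch + 1, patch):
                for x in range(0, width - patch + 1, patch):
                    block = video[t0:t0 + length, y:y + patch, x:x + patch]
                    results.append((length, block_descriptor(block)))
    return results


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip = rng.random((16, 16, 16))  # synthetic clip, values in [0, 1)
    descriptors = multi_length_descriptors(clip)
    print(len(descriptors), descriptors[0][1].shape)
```

Because each block is summarized by a histogram with a fixed number of bins, blocks of different lengths yield descriptors of the same size. A real system would replace the toy histogram with an established local descriptor and aggregate the results, for example with a bag-of-words or similar encoding.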