CV · Dec 11, 2018

Learning Discriminative Motion Features Through Detection

arXiv:1812.04172v1 · 17 citations
Originality: Highly original
AI Analysis

This paper addresses the challenge of motion representation in video analysis, offering computer vision researchers a novel method for adapting detection models to video tasks.

The paper tackles the problem that detection models such as Faster R-CNN operate on single frames and therefore lack a mechanism for learning motion features directly from video. It proposes a training scheme that uses deformable convolutions across frames to predict human pose in one frame from the features of another. The resulting model improves pose detection and keypoint tracking, and can be applied to spatiotemporal action localization and fine-grained action recognition.

Despite huge success in the image domain, modern detection models such as Faster R-CNN have not been used nearly as much for video analysis. This is arguably due to the fact that detection models are designed to operate on single frames and as a result do not have a mechanism for learning motion representations directly from video. We propose a learning procedure that allows detection models such as Faster R-CNN to learn motion features directly from the RGB video data while being optimized with respect to a pose estimation task. Given a pair of video frames (Frame A and Frame B), we force our model to predict human pose in Frame A using the features from Frame B. We do so by leveraging deformable convolutions across space and time. Our network learns to spatially sample features from Frame B in order to maximize pose detection accuracy in Frame A. This naturally encourages our network to learn motion offsets encoding the spatial correspondences between the two frames. We refer to these motion offsets as DiMoFs (Discriminative Motion Features). In our experiments we show that our training scheme helps learn effective motion cues, which can be used to estimate and localize salient human motion. Furthermore, we demonstrate that as a byproduct, our model also learns features that lead to improved pose detection in still images, and better keypoint tracking. Finally, we show how to leverage our learned model for the tasks of spatiotemporal action localization and fine-grained action recognition.
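
The offset-based sampling described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single-channel feature map, one offset per location (a 1x1 kernel rather than a full deformable convolution), and hand-set offsets, whereas the paper's offsets are predicted by the network and learned through the pose detection loss. The names `bilinear_sample` and `sample_with_offsets` are hypothetical.

```python
import math


def bilinear_sample(feat, y, x):
    """Bilinearly interpolate a 2D feature map at fractional (y, x).

    Locations outside the map contribute zero.
    """
    h, w = len(feat), len(feat[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= yi < h and 0 <= xi < w:
                weight = (1.0 - abs(y - yi)) * (1.0 - abs(x - xi))
                value += weight * feat[yi][xi]
    return value


def sample_with_offsets(feat_b, offsets):
    """For every location p, read Frame B features at p + offset(p).

    offsets[y][x] is a (dy, dx) pair. Hand-set here (assumption);
    in the paper such offsets are learned.
    """
    h, w = len(feat_b), len(feat_b[0])
    return [
        [
            bilinear_sample(feat_b, y + offsets[y][x][0], x + offsets[y][x][1])
            for x in range(w)
        ]
        for y in range(h)
    ]


# Toy data: a single "keypoint" response sits at (row 2, col 1) in Frame A
# and has moved two pixels to the right, to (row 2, col 3), in Frame B.
H, W = 4, 5
feat_a = [[0.0] * W for _ in range(H)]
feat_b = [[0.0] * W for _ in range(H)]
feat_a[2][1] = 1.0
feat_b[2][3] = 1.0

# An offset of (0, +2) at every location encodes that motion.
offsets = [[(0.0, 2.0)] * W for _ in range(H)]

# Sampling Frame B with these offsets reproduces Frame A's layout.
warped = sample_with_offsets(feat_b, offsets)
```

In this toy case `warped` has its single response at (row 2, col 1), the keypoint location in Frame A, which is the sense in which offsets that make Frame B's features useful for predicting pose in Frame A end up encoding the correspondence between the two frames.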

