Pose from Action: Unsupervised Learning of Pose Features based on Motion
This work addresses the problem of learning pose representations without human supervision by combining motion and appearance cues.
The paper tackles unsupervised learning of pose features from videos by using motion as a signal complementary to appearance. The resulting model achieves competitive performance on pose estimation and action recognition, such as 90.2% accuracy on FLIC for pose estimation and 83.5% on UCF101 for action recognition.
Human actions are composed of sequences of poses, which makes videos of humans a rich and dense source of human poses. We propose an unsupervised method to learn pose features from videos that exploits a signal which is complementary to appearance and can be used as supervision: motion. The key idea is that humans go through poses in a predictable manner while performing actions. Hence, given two poses, it should be possible to model the motion that caused the change between them. We represent each of the poses as a feature in a CNN (Appearance ConvNet) and generate a motion encoding from optical flow maps using a separate CNN (Motion ConvNet). The data for this task is generated automatically, allowing us to train without human supervision. We demonstrate the strength of the learned representation by finetuning the trained model for pose estimation on the FLIC dataset, for static-image action recognition on PASCAL, and for action recognition in videos on UCF101 and HMDB51.
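To make the automatic data generation concrete, the following is a minimal sketch of how training samples could be assembled from raw clips without labels. It assumes, beyond what the abstract states, that the task is framed as binary verification: a pair of frames is matched either with the optical flow spanning those same frames (positive) or with flow drawn from a different clip (negative). The names `Sample` and `make_samples` and the fixed frame gap `gap` are hypothetical, and the sketch only produces frame and flow indices; computing the flow maps and running the two ConvNets is left out.

```python
import random
from typing import List, NamedTuple


class Sample(NamedTuple):
    clip_id: int     # clip the two pose frames come from
    start: int       # index of the first frame
    end: int         # index of the second frame
    flow_clip: int   # clip the optical-flow stack is taken from
    flow_start: int  # first frame index covered by the flow stack
    label: int       # 1 if the flow spans start -> end, else 0


def make_samples(clip_lengths: List[int], gap: int,
                 rng: random.Random) -> List[Sample]:
    """Build one positive and (when possible) one negative per frame pair.

    Assumption: negatives pair the frames with flow from another clip.
    """
    samples = []
    for clip_id, length in enumerate(clip_lengths):
        # Other clips long enough to supply a flow stack of the same gap.
        others = [c for c, n in enumerate(clip_lengths)
                  if c != clip_id and n > gap]
        for start in range(length - gap):
            end = start + gap
            # Positive: the flow covers exactly the frames start..end.
            samples.append(Sample(clip_id, start, end, clip_id, start, 1))
            if not others:
                continue
            # Negative: flow taken from a randomly chosen different clip.
            other = rng.choice(others)
            other_start = rng.randrange(clip_lengths[other] - gap)
            samples.append(Sample(clip_id, start, end, other, other_start, 0))
    return samples


# Two toy clips with 5 and 4 frames, pairing frames that are 2 steps apart.
samples = make_samples([5, 4], gap=2, rng=random.Random(0))
```

Under this assumed setup, the labels come for free from how each sample was built, which is what allows training without human annotation.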