Monocular Human Shape and Pose with Dense Mesh-borne Local Image Features
This work addresses a domain-specific computer vision problem with applications such as animation and AR. Its contribution is incremental, since it builds on existing graph convolution methods.
The paper tackles human shape and pose estimation from monocular images by using local image features per vertex instead of a single global feature, which improves performance and yields competitive results on standard benchmarks.
We propose to improve on graph-convolution-based approaches for human shape and pose estimation from monocular input by using pixel-aligned local image features. Given a single input color image, existing graph convolutional network (GCN) based techniques for human shape and pose estimation use a single global image feature, generated by a convolutional neural network (CNN) and appended to all mesh vertices equally, to initialize the GCN stage, which transforms a template T-posed mesh into the target pose. In contrast, we propose, for the first time, using local image features per vertex. These features are sampled from the CNN image feature maps by utilizing pixel-to-mesh correspondences generated with DensePose. Our quantitative and qualitative results on standard benchmarks show that using local features improves on global ones and leads to competitive performance with respect to the state of the art.
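To make the contrast between global and per-vertex local features concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the DensePose correspondences have already been converted into one continuous pixel location per template vertex (`vertex_xy`) plus a mask of vertices that received a match (`has_match`). The function names, the bilinear sampling, and the zero-filling of unmatched vertices are all illustrative assumptions.

```python
import numpy as np


def bilinear_sample(feature_map, xy):
    """Sample a (C, H, W) feature map at N continuous (x, y) pixel locations.

    Returns an (N, C) array of bilinearly interpolated features.
    """
    _, H, W = feature_map.shape
    x = np.clip(xy[:, 0], 0.0, W - 1.0)
    y = np.clip(xy[:, 1], 0.0, H - 1.0)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx = x - x0
    wy = y - y0
    # Indexing with two integer arrays yields (C, N) slices.
    top = feature_map[:, y0, x0] * (1 - wx) + feature_map[:, y0, x1] * wx
    bottom = feature_map[:, y1, x0] * (1 - wx) + feature_map[:, y1, x1] * wx
    return (top * (1 - wy) + bottom * wy).T


def local_vertex_inputs(template_vertices, feature_map, vertex_xy, has_match):
    """Per-vertex GCN input: template position plus a locally sampled feature.

    Assumption: vertices without a correspondence get a zero feature.
    """
    local = bilinear_sample(feature_map, vertex_xy)
    local[~has_match] = 0.0
    return np.concatenate([template_vertices, local], axis=1)


def global_vertex_inputs(template_vertices, global_feature):
    """Baseline GCN input: the same global feature appended to every vertex."""
    tiled = np.tile(global_feature, (template_vertices.shape[0], 1))
    return np.concatenate([template_vertices, tiled], axis=1)


# Toy example: 2-channel 3x4 feature map, 3 template vertices.
feature_map = np.arange(24, dtype=float).reshape(2, 3, 4)
template_vertices = np.zeros((3, 3))
vertex_xy = np.array([[1.5, 0.5], [0.0, 0.0], [3.0, 2.0]])
has_match = np.array([True, True, False])

local_in = local_vertex_inputs(template_vertices, feature_map, vertex_xy, has_match)
global_in = global_vertex_inputs(template_vertices, feature_map.mean(axis=(1, 2)))
```

In this sketch both variants produce an input of shape (number of vertices, 3 + C). The only difference is whether the C feature channels are identical across vertices or depend on where each vertex projects into the image.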