Joint Viewpoint and Keypoint Estimation with Real and Synthetic Data
This work addresses the time-consuming annotation of object keypoints for computer vision applications, offering an incremental improvement by combining tasks and using synthetic data.
The paper tackles the joint estimation of object viewpoints and keypoints. It proposes a convolutional neural network that leverages the correlation between the two tasks to improve accuracy on both, and it introduces a synthetic dataset to reduce the annotation burden. Experiments show that the joint model outperforms methods trained independently on each task.
The estimation of viewpoints and keypoints effectively enhances object detection methods by extracting valuable traits of the object instances. Although the outputs of the two tasks differ (angles versus a list of characteristic points), both describe how the object is placed in the scene, which implies a certain level of correlation between them. Therefore, we propose a convolutional neural network that jointly computes the viewpoint and keypoints for different object categories. By training both tasks together, each task improves the accuracy of the other. Since labelling object keypoints is very time-consuming for human annotators, we also introduce a new synthetic dataset with automatically generated viewpoint and keypoint annotations. Our proposed network can also be trained on datasets that contain both viewpoint and keypoint annotations or only one of them. The experiments show that the proposed approach successfully exploits this implicit correlation between the tasks and outperforms previous techniques that are trained independently.
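To make concrete how one network could be trained on samples that carry both annotation types or only one, the following minimal sketch combines two per-task losses and skips whichever term has no label. The abstract does not specify the loss functions, so everything here is an assumption for illustration: the cross-entropy over discretized angle bins, the mean squared keypoint error, the weights `w_view` and `w_kp`, and all function names are hypothetical and are not the authors' formulation.

```python
import math
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]


def viewpoint_loss(logits: Sequence[float], target_bin: int) -> float:
    """Cross-entropy over discretized angle bins (assumed formulation)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_bin]


def keypoint_loss(pred: Sequence[Point], target: Sequence[Point]) -> float:
    """Mean squared distance between predicted and annotated keypoints
    (assumed formulation)."""
    squared = [
        (px - tx) ** 2 + (py - ty) ** 2
        for (px, py), (tx, ty) in zip(pred, target)
    ]
    return sum(squared) / len(squared)


def joint_loss(
    logits: Sequence[float],
    pred_kps: Sequence[Point],
    target_bin: Optional[int] = None,
    target_kps: Optional[Sequence[Point]] = None,
    w_view: float = 1.0,  # hypothetical task weight
    w_kp: float = 1.0,    # hypothetical task weight
) -> float:
    """Weighted sum of the two task losses; a term is skipped when its
    annotation is missing (None) for this sample."""
    total = 0.0
    if target_bin is not None:
        total += w_view * viewpoint_loss(logits, target_bin)
    if target_kps is not None:
        total += w_kp * keypoint_loss(pred_kps, target_kps)
    return total
```

Under these assumptions, a sample that has only keypoint labels contributes only the keypoint term, so datasets with partial annotations can be mixed into the same training run with fully annotated ones.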