Generalized Pose Space Embeddings for Training In-the-Wild using Analysis-by-Synthesis
This work addresses the need for more accurate and robust pose estimation in real-world scenarios, offering an incremental improvement over existing analysis-by-synthesis methods.
The paper tackles the problem of pose estimation models producing many left-right flips because of a simplistic skeleton representation, which reduces precision and inhibits 3D positioning. By introducing a more expressive intermediate skeleton representation that captures pose semantics, the approach significantly reduces flips and outperforms previous analysis-by-synthesis models on standard benchmarks.
Modern pose estimation models are trained on large, manually labelled datasets, which are costly to produce and may not cover the full range of human poses and appearances in the real world. With advances in neural rendering, analysis-by-synthesis, in which the model not only predicts the pose but also renders it, is becoming an appealing framework that could alleviate the need for large-scale manual labelling efforts. While recent work has shown the feasibility of this approach, the predictions admit many flips due to a simplistic intermediate skeleton representation, resulting in low precision and inhibiting the acquisition of downstream knowledge such as three-dimensional positioning. We solve this problem with a more expressive intermediate skeleton representation capable of capturing the semantics of the pose (left and right), which significantly reduces flips. To successfully train this new representation, we extend the analysis-by-synthesis framework with a training protocol based on synthetic data. We show that our representation results in fewer flips and more accurate predictions. Our approach outperforms previous models trained with analysis-by-synthesis on standard benchmarks.
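To make the notion of a flip concrete, the following is a minimal illustrative sketch and not the paper's method or evaluation protocol. It assumes 2D joints and a hypothetical five-joint layout (head, left/right shoulder, left/right hip); the names `LEFT_RIGHT_PAIRS`, `swap_left_right`, `mean_joint_error` and `is_flipped` are invented for this example. A prediction is flagged as flipped when exchanging its left and right joints brings it closer to the reference pose.

```python
from math import dist

# Hypothetical joint layout, assumed for illustration only:
# index 0 = head, 1/2 = left/right shoulder, 3/4 = left/right hip.
LEFT_RIGHT_PAIRS = [(1, 2), (3, 4)]


def swap_left_right(pose, pairs=LEFT_RIGHT_PAIRS):
    """Return a copy of the pose with each left/right joint pair exchanged."""
    swapped = list(pose)
    for left, right in pairs:
        swapped[left], swapped[right] = swapped[right], swapped[left]
    return swapped


def mean_joint_error(pred, target):
    """Mean Euclidean distance between corresponding joints."""
    return sum(dist(p, t) for p, t in zip(pred, target)) / len(target)


def is_flipped(pred, target, pairs=LEFT_RIGHT_PAIRS):
    """Flag a prediction whose left/right-swapped version fits the target strictly better."""
    swapped_error = mean_joint_error(swap_left_right(pred, pairs), target)
    return swapped_error < mean_joint_error(pred, target)


# Toy reference pose and two predictions: one correct, one with left and right exchanged.
target = [(0.0, 2.0), (-1.0, 1.0), (1.0, 1.0), (-0.5, 0.0), (0.5, 0.0)]
correct = list(target)
flipped = swap_left_right(target)

print(is_flipped(correct, target))
print(is_flipped(flipped, target))
```

In this toy case the flipped prediction has every joint at a plausible location, yet its per-joint error is large because the labels are exchanged. That is why a representation that cannot distinguish left from right lowers precision even when the overall body outline is correct.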