CV · Mar 22, 2021

End-to-End Trainable Multi-Instance Pose Estimation with Transformers

arXiv:2103.12115v2 · 55 citations
AI Analysis

POET provides a simpler and more efficient alternative for multi-instance pose estimation in computer vision, applicable to both human and animal pose tasks.

The authors tackled multi-instance pose estimation by proposing POET, an end-to-end trainable transformer-based method that directly predicts poses as a set, achieving high accuracy on COCO with fewer parameters and faster inference than existing approaches.

We propose an end-to-end trainable approach for multi-instance pose estimation, called POET (POse Estimation Transformer). Combining a convolutional neural network with a transformer encoder-decoder architecture, we formulate multi-instance pose estimation from images as a direct set prediction problem. Our model is able to directly regress the pose of all individuals, utilizing a bipartite matching scheme. POET is trained using a novel set-based global loss that consists of a keypoint loss, a visibility loss and a class loss. POET reasons about the relations between multiple detected individuals and the full image context to directly predict their poses in parallel. We show that POET achieves high accuracy on the COCO keypoint detection task while having fewer parameters and higher inference speed than other bottom-up and top-down approaches. Moreover, we show successful transfer learning when applying POET to animal pose estimation. To the best of our knowledge, this model is the first end-to-end trainable multi-instance pose estimation method, and we hope it will serve as a simple and promising alternative.
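To make the bipartite matching step in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The pairwise cost (mean L1 keypoint error over visible keypoints, minus the predicted person probability), the weights `w_kpt` and `w_cls`, and the dictionary fields `kpts`, `vis`, and `p_person` are all assumptions made for illustration. A brute-force search over permutations stands in for the Hungarian algorithm, so it only suits tiny sets.

```python
from itertools import permutations


def keypoint_l1(pred_kpts, gt_kpts, gt_vis):
    """Mean L1 distance over keypoints marked visible in the ground truth."""
    total, count = 0.0, 0
    for (px, py), (gx, gy), visible in zip(pred_kpts, gt_kpts, gt_vis):
        if visible:
            total += abs(px - gx) + abs(py - gy)
            count += 1
    return total / count if count else 0.0


def match_cost(pred, gt, w_kpt=1.0, w_cls=1.0):
    """Hypothetical pairwise cost: keypoint error minus person probability."""
    kpt_term = keypoint_l1(pred["kpts"], gt["kpts"], gt["vis"])
    return w_kpt * kpt_term - w_cls * pred["p_person"]


def bipartite_match(preds, gts):
    """Find the one-to-one assignment with the lowest total cost.

    Returns (assignment, cost), where assignment[j] is the index of the
    prediction matched to gts[j]. Assumes len(preds) >= len(gts).
    """
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(match_cost(preds[i], gts[j]) for j, i in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return best, best_cost


# Toy example: three predicted poses, two annotated individuals,
# two keypoints each, coordinates normalized to [0, 1].
preds = [
    {"kpts": [(0.9, 0.9), (0.8, 0.8)], "p_person": 0.9},
    {"kpts": [(0.1, 0.1), (0.2, 0.2)], "p_person": 0.8},
    {"kpts": [(0.5, 0.5), (0.5, 0.5)], "p_person": 0.1},
]
gts = [
    {"kpts": [(0.1, 0.1), (0.2, 0.25)], "vis": [1, 1]},
    {"kpts": [(0.9, 0.9), (0.0, 0.0)], "vis": [1, 0]},
]

assignment, cost = bipartite_match(preds, gts)
unmatched = [i for i in range(len(preds)) if i not in assignment]
print(assignment, unmatched)
```

In this toy case the first annotated individual is matched to prediction 1 and the second to prediction 0, leaving prediction 2 unmatched. In a set-prediction setup of this kind, unmatched predictions would typically be trained toward a "no object" class, while the keypoint and visibility terms would apply only to matched pairs.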

Code Implementations · 2 repos
