CV · AI · HC · LG · RO · Oct 24, 2022

Video based Object 6D Pose Estimation using Transformers

arXiv:2210.13540v2 · 11 citations · h-index: 22 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses real-time object pose estimation for applications like robotics and AR, but it is incremental as it builds on existing Transformer methods.

The paper tackles 6D object pose estimation in videos by introducing VideoPose, a Transformer-based framework that uses temporal information from previous frames for pose refinement. On the YCB-Video dataset it performs on par with state-of-the-art Transformer methods and significantly better than CNN-based approaches, while running at 33 fps.

We introduce VideoPose, a Transformer-based 6D object pose estimation framework comprising an end-to-end attention-based modelling architecture that attends to previous frames in order to estimate accurate 6D object poses in videos. Our approach leverages the temporal information from a video sequence for pose refinement, while remaining computationally efficient and robust. Compared to existing methods, our architecture captures and reasons over long-range dependencies efficiently, iteratively refining its estimates over video sequences. Experimental evaluation on the YCB-Video dataset shows that our approach is on par with state-of-the-art Transformer methods and performs significantly better than CNN-based approaches. Further, at a speed of 33 fps, it is also more efficient and therefore suited to applications that require real-time object pose estimation. Training code and pretrained models are available at https://github.com/ApoorvaBeedu/VideoPose
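To make "attends to previous frames" concrete, here is a minimal sketch of scaled dot-product attention from the current frame to a short window of earlier frames. It is not the VideoPose architecture: the function name `attend_to_previous_frames`, the `window` parameter, and the use of raw per-frame feature vectors as queries, keys, and values (with no learned projections) are all assumptions made for illustration.

```python
import math


def attend_to_previous_frames(frame_features, current_index, window=4):
    """Hypothetical helper: fuse the current frame's feature vector with
    up to `window` earlier frames using scaled dot-product attention.

    frame_features: list of equal-length lists of floats, one per frame.
    Returns (weights, fused), where weights sum to 1 over the context.
    """
    query = frame_features[current_index]
    start = max(0, current_index - window)
    # Context is causal: earlier frames plus the current one, never future frames.
    context = frame_features[start:current_index + 1]

    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in context]

    # Numerically stable softmax over the attention scores.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the context features (values are the features themselves).
    fused = [
        sum(w * value[d] for w, value in zip(weights, context))
        for d in range(len(query))
    ]
    return weights, fused
```

In a real model the fused feature would feed a pose head that predicts rotation and translation; here it only shows how earlier frames that resemble the current one receive more weight.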

Code Implementations: 1 repo