VideoPose: Estimating 6D object pose from videos
The method enables real-time 6D object pose estimation for robotics and AR applications, although the contribution is incremental: it builds on existing 2D detection and temporal modeling methods.
The paper tackles 6D object pose estimation from videos by exploiting temporal information with a CNN-RNN architecture, reaching accuracy on par with the state of the art on the YCB-Video dataset while running in real time at 30 fps.
We introduce a simple yet effective algorithm that uses convolutional neural networks to directly estimate object poses from videos. Our approach leverages the temporal information in a video sequence, and is computationally efficient and robust enough to support robotic and AR domains. Our proposed network takes the output of a pre-trained 2D object detector as input, and aggregates visual features through a recurrent neural network to make predictions at each frame. Experimental evaluation on the YCB-Video dataset shows that our approach is on par with state-of-the-art algorithms. Further, with a speed of 30 fps, it is also more efficient than the state of the art, and is therefore applicable to a variety of applications that require real-time object pose estimation.
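The recurrent aggregation described above can be illustrated with a minimal sketch. This is not the paper's network. It assumes per-frame feature vectors have already been extracted from the detected object region, and it uses a plain Elman-style recurrence with random, untrained weights. The class name RecurrentPoseHead, the feature and hidden sizes, and the pose parameterization (unit quaternion plus 3D translation) are all hypothetical choices made for illustration.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a real model would learn these."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class RecurrentPoseHead:
    """Hypothetical recurrent head: per-frame features in, per-frame 6D pose out."""

    def __init__(self, feat_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        self.W_in = rand_matrix(hidden_dim, feat_dim, rng)
        self.W_rec = rand_matrix(hidden_dim, hidden_dim, rng)
        # 7 outputs: 4 for a rotation quaternion, 3 for a translation.
        self.W_out = rand_matrix(7, hidden_dim, rng)

    def step(self, feat, hidden):
        """Fold one frame's features into the hidden state and predict a pose."""
        a = matvec(self.W_in, feat)
        b = matvec(self.W_rec, hidden)
        hidden = [math.tanh(x + y) for x, y in zip(a, b)]
        out = matvec(self.W_out, hidden)
        quat, trans = out[:4], out[4:]
        norm = math.sqrt(sum(q * q for q in quat))
        # Normalize so the quaternion is a valid rotation; fall back to identity.
        quat = [q / norm for q in quat] if norm > 0.0 else [1.0, 0.0, 0.0, 0.0]
        return (quat, trans), hidden

    def run(self, features):
        """Process a sequence of per-frame features, emitting one pose per frame."""
        hidden = [0.0] * self.hidden_dim
        poses = []
        for feat in features:
            pose, hidden = self.step(feat, hidden)
            poses.append(pose)
        return poses


# Stand-in features for a 5-frame clip (a CNN would produce these in practice).
rng = random.Random(1)
clip_features = [[rng.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(5)]
clip_poses = RecurrentPoseHead(feat_dim=8, hidden_dim=16).run(clip_features)
```

Because the hidden state is carried from frame to frame, each prediction depends on the frames before it rather than on the current frame alone. For context on the speed claim, 30 fps leaves a budget of 1/30 s, roughly 33 ms, for all per-frame processing.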