iNeRF: Inverting Neural Radiance Fields for Pose Estimation
This work addresses the problem of 6DoF pose estimation for cameras relative to 3D objects or scenes, particularly for scenarios where object mesh models are unavailable, which is significant for robotics and augmented reality applications.
This paper introduces iNeRF, a mesh-free pose estimation framework that inverts a Neural Radiance Field (NeRF) to determine camera translation and rotation from a single RGB image. The method uses gradient descent to minimize the difference between pixels rendered from a NeRF and observed image pixels, demonstrating its ability to improve NeRF performance on complex real-world scenes and perform category-level object pose estimation.
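One way to write the objective described above is sketched below. The notation is an assumption introduced here for illustration: $T \in SE(3)$ is the camera pose, $\Theta$ the fixed weights of the trained NeRF, $\mathcal{R}$ a batch of sampled rays, $\hat{C}(r; T, \Theta)$ the color rendered along ray $r$, and $C(r)$ the observed pixel color. The squared L2 residual is likewise an assumed choice of photometric error.

```latex
\hat{T} \;=\; \operatorname*{arg\,min}_{T \in SE(3)} \;
\sum_{r \in \mathcal{R}} \bigl\lVert \hat{C}(r;\, T, \Theta) - C(r) \bigr\rVert_2^2
```

Only the pose $T$ is updated by gradient descent; the NeRF weights $\Theta$ stay fixed during pose estimation.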
We present iNeRF, a framework that performs mesh-free pose estimation by "inverting" a Neural Radiance Field (NeRF). NeRFs have been shown to be remarkably effective for the task of view synthesis, that is, synthesizing photorealistic novel views of real-world scenes or objects. In this work, we investigate whether we can apply analysis-by-synthesis via NeRF for mesh-free, RGB-only 6DoF pose estimation: given an image, find the translation and rotation of a camera relative to a 3D object or scene. Our method assumes that no object mesh models are available during either training or test time. Starting from an initial pose estimate, we use gradient descent to minimize the residual between pixels rendered from a NeRF and pixels in an observed image. In our experiments, we first study 1) how to sample rays during pose refinement for iNeRF to collect informative gradients and 2) how different batch sizes of rays affect iNeRF on a synthetic dataset. We then show that for complex real-world scenes from the LLFF dataset, iNeRF can improve NeRF by estimating the camera poses of novel images and using these images as additional training data for NeRF. Finally, we show iNeRF can perform category-level object pose estimation, including object instances not seen during training, with RGB images by inverting a NeRF model inferred from a single view.
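The refinement loop described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. A small analytic color field (`toy_field`) stands in for a trained NeRF, central finite differences stand in for backpropagation through the renderer, and rays are sampled uniformly at random, whereas the paper studies more informative sampling strategies. All function names, the axis-angle pose parameterization, and the hyperparameters are hypothetical choices made for this example.

```python
import numpy as np

# Hypothetical stand-in for trained NeRF weights: color = 0.5 + 0.5*sin(A p + b).
A = np.array([[1.0, 0.2, 0.0], [0.0, 1.0, 0.3], [0.4, 0.0, 1.0]])
b = np.array([0.1, 0.7, -0.4])

def rodrigues(w):
    """Axis-angle vector (3,) -> rotation matrix (3, 3)."""
    theta = np.linalg.norm(w)
    K = np.array([[0.0, -w[2], w[1]], [w[2], 0.0, -w[0]], [-w[1], w[0], 0.0]])
    if theta < 1e-8:
        return np.eye(3) + K  # first-order approximation near zero rotation
    return (np.eye(3) + np.sin(theta) / theta * K
            + (1.0 - np.cos(theta)) / theta**2 * (K @ K))

def toy_field(points):
    """Smooth RGB value in [0, 1] for each 3D point, shape (N, 3) -> (N, 3)."""
    return 0.5 + 0.5 * np.sin(points @ A.T + b)

def render(pose, dirs, depths):
    """One color per ray: average the field over samples along the ray."""
    R, t = rodrigues(pose[:3]), pose[3:]
    world_dirs = dirs @ R.T                                    # (N, 3)
    pts = t + depths[None, :, None] * world_dirs[:, None, :]   # (N, S, 3)
    colors = toy_field(pts.reshape(-1, 3)).reshape(len(dirs), len(depths), 3)
    return colors.mean(axis=1)

def loss(pose, dirs, depths, observed):
    """Mean squared residual between rendered and observed pixel colors."""
    return np.mean((render(pose, dirs, depths) - observed) ** 2)

def numeric_grad(f, x, eps=1e-5):
    """Central finite differences (a stand-in for automatic differentiation)."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2.0 * eps)
    return g

def refine_pose(init_pose, dirs, depths, observed,
                lr=0.05, steps=200, batch_size=32, seed=0):
    """Gradient descent on the 6-vector pose [axis-angle, translation]."""
    rng = np.random.default_rng(seed)
    pose = np.asarray(init_pose, dtype=float).copy()
    for _ in range(steps):
        # Sample a batch of rays (pixels) for this step.
        idx = rng.choice(len(dirs), size=batch_size, replace=False)
        f = lambda p: loss(p, dirs[idx], depths, observed[idx])
        pose -= lr * numeric_grad(f, pose)
    return pose

# Synthetic setup: an 8x8 grid of camera rays looking down the -z axis.
u, v = np.meshgrid(np.linspace(-0.5, 0.5, 8), np.linspace(-0.5, 0.5, 8))
dirs = np.stack([u.ravel(), v.ravel(), -np.ones(u.size)], axis=-1)
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
depths = np.linspace(1.0, 2.0, 8)

true_pose = np.array([0.10, -0.05, 0.08, 0.20, -0.10, 0.15])
observed = render(true_pose, dirs, depths)   # the "observed image"
init_pose = true_pose + np.array([0.15, -0.10, 0.10, 0.10, 0.10, -0.10])

refined = refine_pose(init_pose, dirs, depths, observed)
print("loss before:", loss(init_pose, dirs, depths, observed))
print("loss after: ", loss(refined, dirs, depths, observed))
```

The "observed image" here is rendered from a known pose, so the photometric loss can be compared before and after refinement. In a real setting the renderer would be the trained NeRF, and the gradient with respect to the pose would come from automatic differentiation rather than finite differences.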