Rigidity Preserving Image Transformations and Equivariance in Perspective
This work addresses the problem of improving 3D inference accuracy in computer vision for applications such as robotics and AR/VR, though the contribution is incremental because it builds on existing CNN frameworks.
The paper aims to improve 3D inference tasks such as object pose estimation and visual localization. It observes that 2D translations of pinhole images are not rigidity preserving, and proposes modifying CNNs to be equivariant to rigidity preserving transformations instead. In experiments, this yields improvements over competitive baselines in both tasks.
We characterize the class of image plane transformations which realize rigid camera motions and call these transformations 'rigidity preserving'. In particular, 2D translations of pinhole images are not rigidity preserving. Hence, when using CNNs for 3D inference tasks, it can be beneficial to modify the inductive bias from equivariance towards translations to equivariance towards rigidity preserving transformations. We investigate how equivariance with respect to rigidity preserving transformations can be approximated in CNNs, and test our ideas on both 6D object pose estimation and visual localization. Experimentally, we improve on several competitive baselines.
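The claim that 2D translations of pinhole images are not rigidity preserving can be illustrated with a small numerical sketch. It is not the paper's implementation. It assumes a standard pinhole model with a hypothetical intrinsic matrix `K`, and it takes a pure camera rotation `R` as the rigid motion, which induces the image warp `K R K^-1`. A rigid motion leaves the angle between any two viewing rays unchanged, so the sketch compares that angle before and after each kind of transformation. All function names and numeric values are illustrative assumptions.

```python
import numpy as np

def backproject(K, px):
    """Unit viewing ray for pixel px = (u, v) under intrinsics K."""
    ray = np.linalg.inv(K) @ np.array([px[0], px[1], 1.0])
    return ray / np.linalg.norm(ray)

def ray_angle(K, p, q):
    """Angle in radians between the viewing rays of pixels p and q."""
    c = np.clip(backproject(K, p) @ backproject(K, q), -1.0, 1.0)
    return np.arccos(c)

def rotation_homography(K, R):
    """Image warp induced by rotating the camera by R: H = K R K^-1."""
    return K @ R @ np.linalg.inv(K)

def warp(H, px):
    """Apply homography H to pixel px and dehomogenize."""
    x = H @ np.array([px[0], px[1], 1.0])
    return x[:2] / x[2]

# Hypothetical intrinsics: focal length 500 px, principal point (320, 240).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

# Two pixels: the principal point and a point 100 px to its right.
p = np.array([320.0, 240.0])
q = np.array([420.0, 240.0])

# Camera rotation of 20 degrees about the vertical (y) axis.
theta = np.deg2rad(20.0)
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
H = rotation_homography(K, R)

angle_before = ray_angle(K, p, q)

# Warp induced by the rotation: the angle between viewing rays is preserved.
angle_after_rotation = ray_angle(K, warp(H, p), warp(H, q))

# Plain 2D image translation by 200 px: the angle between viewing rays changes.
shift = np.array([200.0, 0.0])
angle_after_translation = ray_angle(K, p + shift, q + shift)

print("before:           ", angle_before)
print("after rotation:   ", angle_after_rotation)
print("after translation:", angle_after_translation)
```

In this setup the rotation-induced warp keeps the angle between the two rays fixed, while the pixel shift shrinks it: equal pixel distances subtend smaller angles away from the principal point. This is one way to see why a translation-equivariant inductive bias does not match rigid camera motion.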