Consensus Learning with Deep Sets for Essential Matrix Estimation
This is an incremental improvement for computer vision tasks like structure from motion, offering a simpler solution to a known bottleneck in camera pose estimation.
The paper tackled robust essential matrix estimation by proposing a simpler Deep Sets-based network that identifies outliers and models inlier noise, achieving superior accuracy compared to more complex existing networks.
Robust estimation of the essential matrix, which encodes the relative position and orientation of two cameras, is a fundamental step in structure from motion pipelines. Recent deep-based methods achieved accurate estimation by using complex network architectures that involve graphs, attention layers, and hard pruning steps. Here, we propose a simpler network architecture based on Deep Sets. Given a collection of point matches extracted from two images, our method identifies outlier point matches and models the displacement noise in inlier matches. A weighted DLT module uses these predictions to regress the essential matrix. Our network achieves accurate recovery that is superior to existing networks with significantly more complex architectures.