MV-RoMa: From Pairwise Matching into Multi-View Track Reconstruction
For 3D vision tasks like structure-from-motion, MV-RoMa addresses the fragmentation and inconsistency of pairwise matching by providing denser and more accurate multi-view correspondences.
MV-RoMa is a multi-view dense matching model that jointly estimates correspondences across multiple images, producing geometrically consistent tracks that improve 3D reconstruction quality over pairwise methods.
Establishing consistent correspondences across images is essential for 3D vision tasks such as structure-from-motion (SfM), yet most existing matchers operate pairwise and often produce fragmented, geometrically inconsistent tracks when their predictions are chained across views. We propose MV-RoMa, a multi-view dense matching model that jointly estimates dense correspondences from a source image to multiple co-visible targets. Specifically, we design an efficient architecture that avoids the high computational cost of full cross-attention for multi-view feature interaction. It consists of (i) a multi-view encoder that leverages pairwise matching results as a geometric prior, and (ii) a multi-view matching refiner that refines correspondences using pixel-wise attention. Additionally, we propose a post-processing strategy that integrates our model's consistent multi-view correspondences as high-quality tracks for SfM. Across diverse and challenging benchmarks, MV-RoMa produces more reliable correspondences and substantially denser, more accurate 3D reconstructions than existing sparse and dense matching methods. Project page: https://icetea-cv.github.io/mv-roma/.
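
To illustrate the track inconsistency that motivates this work, the following minimal sketch chains pairwise matches into tracks with a union-find structure, a common way to build SfM tracks, and flags tracks that contain two different keypoints from the same image. It is not the MV-RoMa post-processing strategy. The function names `build_tracks` and `is_consistent`, the `(image_id, keypoint_id)` node encoding, and the toy matches are all assumptions made for illustration.

```python
from collections import defaultdict


def build_tracks(pairwise_matches):
    """Chain pairwise matches into tracks using union-find.

    pairwise_matches: iterable of ((img_a, kp_a), (img_b, kp_b)) pairs.
    Returns a list of tracks, each a sorted list of (image_id, keypoint_id).
    Hypothetical helper for illustration only.
    """
    parent = {}

    def find(x):
        # Register unseen nodes as their own root, then walk to the root
        # with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairwise_matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two chains

    groups = defaultdict(list)
    for node in list(parent):
        groups[find(node)].append(node)
    return [sorted(group) for group in groups.values()]


def is_consistent(track):
    """A track is consistent if each image contributes at most one keypoint."""
    images = [image_id for image_id, _ in track]
    return len(images) == len(set(images))


# Toy matches (assumed data): the third match links a second keypoint of
# image 0 into the same chain, which makes that track inconsistent.
matches = [
    ((0, 10), (1, 20)),
    ((1, 20), (2, 30)),
    ((0, 11), (2, 30)),
    ((3, 5), (4, 6)),
]

tracks = build_tracks(matches)
consistent = [t for t in tracks if is_consistent(t)]
inconsistent = [t for t in tracks if not is_consistent(t)]
```

In this toy input, three individually plausible pairwise matches merge into a single track that observes image 0 twice. A typical SfM pipeline has to split or discard such a track, which is one way chained pairwise predictions lead to fragmented reconstructions.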