PMatch: Paired Masked Image Modeling for Dense Geometric Matching
This work addresses the challenge of pretraining cross-frame modules for dense geometric matching, which is important for computer vision applications such as 3D reconstruction. The contribution is incremental, however, as it builds on existing masked image modeling and transformer-based methods.
The paper tackles the problem of dense geometric matching between images of the same 3D structure by introducing a paired masked image modeling pretraining method and a cross-frame global matching module, achieving state-of-the-art performance on geometric matching benchmarks.
Dense geometric matching determines the dense pixel-wise correspondence between a source and a support image that depict the same 3D structure. Prior works employ an encoder of transformer blocks to correlate the two-frame features. However, existing monocular pretraining tasks, e.g., image classification and masked image modeling (MIM), cannot pretrain the cross-frame module, yielding suboptimal performance. To resolve this, we reformulate MIM from reconstructing a single masked image to reconstructing a pair of masked images, enabling pretraining of the transformer module. Additionally, we incorporate a decoder into pretraining for improved upsampling results. Further, to be robust to textureless areas, we propose a novel cross-frame global matching module (CFGM). Since most textureless areas are planar surfaces, we propose a homography loss to further regularize its learning. Combined, these components achieve state-of-the-art (SoTA) performance on geometric matching. Code and models are available at https://github.com/ShngJZ/PMatch.
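To make the paired reformulation of MIM concrete, the following is a minimal NumPy sketch of the data side of such a pretraining step: both images of a pair are masked independently at the patch level, and the reconstruction error is measured only on the hidden pixels. It is an illustration under stated assumptions, not the released implementation. The function names (`random_patch_mask`, `mask_pair`, `reconstruction_loss`), the patch size, the mask ratio, zero-filling of hidden patches, and the mean-squared-error objective are all hypothetical choices; the network that would predict the hidden content is omitted.

```python
import numpy as np


def random_patch_mask(height, width, patch, ratio, rng):
    """Return a boolean (height, width) mask; True marks hidden pixels."""
    assert height % patch == 0 and width % patch == 0
    gh, gw = height // patch, width // patch
    n_masked = int(round(ratio * gh * gw))
    flat = np.zeros(gh * gw, dtype=bool)
    # Pick the patches to hide, without replacement.
    flat[rng.choice(gh * gw, size=n_masked, replace=False)] = True
    grid = flat.reshape(gh, gw)
    # Expand each patch-level decision to pixel resolution.
    return np.repeat(np.repeat(grid, patch, axis=0), patch, axis=1)


def mask_pair(source, support, patch=16, ratio=0.5, seed=0):
    """Mask both images of a pair independently (assumed strategy)."""
    rng = np.random.default_rng(seed)
    h, w = source.shape[:2]
    masks = [random_patch_mask(h, w, patch, ratio, rng) for _ in range(2)]
    # Hidden pixels are zero-filled here; this is an illustrative choice.
    masked = [np.where(m[..., None], 0.0, img)
              for m, img in zip(masks, (source, support))]
    return masked, masks


def reconstruction_loss(pred, target, mask):
    """Mean squared error computed on the hidden pixels only."""
    diff = (pred - target) ** 2
    return float(diff[mask].mean())
```

In a pretraining loop of this kind, a model would receive both masked images and the loss would be summed over the two reconstructions, so that recovering a hidden patch in one image can draw on visible content in the other. That is the intuition for why a cross-frame module receives a training signal that single-image MIM cannot provide.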