CV, AI, LG | Nov 28, 2022

A Light Touch Approach to Teaching Transformers Multi-view Geometry

arXiv:2211.15107v2 | 6 citations | h-index: 188
Originality: Incremental advance
AI Analysis

This paper addresses robust object retrieval under varying viewpoints, a computer vision problem. It is an incremental improvement that integrates geometric guidance into existing Transformer architectures.

The paper teaches Transformers multi-view geometry for pose-invariant object instance retrieval, a task where standard Transformers struggle because of large viewpoint differences between query and retrieved images. It reports state-of-the-art retrieval performance without requiring camera pose information at test time.

Transformers are powerful visual learners, in large part due to their conspicuous lack of manually-specified priors. This flexibility can be problematic in tasks that involve multiple-view geometry, due to the near-infinite possible variations in 3D shapes and viewpoints (requiring flexibility), and the precise nature of projective geometry (obeying rigid laws). To resolve this conundrum, we propose a "light touch" approach, guiding visual Transformers to learn multiple-view geometry but allowing them to break free when needed. We achieve this by using epipolar lines to guide the Transformer's cross-attention maps, penalizing attention values outside the epipolar lines and encouraging higher attention along these lines since they contain geometrically plausible matches. Unlike previous methods, our proposal does not require any camera pose information at test-time. We focus on pose-invariant object instance retrieval, where standard Transformer networks struggle, due to the large differences in viewpoint between query and retrieved images. Experimentally, our method outperforms state-of-the-art approaches at object retrieval, without needing pose information at test-time.
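The epipolar guidance described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes a known fundamental matrix `F` at training time, a binary mask of target pixels lying within `max_dist` of the epipolar line, and a binary cross-entropy penalty between one query pixel's cross-attention map and that mask. The function names, the `max_dist` threshold, and the choice of cross-entropy are all assumptions made for illustration; the abstract only states that attention off the epipolar lines is penalized and attention along them is encouraged.

```python
import math

def epipolar_line(F, x, y):
    """Line (a, b, c), meaning a*u + b*v + c = 0, in the second image
    for pixel (x, y) of the first image, computed as F @ [x, y, 1]."""
    p = (x, y, 1.0)
    return tuple(sum(F[r][k] * p[k] for k in range(3)) for r in range(3))

def epipolar_mask(F, x, y, width, height, max_dist=1.0):
    """Binary mask over target pixels: 1.0 if the pixel lies within
    max_dist of the epipolar line, else 0.0 (max_dist is an assumed knob)."""
    a, b, c = epipolar_line(F, x, y)
    norm = math.hypot(a, b)
    if norm == 0.0:
        # Degenerate line (e.g. query at the epipole): no constraint.
        return [[0.0] * width for _ in range(height)]
    return [
        [1.0 if abs(a * u + b * v + c) / norm <= max_dist else 0.0
         for u in range(width)]
        for v in range(height)
    ]

def epipolar_loss(attention, mask, eps=1e-7):
    """Mean binary cross-entropy between an attention map and the mask.
    Low when attention is high on the line and low elsewhere."""
    total, count = 0.0, 0
    for att_row, mask_row in zip(attention, mask):
        for att, m in zip(att_row, mask_row):
            att = min(max(att, eps), 1.0 - eps)  # avoid log(0)
            total -= m * math.log(att) + (1.0 - m) * math.log(1.0 - att)
            count += 1
    return total / count

# Assumed toy setup: a rectified pair (pure horizontal camera shift),
# for which the epipolar line of pixel (x, y) is the image row v = y.
F_RECTIFIED = [[0.0, 0.0, 0.0],
               [0.0, 0.0, -1.0],
               [0.0, 1.0, 0.0]]

mask = epipolar_mask(F_RECTIFIED, x=2, y=1, width=4, height=3, max_dist=0.5)
on_line = [[0.01] * 4, [0.99] * 4, [0.01] * 4]   # attention follows the line
off_line = [[0.99] * 4, [0.01] * 4, [0.01] * 4]  # attention ignores the line
print(epipolar_loss(on_line, mask) < epipolar_loss(off_line, mask))
```

Because the penalty is only an auxiliary training term in this sketch, `F` (and hence camera pose) is needed only while training. At inference the attention layers run unchanged, which is consistent with the abstract's statement that no pose information is required at test time.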

