CV · LG · RO · Jun 23, 2022

Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space

arXiv:2206.11895v4 · 17 citations · h-index: 49 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses viewpoint invariance in visual understanding for computer vision applications. It is an incremental advance that extends existing Transformer architectures.

The paper tackles the problem of computer vision models failing to generalize over novel camera viewpoints by proposing a 3D Token Representation Layer (3DTRL) that learns viewpoint-agnostic representations, resulting in improved performance in tasks like image classification, multi-view video alignment, and action recognition with minimal added computation.

Humans are remarkably flexible in understanding viewpoint changes because the visual cortex supports the perception of 3D structure. In contrast, most computer vision models that learn visual representations from a pool of 2D images often fail to generalize over novel camera viewpoints. Recently, vision architectures have shifted towards convolution-free visual Transformers, which operate on tokens derived from image patches. However, these Transformers do not perform explicit operations to learn viewpoint-agnostic representations for visual understanding. To this end, we propose a 3D Token Representation Layer (3DTRL) that estimates the 3D positional information of the visual tokens and leverages it for learning viewpoint-agnostic representations. The key elements of 3DTRL include a pseudo-depth estimator and a learned camera matrix to impose geometric transformations on the tokens, trained in an unsupervised fashion. These enable 3DTRL to recover the 3D positional information of the tokens from 2D patches. In practice, 3DTRL is easily plugged into a Transformer. Our experiments demonstrate the effectiveness of 3DTRL in many vision tasks including image classification, multi-view video alignment, and action recognition. The models with 3DTRL outperform their backbone Transformers in all the tasks with minimal added computation. Our code is available at https://github.com/elicassion/3DTRL.
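
The abstract's pipeline (pseudo-depth per token, a camera transform, and 3D positional information added back to the tokens) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the unit-focal-length pinhole back-projection, the softplus depth head, and the linear 3D embedding are all assumptions made for illustration, and the rotation `R` and translation `t` are passed in by hand as stand-ins for the learned camera matrix.

```python
import numpy as np

def patch_grid(h, w):
    """Return (h*w, 2) patch-centre coordinates (u, v), normalised to [-1, 1]."""
    us = (np.arange(w) + 0.5) / w * 2.0 - 1.0
    vs = (np.arange(h) + 0.5) / h * 2.0 - 1.0
    uu, vv = np.meshgrid(us, vs)  # default 'xy' indexing, each of shape (h, w)
    return np.stack([uu.ravel(), vv.ravel()], axis=-1)

def rotation_from_yaw(theta):
    """Rotation matrix about the y axis (hypothetical helper for the demo)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def recover_tokens_3d(tokens, h, w, w_depth, R, t, w_embed):
    """Add a 3D positional signal to patch tokens.

    tokens:  (N, D) patch tokens with N == h * w
    w_depth: (D,)   weights of an assumed linear pseudo-depth head
    R, t:    (3, 3) rotation and (3,) translation, stand-ins for the camera
    w_embed: (3, D) assumed linear embedding of 3D coordinates
    """
    uv = patch_grid(h, w)                                # (N, 2)
    # Pseudo-depth: softplus keeps every depth strictly positive.
    depth = np.logaddexp(0.0, tokens @ w_depth) + 1e-3   # (N,)
    # Back-project each patch centre with a unit-focal pinhole model (assumed).
    p_cam = np.concatenate([uv * depth[:, None], depth[:, None]], axis=-1)
    # Camera -> world: p_world = R^T (p_cam - t), written for row vectors.
    p_world = (p_cam - t) @ R
    # Inject the recovered 3D positions into the token representation.
    return tokens + p_world @ w_embed, p_world
```

A call such as `recover_tokens_3d(tokens, 14, 14, w_depth, rotation_from_yaw(0.3), np.zeros(3), w_embed)` returns tokens of unchanged shape, which is what would let a layer of this kind sit between existing Transformer blocks. In a trained model the depth head, camera parameters, and embedding would be learned end to end rather than supplied by hand.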

Code Implementations: 1 repo
