CV | Oct 25, 2021

MVT: Multi-view Vision Transformer for 3D Object Recognition

arXiv:2110.13083v1 | 60 citations
Originality: Highly original
AI Analysis

This work addresses a limitation in multi-view CNN models for 3D object recognition, offering an incremental improvement with a novel hybrid approach.

The paper tackles 3D object recognition by proposing a Multi-view Vision Transformer (MVT) that enables communication between patches from different views, achieving competitive performance on the ModelNet40 and ModelNet10 benchmarks.

Inspired by the great success of CNNs in image recognition, view-based methods apply CNNs to the projected views for 3D object understanding and achieve excellent performance. Nevertheless, multi-view CNN models cannot model communication between patches from different views, which limits their effectiveness in 3D object recognition. Inspired by the recent success of vision Transformers in image recognition, we propose a Multi-view Vision Transformer (MVT) for 3D object recognition. Since each patch feature in a Transformer block has a global receptive field, it naturally enables communication between patches from different views. Meanwhile, it carries much less inductive bias than its CNN counterparts. Considering both effectiveness and efficiency, we develop a global-local structure for our MVT. Our experiments on two public benchmarks, ModelNet40 and ModelNet10, demonstrate the competitive performance of our MVT.
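The difference between per-view processing and cross-view communication can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names (`self_attention`, `local_stage`, `global_stage`), the identity query/key/value projections, and the two-dimensional toy patch features are all assumptions made for illustration. A local stage attends only within each view, while a global stage attends over the patches of all views at once.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections.

    A real Transformer block uses learned projections, multiple heads,
    residual connections and an MLP; all of that is omitted here.
    """
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * value[j] for w, value in zip(weights, tokens))
             for j in range(dim)]
        )
    return outputs


def local_stage(views):
    """Attend within each view separately: no cross-view communication."""
    return [self_attention(view) for view in views]


def global_stage(views):
    """Attend jointly over the patches of all views, then regroup by view."""
    flat = [token for view in views for token in view]
    mixed = self_attention(flat)
    regrouped, start = [], 0
    for view in views:
        regrouped.append(mixed[start:start + len(view)])
        start += len(view)
    return regrouped


# Two toy "views", each holding two 2-D patch features (hypothetical values).
views_a = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.5, 0.5]]]
# Same first view, but the second view's patches are changed.
views_b = [[[1.0, 0.0], [0.0, 1.0]], [[-1.0, 2.0], [3.0, 0.0]]]

# Locally, view 0 cannot see the change made in view 1.
print(local_stage(views_a)[0] == local_stage(views_b)[0])
# Globally, every patch attends to every other patch, so view 0 is affected.
print(global_stage(views_a)[0] == global_stage(views_b)[0])
```

In this sketch the global stage costs attention over all patches of all views, whereas the local stage only pays for attention within one view at a time. That is one plausible reading of why a global-local structure trades off effectiveness against efficiency.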

Code Implementations: 2 repos