CV · AI · Mar 3, 2022

Recent Advances in Vision Transformer: A Survey and Outlook of Recent Work

arXiv:2203.01536v5 · 76 citations · h-index: 4 · Has Code
Originality: Synthesis-oriented
AI Analysis

The paper synthesizes recent advances in ViTs for computer vision researchers, but as a survey its contribution is incremental.

This paper provides a comprehensive survey of Vision Transformers (ViTs), comparing their performance to Convolutional Neural Networks (CNNs) on popular benchmark datasets and discussing strengths, weaknesses, and computational costs.

Vision Transformers (ViTs) are becoming a more popular and dominant technique for various vision tasks compared to Convolutional Neural Networks (CNNs). As an in-demand technique in computer vision, ViTs have successfully solved various vision problems while focusing on long-range relationships. In this paper, we begin by introducing the fundamental concepts and background of the self-attention mechanism. Next, we provide a comprehensive overview of recent top-performing ViT methods, describing them in terms of strengths and weaknesses, computational cost, and training and testing datasets. We thoroughly compare the performance of various ViT algorithms and the most representative CNN methods on popular benchmark datasets. Finally, we explore some limitations with insightful observations and provide further research directions. The project page, along with the collection of papers, is available at https://github.com/khawar512/ViT-Survey

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
