Vision Transformers: State of the Art and Research Challenges
This is an incremental review paper that synthesizes existing research to identify open challenges for researchers in computer vision.
The paper provides a comprehensive overview of vision transformers, detailing their architecture designs and training techniques for various computer vision tasks, but does not present new experimental results or concrete numbers.
Transformers have achieved great success in natural language processing. Owing to the powerful self-attention mechanism in transformers, researchers have developed vision transformers for a variety of computer vision tasks, such as image recognition, object detection, image segmentation, pose estimation, and 3D reconstruction. This paper presents a comprehensive overview of the literature on different architecture designs and training tricks (including self-supervised learning) for vision transformers. Our goal is to provide a systematic review together with open research opportunities.