Attention Is Not Everything: Efficient Alternatives for Vision
For computer vision researchers, this review provides a structured overview of alternatives to Transformers, highlighting their comparative strengths and weaknesses.
This review categorizes non-Transformer vision methods (convolution, MLP, state-space, etc.) from 40 papers, analyzing their efficiency, scalability, interpretability, and robustness to identify challenges and opportunities for future research.
Recent advances in computer vision have been driven mainly by Transformer-based models. However, many non-Transformer methods continue to perform well and compete directly with Transformer-based models. This review aims to present a comprehensive taxonomy of such methods, organizing them into categories such as convolution-based, MLP-based, and state-space-based models, among others. The methods are examined in terms of their efficiency, scalability, interpretability, and robustness. A total of 40 papers were selected for this study. The goal is to give an overview of non-Transformer methods and to identify the challenges and opportunities for future computer vision research.