Efficiency 360: Efficient Vision Transformers
This work addresses efficiency challenges in vision transformers for industrial applications, but it appears incremental as it primarily reviews and categorizes existing methods without introducing new techniques.
The paper tackles the efficiency of vision transformers for image classification by introducing the Efficiency 360 framework, which organizes efficiency-related aspects into dimensions such as privacy and robustness. It compares models by performance, parameter count, and FLOPs across datasets, though no specific numerical results are provided.
Transformers are widely used for tasks in natural language processing, computer vision, speech, and music. In this paper, we discuss the efficiency of transformers in terms of memory (the number of parameters), computation cost (the number of floating-point operations), and model performance, including accuracy, robustness, and fairness and freedom from bias. We focus mainly on the vision transformer for image classification. Our contribution is the Efficiency 360 framework, which covers various aspects of the vision transformer in order to make it more efficient for industrial applications. With those applications in mind, we categorize these aspects into multiple dimensions such as privacy, robustness, transparency, fairness, inclusiveness, continual learning, probabilistic models, approximation, computational complexity, and spectral complexity. We compare various vision transformer models based on their performance, number of parameters, and number of floating-point operations (FLOPs) on multiple datasets.
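To make the two cost measures named above concrete, the following is a minimal sketch of how parameters and floating-point work could be counted for a plain vision transformer. It is not taken from the paper. The `ViTConfig` class, its ViT-Base/16-style defaults, and the functions `count_params` and `count_macs` are hypothetical names introduced for illustration. The sketch also assumes the common convention of reporting multiply-accumulates (MACs) as FLOPs, and it counts matrix multiplies only.

```python
from dataclasses import dataclass


@dataclass
class ViTConfig:
    # Hypothetical ViT-Base/16-style defaults, chosen for illustration only.
    image_size: int = 224
    patch_size: int = 16
    in_channels: int = 3
    dim: int = 768
    depth: int = 12
    mlp_ratio: int = 4
    num_classes: int = 1000


def num_tokens(cfg: ViTConfig) -> int:
    """Sequence length: one token per patch plus one class token."""
    return (cfg.image_size // cfg.patch_size) ** 2 + 1


def count_params(cfg: ViTConfig) -> int:
    """Learnable parameters of a plain ViT encoder with a class token."""
    d = cfg.dim
    hidden = cfg.mlp_ratio * d
    patch_embed = cfg.in_channels * cfg.patch_size ** 2 * d + d
    cls_and_pos = d + num_tokens(cfg) * d  # class token + position embeddings
    block = (
        2 * d                  # LayerNorm before attention (scale, shift)
        + d * 3 * d + 3 * d    # fused query/key/value projection
        + d * d + d            # attention output projection
        + 2 * d                # LayerNorm before the MLP
        + d * hidden + hidden  # first MLP layer
        + hidden * d + d       # second MLP layer
    )
    final_norm = 2 * d
    head = d * cfg.num_classes + cfg.num_classes
    return patch_embed + cls_and_pos + cfg.depth * block + final_norm + head


def count_macs(cfg: ViTConfig) -> int:
    """Multiply-accumulates for one forward pass, matrix multiplies only.

    Norms, softmax, activations, and bias additions are ignored.
    """
    d = cfg.dim
    hidden = cfg.mlp_ratio * d
    n = num_tokens(cfg)
    patch_embed = (n - 1) * cfg.in_channels * cfg.patch_size ** 2 * d
    block = (
        n * d * 3 * d       # query/key/value projection
        + n * n * d         # attention scores, summed over all heads
        + n * n * d         # attention-weighted sum of values
        + n * d * d         # attention output projection
        + 2 * n * d * hidden  # two MLP layers
    )
    head = d * cfg.num_classes  # classifier applied to the class token only
    return patch_embed + cfg.depth * block + head


if __name__ == "__main__":
    cfg = ViTConfig()
    print(f"parameters: {count_params(cfg) / 1e6:.1f} M")
    print(f"MACs:       {count_macs(cfg) / 1e9:.1f} G")
```

Under the assumed defaults this gives roughly 86.6 million parameters. The two `n * n * d` terms are the part of the cost that grows quadratically with the number of tokens, which is the part many efficient-attention designs aim to reduce.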