MABViT -- Modified Attention Block Enhances Vision Transformers
This paper addresses a specific bottleneck in Vision Transformers for image classification: accuracy drops when the MLP and attention blocks run in parallel. The proposed fix gives an incremental improvement in both parameter efficiency and accuracy.
The paper tackles the performance decline observed when the MLP and attention blocks run in parallel for image classification. It integrates non-linearity within the attention block, which yields a 0.6% improvement over the S/16 variant of Vision Transformers on ImageNet-1K with fewer parameters and surpasses the B/16 variant with half the parameters.
Recent studies have demonstrated the effectiveness of Gated Linear Units (GLU) in enhancing transformer models, particularly in Large Language Models (LLMs). Additionally, using a parallel configuration within each Transformer block, rather than the conventional serialized one, has been shown to accelerate the training of LLMs without significantly impacting performance. However, when the MLP and attention block were run in parallel for the image classification task, we observed a noticeable decline in performance. To tackle this problem, we propose a novel transformer variant that integrates non-linearity within the attention block. We implemented the GLU-based activation function on the Value tensor, and this new technique surpasses the current state-of-the-art S/16 variant of Vision Transformers by 0.6% on the ImageNet-1K dataset while using fewer parameters. It also surpasses the B/16 variant while using only half the parameters. Furthermore, we provide results with a GELU activation function variant to confirm our assertions. Lastly, we show that the MABViT variants exhibit greater potential than the standard architecture when used in deep transformers.
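To make the idea of a GLU-based activation on the Value tensor concrete, here is a minimal single-head sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact formulation, so the sigmoid gate, the separate gate projection `w_gate`, the single head, and all function names are hypothetical choices. Normalization, the output projection, and the parallel MLP branch are omitted.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def glu_value_attention(x, w_q, w_k, w_v, w_gate):
    """Single-head self-attention with a GLU applied to the Value tensor.

    Assumption: V = (x @ w_v) * sigmoid(x @ w_gate). The gate projection
    w_gate is a hypothetical extra weight, not taken from the paper.
    """
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    v_linear = matmul(x, w_v)
    gate = matmul(x, w_gate)
    # GLU: elementwise product of the linear branch and the sigmoid gate.
    v = [[a * sigmoid(g) for a, g in zip(row_a, row_g)]
         for row_a, row_g in zip(v_linear, gate)]
    # Standard scaled dot-product attention over the gated values.
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

# Toy input: 4 tokens (image patches) with 8 features each.
rng = random.Random(0)
tokens, dim = 4, 8
x = random_matrix(tokens, dim, rng)
w_q, w_k, w_v, w_gate = (random_matrix(dim, dim, rng) for _ in range(4))
out = glu_value_attention(x, w_q, w_k, w_v, w_gate)
```

In this sketch the only change from ordinary attention is the gating of `v` before the weighted sum, so the attention branch carries a non-linearity of its own. If the gate weights are all zero, every gate equals 0.5 and the output is exactly half of the ungated attention output, which is a convenient sanity check.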