Enhancing compact convolutional transformers with super attention
This work addresses the need for efficient and high-performing vision models in fixed context-length scenarios, offering an incremental improvement over existing transformer-based methods.
The paper tackles the problem of improving vision models for fixed context-length tasks by proposing a model that combines token mixing, sequence-pooling, and convolutional tokenizers. On CIFAR100, it raises top-1 validation accuracy from 36.50% to 46.29% and top-5 validation accuracy from 66.33% to 76.31% over the baseline, while being 60% of the size of Scaled Dot Product Attention (SDPA) transformers and more efficient than them when the context length is less than the embedding dimension.
In this paper, we propose a vision model that adopts token mixing, sequence-pooling, and convolutional tokenizers to achieve state-of-the-art performance and efficient inference in fixed context-length tasks. On the CIFAR100 benchmark, our model improves on the baseline, raising top-1 validation accuracy from 36.50% to 46.29% and top-5 validation accuracy from 66.33% to 76.31%. It is also only 60% of the size of Scaled Dot Product Attention (SDPA) transformers and more efficient than them when the context length is less than the embedding dimension. In addition, the architecture demonstrates high training stability and does not rely on techniques such as data augmentation (e.g., mixup), positional embeddings, or learning rate scheduling. We make our code available on GitHub.
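To make the sequence-pooling component concrete, the following is a minimal sketch of the form commonly used in compact convolutional transformers: each token receives a scalar score from a linear map, the scores are normalised with a softmax over the sequence, and the tokens are summed with those weights. It is an assumption that our model uses exactly this form; the names `sequence_pool`, `weight`, and `bias` are hypothetical and chosen for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_pool(tokens, weight, bias=0.0):
    """Collapse a sequence of token embeddings into one vector.

    tokens: list of l token embeddings, each a list of d floats
    weight: list of d floats (hypothetical learned scoring vector)
    bias:   scalar added to every token score (hypothetical)
    """
    # One importance score per token: a linear map from d dimensions to a scalar.
    scores = [sum(x * w for x, w in zip(tok, weight)) + bias for tok in tokens]
    # Normalise the scores across the sequence so that they sum to 1.
    alphas = softmax(scores)
    d = len(tokens[0])
    # The weighted sum of the tokens is a single d-dimensional summary vector.
    return [sum(a * tok[j] for a, tok in zip(alphas, tokens)) for j in range(d)]


# Three tokens with embedding dimension 2; a zero scoring vector gives
# every token the same weight, so pooling reduces to a plain average.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled = sequence_pool(tokens, weight=[0.0, 0.0])
```

Pooling of this kind replaces a dedicated class token: the classifier head reads the pooled vector, so no extra token has to be carried through the network.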