CV LG
Feb 6, 2024

Pre-training of Lightweight Vision Transformers on Small Datasets with Minimally Scaled Images

arXiv:2402.03752v1
6.56 citations
h-index: 1

Originality: Incremental advance

AI Analysis

This work addresses the challenge of efficient vision models for resource-constrained applications, but it is incremental as it builds on existing transformer and pre-training techniques.

The paper asks whether lightweight Vision Transformers can outperform CNNs on small datasets with minimal image scaling. It reports state-of-the-art performance among comparable lightweight transformer-based architectures on CIFAR-10 and CIFAR-100, using models with fewer than 3.65 million parameters and fewer than 0.27G MACs.

Can a lightweight Vision Transformer (ViT) match or exceed the performance of Convolutional Neural Networks (CNNs) like ResNet on small datasets with small image resolutions? This report demonstrates that a pure ViT can indeed achieve superior performance through pre-training, using a masked auto-encoder technique with minimal image scaling. Our experiments on the CIFAR-10 and CIFAR-100 datasets involved ViT models with fewer than 3.65 million parameters and a multiply-accumulate (MAC) count below 0.27G, qualifying them as 'lightweight' models. Unlike previous approaches, our method attains state-of-the-art performance among similar lightweight transformer-based architectures without significantly scaling up images from CIFAR-10 and CIFAR-100. This achievement underscores the efficiency of our model, not only in handling small datasets but also in effectively processing images close to their original scale.
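
As a rough illustration of the masked auto-encoder idea mentioned in the abstract, the sketch below splits an image into patches and randomly hides most of them, so that an encoder would only see the visible subset. It is not the authors' implementation. The side length of 32 is the native CIFAR resolution, but the patch size of 4 and the mask ratio of 0.75 are assumptions chosen for illustration, and the function names are hypothetical.

```python
import random


def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def random_mask(n_patches: int, mask_ratio: float, rng: random.Random):
    """Split patch indices into (visible, masked) by random shuffling.

    In a masked auto-encoder, only the visible patches are fed to the
    encoder; a decoder is trained to reconstruct the masked ones.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    n_keep = int(n_patches * (1.0 - mask_ratio))
    order = list(range(n_patches))
    rng.shuffle(order)
    visible = sorted(order[:n_keep])
    masked = sorted(order[n_keep:])
    return visible, masked


# Assumed settings: 32x32 input (native CIFAR size), 4x4 patches, 75% masked.
IMAGE_SIZE = 32
PATCH_SIZE = 4    # assumption, not taken from the paper
MASK_RATIO = 0.75  # assumption, not taken from the paper

n = num_patches(IMAGE_SIZE, PATCH_SIZE)
visible, masked = random_mask(n, MASK_RATIO, random.Random(0))
print(f"{n} patches: {len(visible)} visible, {len(masked)} masked")
```

Under these assumed settings the image yields 64 patches, of which 16 stay visible. Keeping the input near its original resolution keeps the token count small, which is one reason the compute cost of such a model can stay low.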
