CV · LG · Feb 6, 2024

Pre-training of Lightweight Vision Transformers on Small Datasets with Minimally Scaled Images

arXiv:2402.03752v1 · 6 citations · h-index: 1
AI Analysis

This work addresses the challenge of efficient vision models for resource-constrained applications, but it is incremental as it builds on existing transformer and pre-training techniques.

The paper asks whether lightweight Vision Transformers can outperform CNNs on small datasets with minimal image scaling. It reports state-of-the-art performance among similar lightweight transformer-based architectures on CIFAR-10 and CIFAR-100, using models with fewer than 3.65 million parameters and under 0.27G MACs.

Can a lightweight Vision Transformer (ViT) match or exceed the performance of Convolutional Neural Networks (CNNs) like ResNet on small datasets with small image resolutions? This report demonstrates that a pure ViT can indeed achieve superior performance through pre-training, using a masked auto-encoder technique with minimal image scaling. Our experiments on the CIFAR-10 and CIFAR-100 datasets involved ViT models with fewer than 3.65 million parameters and a multiply-accumulate (MAC) count below 0.27G, qualifying them as 'lightweight' models. Unlike previous approaches, our method attains state-of-the-art performance among similar lightweight transformer-based architectures without significantly scaling up images from CIFAR-10 and CIFAR-100. This achievement underscores the efficiency of our model, not only in handling small datasets but also in effectively processing images close to their original scale.
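To make the masked auto-encoder idea concrete, here is a minimal sketch of the patch-splitting and random-masking step that MAE-style pre-training relies on. It is an illustration under stated assumptions, not the authors' implementation: the 32x32 input matches CIFAR's native resolution, but the patch size of 4, the mask ratio of 0.75, and the helper names `patchify` and `random_mask` are hypothetical choices made for this example.

```python
import random

def patchify(image, patch):
    """Split an H x W image (a list of rows) into non-overlapping
    patch x patch tiles, returned in row-major order."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tile = [image[r][left:left + patch] for r in range(top, top + patch)]
            patches.append(tile)
    return patches

def random_mask(num_patches, mask_ratio, rng):
    """Randomly choose which patch indices are hidden from the encoder.
    Returns (visible_ids, masked_ids), each sorted ascending."""
    num_masked = int(num_patches * mask_ratio)
    order = list(range(num_patches))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked

# Assumed settings for illustration only (not taken from the paper).
IMAGE_SIZE = 32   # CIFAR native resolution
PATCH_SIZE = 4    # hypothetical patch size
MASK_RATIO = 0.75 # hypothetical mask ratio

# A synthetic single-channel image standing in for a CIFAR sample.
image = [[(r * IMAGE_SIZE + c) % 256 for c in range(IMAGE_SIZE)]
         for r in range(IMAGE_SIZE)]

patches = patchify(image, PATCH_SIZE)
visible_ids, masked_ids = random_mask(len(patches), MASK_RATIO, random.Random(0))

# Only the visible patches would be fed to the encoder; a decoder would
# then be trained to reconstruct the pixels of the masked patches.
encoder_input = [patches[i] for i in visible_ids]
```

With these assumed settings the image yields 64 patches, of which 16 stay visible, so the encoder processes only a quarter of the tokens during pre-training.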

