CV · Jan 10, 2025

MSCViT: A Small-size ViT architecture with Multi-Scale Self-Attention Mechanism for Tiny Datasets

arXiv:2501.06040v2 · 14 citations · h-index: 2 · Neural Networks
Originality: Incremental advance
AI Analysis

This work addresses ViTs' need for large datasets to train effectively, making them more applicable in resource-constrained or data-scarce scenarios, though it is an incremental improvement over existing methods.

The paper tackles the problem of Vision Transformers (ViTs) underperforming on small datasets by proposing MSCViT, a small-size ViT with multi-scale self-attention and convolution blocks, achieving 84.68% accuracy on CIFAR-100 with 14.0M parameters and 2.5 GFLOPs without pre-training.

Vision Transformer (ViT) has demonstrated significant potential in various vision tasks due to its strong ability to model long-range dependencies. However, this success is largely fueled by training on massive numbers of samples. In real applications, large-scale datasets are not always available, and ViT performs worse than Convolutional Neural Networks (CNNs) when trained only on a small-scale dataset (called a tiny dataset), since it requires a large amount of training data to realize its representational capacity. In this paper, a small-size ViT architecture with a multi-scale self-attention mechanism and convolution blocks (dubbed MSCViT) is presented to model different scales of attention at each layer. First, we introduce wavelet convolution, which selectively combines the high-frequency components obtained by frequency division with our convolution channel to extract local features. Then, a lightweight multi-head attention module is developed to reduce the number of tokens and the computational cost. Finally, the positional encoding (PE) in the backbone is replaced by a local feature extraction module. Compared with the original ViT, MSCViT is parameter-efficient and particularly suitable for tiny datasets. Extensive experiments on tiny datasets show that our model achieves an accuracy of 84.68% on CIFAR-100 with 14.0M parameters and 2.5 GFLOPs, without pre-training on large datasets.
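To make the "frequency division" step in the abstract concrete, here is a minimal sketch of a one-level 2D wavelet decomposition that splits a feature map into one low-frequency and three high-frequency sub-bands. It assumes the Haar wavelet, which the abstract does not specify, and the function name `haar_dwt2` is hypothetical. It is not MSCViT's wavelet convolution and does not show how the model combines the high-frequency components with its convolution channel.

```python
def haar_dwt2(image):
    """One-level 2D Haar transform of a 2D list with even height and width.

    Assumption: Haar is used purely for illustration; the abstract does not
    name the wavelet. Returns four sub-bands, each half the input size:
    ll (low-pass), lh (row-to-row differences), hl (column-to-column
    differences), hh (diagonal detail). Sub-band naming varies by library.
    """
    h, w = len(image), len(image[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    ll, lh, hl, hh = [], [], [], []
    for i in range(0, h, 2):
        row_ll, row_lh, row_hl, row_hh = [], [], [], []
        for j in range(0, w, 2):
            # The 2x2 block of pixels handled at this step.
            a, b = image[i][j], image[i][j + 1]
            c, d = image[i + 1][j], image[i + 1][j + 1]
            row_ll.append((a + b + c + d) / 2)  # local average (low frequency)
            row_lh.append((a + b - c - d) / 2)  # top row minus bottom row
            row_hl.append((a - b + c - d) / 2)  # left column minus right column
            row_hh.append((a - b - c + d) / 2)  # diagonal difference
        ll.append(row_ll)
        lh.append(row_lh)
        hl.append(row_hl)
        hh.append(row_hh)
    return ll, lh, hl, hh


if __name__ == "__main__":
    # A 2x2 patch with a vertical edge: only ll and hl are non-zero.
    ll, lh, hl, hh = haar_dwt2([[0, 1], [0, 1]])
    print(ll, lh, hl, hh)  # [[1.0]] [[0.0]] [[-1.0]] [[0.0]]
```

Each sub-band has a quarter of the input's spatial positions, which is why this kind of decomposition can expose edge-like detail to a convolution branch at reduced resolution.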

