Understanding Why ViT Trains Badly on Small Datasets: An Intuitive Perspective
The paper addresses a practical problem for researchers and practitioners who use ViTs in data-limited scenarios, but its contribution is incremental: it offers an intuitive analysis rather than a new solution.
The paper investigates why Vision Transformers (ViT) perform worse than ResNet-18 on small datasets, finding that the representations of a ViT trained on small datasets differ significantly from those of a ViT trained on large datasets, which may explain the drop in accuracy.
The Vision Transformer (ViT) is an attention-based neural network architecture that has been shown to be effective for computer vision tasks. However, compared to ResNet-18, which has a similar number of parameters, ViT achieves significantly lower evaluation accuracy when trained on small datasets. To facilitate studies in related fields, we provide a visual intuition for why this is the case. We first compare the performance of the two models and confirm that ViT is less accurate than ResNet-18 when trained on small datasets. We then interpret the results by visualizing attention maps for ViT and feature maps for ResNet-18. The difference is further analyzed from a representation similarity perspective. We conclude that the representations of a ViT trained on small datasets differ substantially from those of a ViT trained on large datasets, which may explain the large drop in performance on small datasets.
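To make the representation similarity comparison concrete, the following is a minimal sketch of one widely used measure, linear centered kernel alignment (CKA). The abstract does not state which similarity measure the paper uses, so the choice of linear CKA is an assumption made here for illustration, and the function name `linear_cka` and the random activation matrices are hypothetical stand-ins for activations extracted from two trained models.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two activation matrices.

    x, y: arrays of shape (n_samples, n_features); the feature
    dimensions may differ, but the samples must be the same inputs.
    NOTE: linear CKA is an assumed choice, not confirmed by the source.
    """
    # Center each feature so the measure ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance (unnormalized).
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    # Normalizers: Frobenius norms of the two self-covariances.
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)


# Hypothetical stand-ins for layer activations on the same 256 inputs.
rng = np.random.default_rng(0)
acts_a = rng.normal(size=(256, 64))

# A rotated copy of acts_a: same representation up to an orthogonal map.
q, _ = np.linalg.qr(rng.normal(size=(64, 64)))
acts_b = acts_a @ q

# Independent random activations: an unrelated representation.
acts_c = rng.normal(size=(256, 64))

same = linear_cka(acts_a, acts_b)  # close to 1: invariant to rotation
diff = linear_cka(acts_a, acts_c)  # noticeably lower than `same`
```

In an actual comparison, `acts_a` and `acts_b` would be replaced by activations of the same layer from a ViT trained on a small dataset and a ViT trained on a large dataset, evaluated on the same batch of images; a low score would indicate that the two models have learned dissimilar representations at that layer.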