Pre-training Vision Transformers with Very Limited Synthesized Images
This work addresses the challenge of data efficiency in pre-training for computer vision, offering a more resource-friendly approach that could benefit researchers and practitioners with limited data access.
The paper tackles the pre-training of vision transformers by using only one synthetic image per category instead of multiple instances. Scaled to 21,000 categories (21k images in total), the resulting dataset yields a model that matches or surpasses, in ImageNet-1k fine-tuning, one pre-trained on ImageNet-21k, which contains 14M images.
Formula-driven supervised learning (FDSL) is a pre-training method that relies on synthetic images generated from mathematical formulae such as fractals. Prior work on FDSL has shown that pre-training vision transformers on such synthetic datasets can yield competitive accuracy on a wide range of downstream tasks. These synthetic images are categorized according to the parameters of the mathematical formula that generates them. In the present work, we hypothesize that the process for generating different instances of the same category in FDSL can be viewed as a form of data augmentation. We validate this hypothesis by replacing the instances with data augmentation, which means we only need a single image per category. Our experiments show that this one-instance fractal database (OFDB) performs better than the original dataset where instances were explicitly generated. We further scale up OFDB to 21,000 categories and show that it matches, or even surpasses, the model pre-trained on ImageNet-21k in ImageNet-1k fine-tuning. The number of images in OFDB is 21k, whereas ImageNet-21k has 14M. This opens new possibilities for pre-training vision transformers with much smaller datasets.
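The one-image-per-category idea can be illustrated with a minimal sketch. It is not the authors' implementation: the iterated function system (IFS) sampling, the 32x32 binary rendering, and the flip-and-shift augmentation are all assumptions chosen to keep the example small, and every function name below is hypothetical. The sketch only shows the structure described above: the category is defined by the formula parameters, a single image is stored per category, and instance variation comes from augmentation at sampling time.

```python
import random


def make_ifs_params(category_id, n_maps=3):
    """Hypothetical: draw affine-map coefficients (a, b, c, d, e, f) for one category.

    Entries of the 2x2 part lie in [-0.45, 0.45], so its Frobenius norm is at
    most 0.9 and every map is contractive (an assumption made for simplicity).
    """
    rng = random.Random(category_id)  # the category is defined by its parameters
    maps = []
    for _ in range(n_maps):
        linear = [rng.uniform(-0.45, 0.45) for _ in range(4)]
        shift = [rng.uniform(-1.0, 1.0) for _ in range(2)]
        maps.append(tuple(linear + shift))
    return maps


def render_fractal(params, size=32, n_points=2000, seed=0):
    """Render one binary image by iterating randomly chosen affine maps."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    points = []
    for i in range(n_points + 20):
        a, b, c, d, e, f = rng.choice(params)
        x, y = a * x + b * y + e, c * x + d * y + f
        if i >= 20:  # skip the first iterations before the orbit settles
            points.append((x, y))
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    img = [[0] * size for _ in range(size)]
    for px, py in points:
        col = min(size - 1, int((px - x0) / (x1 - x0 + 1e-12) * size))
        row = min(size - 1, int((py - y0) / (y1 - y0 + 1e-12) * size))
        img[row][col] = 1
    return img


def augment(img, rng):
    """Assumed augmentation: random horizontal flip plus a circular shift."""
    size = len(img)
    out = [row[:] for row in img]
    if rng.random() < 0.5:
        out = [row[::-1] for row in out]
    dx, dy = rng.randrange(size), rng.randrange(size)
    return [
        [out[(r - dy) % size][(c - dx) % size] for c in range(size)]
        for r in range(size)
    ]


def build_one_instance_dataset(n_categories):
    """Store exactly one image per category, keyed by category id."""
    return {k: render_fractal(make_ifs_params(k)) for k in range(n_categories)}


def sample_batch(dataset, batch_size, rng):
    """Draw (augmented image, label) pairs; variation comes only from augment()."""
    labels = [rng.randrange(len(dataset)) for _ in range(batch_size)]
    return [(augment(dataset[k], rng), k) for k in labels]


dataset = build_one_instance_dataset(5)
batch = sample_batch(dataset, batch_size=8, rng=random.Random(1))
```

Here `dataset` holds five images in total, one per category, while `sample_batch` can produce arbitrarily many distinct training views of them. A real pipeline would replace `augment` with whatever augmentation policy the training recipe uses and would feed the batches to a vision transformer.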