CV, LG, NE
Oct 25, 2023

ConvNets Match Vision Transformers at Scale

Samuel L. Smith, Andrew Brock, Leonard Berrada, Soham De

DeepMind

arXiv:2310.16764v1
14.531 citations
h-index: 26

Originality: Synthesis-oriented

AI Analysis

This work addresses model selection and efficiency for computer vision researchers and practitioners. It shows that ConvNets remain competitive at scale; the contribution is incremental, but it clarifies a key debate in the field.

The paper challenges the belief that ConvNets are inferior to Vision Transformers at scale. NFNet models pre-trained on the large JFT-4B dataset perform comparably to Vision Transformers trained with similar compute budgets, and the strongest model reaches a top-1 accuracy of 90.4% on ImageNet after fine-tuning.

Many researchers believe that ConvNets perform well on small or moderately sized datasets, but are not competitive with Vision Transformers when given access to web-scale datasets. We challenge this belief by evaluating a performant ConvNet architecture pre-trained on JFT-4B, a large labelled dataset of images often used for training foundation models. We consider pre-training compute budgets between 0.4k and 110k TPU-v4 core compute hours, and train a series of networks of increasing depth and width from the NFNet model family. We observe a log-log scaling law between held-out loss and compute budget. After fine-tuning on ImageNet, NFNets match the reported performance of Vision Transformers with comparable compute budgets. Our strongest fine-tuned model achieves a Top-1 accuracy of 90.4%.
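A log-log scaling law means that held-out loss falls on a straight line when plotted against compute on logarithmic axes, that is, loss ≈ a · compute^b with b < 0. The sketch below shows one way such a law could be fitted by ordinary least squares in log space. It is only an illustration: the function name `fit_power_law`, the coefficients, and every loss value are hypothetical and synthetic, not results from the paper. Only the endpoints of the compute range (0.4k and 110k TPU-v4 core hours) come from the abstract.

```python
import math


def fit_power_law(compute, loss):
    """Fit loss = a * compute**b by least squares on log(loss) vs log(compute).

    Returns (a, b). A straight line on log-log axes has slope b and
    intercept log(a).
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form simple linear regression in log space.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic data for illustration only (NOT numbers from the paper).
# The endpoints 400 and 110000 mirror the 0.4k-110k core-hour range in the
# abstract; the intermediate budgets and the coefficients are made up.
TRUE_A, TRUE_B = 5.0, -0.1
budgets = [400, 1600, 6400, 25600, 110000]
losses = [TRUE_A * c ** TRUE_B for c in budgets]

a, b = fit_power_law(budgets, losses)
print(f"fitted loss = {a:.3f} * compute^{b:.3f}")
```

Because the synthetic losses are generated from an exact power law, the fit recovers the chosen coefficients; real training runs would scatter around the line, and the fitted slope would then summarize how quickly loss improves as compute grows.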
