CV | Aug 30, 2021

A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP

arXiv:2108.13002v2 | 93 citations | Has Code
AI Analysis

This study gives computer vision researchers insight into the trade-offs between network structures, though its contribution is incremental because it builds on existing architectures.

The paper empirically compares CNN, Transformer, and MLP architectures for computer vision using a unified framework, finding that all achieve competitive performance at moderate scales but differ when scaled up, and proposes hybrid models that achieve 83.9% top-1 accuracy with 63M parameters.

Convolutional neural networks (CNN) are the dominant deep neural network (DNN) architecture for computer vision. Recently, Transformer and multi-layer perceptron (MLP)-based models, such as Vision Transformer and MLP-Mixer, started to lead new trends as they showed promising results in the ImageNet classification task. In this paper, we conduct empirical studies on these DNN structures and try to understand their respective pros and cons. To ensure a fair comparison, we first develop a unified framework called SPACH which adopts separate modules for spatial and channel processing. Our experiments under the SPACH framework reveal that all structures can achieve competitive performance at a moderate scale. However, they demonstrate distinctive behaviors when the network size scales up. Based on our findings, we propose two hybrid models using convolution and Transformer modules. The resulting Hybrid-MS-S+ model achieves 83.9% top-1 accuracy with 63M parameters and 12.3G FLOPS. It is already on par with the SOTA models with sophisticated designs. The code and models are publicly available at https://github.com/microsoft/SPACH.
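The abstract's central design idea, separate modules for spatial and channel processing, can be illustrated with a small dependency-free sketch. This is not the SPACH code from the linked repository: the helper names (`make_token_mixer`, `make_channel_mixer`, `mixing_block`), the use of plain linear maps, and the residual connections are all assumptions made for illustration, and normalization and nonlinearities are omitted. The point is only that the spatial module is a swappable function, which is what would allow convolution, attention, or an MLP to be compared inside an otherwise identical block.

```python
from typing import Callable, List

# Rows are tokens (spatial positions), columns are channels.
Matrix = List[List[float]]
Mixer = Callable[[Matrix], Matrix]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product, kept dependency-free for illustration."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum, used for the (assumed) residual connections."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def make_token_mixer(w_spatial: Matrix) -> Mixer:
    """Hypothetical spatial module: mixes across tokens, channel by channel."""
    return lambda x: matmul(w_spatial, x)  # (N x N) @ (N x C)


def make_channel_mixer(w_channel: Matrix) -> Mixer:
    """Hypothetical channel module: mixes across channels, token by token."""
    return lambda x: matmul(x, w_channel)  # (N x C) @ (C x C)


def mixing_block(x: Matrix, spatial_fn: Mixer, channel_fn: Mixer) -> Matrix:
    """Apply the spatial module, then the channel module, each with a residual."""
    x = add(x, spatial_fn(x))
    x = add(x, channel_fn(x))
    return x


# Toy input: 3 tokens with 2 channels each.
x = [[1, 2], [3, 4], [5, 6]]
shift_tokens = make_token_mixer([[0, 1, 0], [0, 0, 1], [1, 0, 0]])
swap_channels = make_channel_mixer([[0, 1], [1, 0]])

out = mixing_block(x, shift_tokens, swap_channels)
out_no_spatial = mixing_block(x, lambda t: t, swap_channels)
```

Passing a different `spatial_fn` (here, the identity in `out_no_spatial`) changes only how tokens interact; the channel module and the block structure stay fixed.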

Code Implementations: 1 repo
