CV · Aug 30, 2021

A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP

Yucheng Zhao, Guangting Wang, Chuanxin Tang, Chong Luo, Wenjun Zeng, Zheng-Jun Zha

arXiv:2108.13002v2 · 15.593 citations · Has Code

Originality: Incremental advance

AI Analysis

This study gives computer vision researchers insight into the trade-offs between network structures, though the advance is incremental because it builds on existing architectures.

The paper empirically compares CNN, Transformer, and MLP architectures for computer vision using a unified framework, finding that all achieve competitive performance at moderate scales but differ when scaled up, and proposes hybrid models that achieve 83.9% top-1 accuracy with 63M parameters.

Convolutional neural networks (CNN) are the dominant deep neural network (DNN) architecture for computer vision. Recently, Transformer and multi-layer perceptron (MLP)-based models, such as Vision Transformer and MLP-Mixer, started to lead new trends as they showed promising results in the ImageNet classification task. In this paper, we conduct empirical studies on these DNN structures and try to understand their respective pros and cons. To ensure a fair comparison, we first develop a unified framework called SPACH which adopts separate modules for spatial and channel processing. Our experiments under the SPACH framework reveal that all structures can achieve competitive performance at a moderate scale. However, they demonstrate distinctive behaviors when the network size scales up. Based on our findings, we propose two hybrid models using convolution and Transformer modules. The resulting Hybrid-MS-S+ model achieves 83.9% top-1 accuracy with 63M parameters and 12.3G FLOPS. It is already on par with the SOTA models with sophisticated designs. The code and models are publicly available at https://github.com/microsoft/SPACH.
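The idea of separate spatial and channel processing can be illustrated with a small sketch. This is not the SPACH code: the abstract does not describe the module internals, so the function names, the tiny shapes, and the use of plain linear mixing as a stand-in for a convolution, Transformer, or MLP module are all assumptions made for illustration.

```python
# Illustrative sketch only; names, shapes, and weights are hypothetical,
# not taken from the SPACH repository.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def spatial_mix(tokens, w_spatial):
    """Mix information across token positions using an N x N weight matrix."""
    return matmul(w_spatial, tokens)

def channel_mix(tokens, w_channel):
    """Mix information across channels using a C x C weight matrix."""
    return matmul(tokens, w_channel)

def mixing_block(tokens, w_spatial, w_channel):
    """One block: spatial mixing, then channel mixing, each with a residual."""
    tokens = add(tokens, spatial_mix(tokens, w_spatial))
    tokens = add(tokens, channel_mix(tokens, w_channel))
    return tokens

# N = 3 tokens, C = 2 channels.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# Hypothetical weights: cyclically shift token positions, then swap channels.
w_spatial = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
w_channel = [[0.0, 1.0], [1.0, 0.0]]

out = mixing_block(tokens, w_spatial, w_channel)
print(out)
```

In this sketch the spatial step only moves information between token positions and the channel step only moves it between feature channels, so either step could in principle be replaced by a different module type without touching the other.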
