Towards Better Accuracy-efficiency Trade-offs: Divide and Co-training
This work addresses the accuracy-efficiency trade-off in neural networks by demonstrating that increasing the number of networks can be more effective than solely increasing network width.
This paper proposes dividing a large neural network into several smaller ones and co-training them to achieve better accuracy-efficiency trade-offs. When ensembled, the small networks outperform the original large network with few or no extra parameters or FLOPs, and they can also offer faster inference when run concurrently.
The width of a neural network matters since increasing the width will necessarily increase the model capacity. However, the performance of a network does not improve linearly with the width and soon saturates. In this case, we argue that increasing the number of networks (ensemble) can achieve better accuracy-efficiency trade-offs than purely increasing the width. To support this claim, one large network is divided into several small ones with respect to its parameters and regularization components. Each of these small networks has a fraction of the original one's parameters. We then train these small networks together and make them see various views of the same data to increase their diversity. During this co-training process, the networks can also learn from each other. As a result, the small networks can achieve better ensemble performance than the large one with few or no extra parameters or FLOPs, i.e., better accuracy-efficiency trade-offs. The small networks can also achieve faster inference than the large one by running concurrently. Together, these results show that the number of networks is a new dimension of model scaling. We validate our argument with 8 different neural architectures on common benchmarks through extensive experiments. The code is available at \url{https://github.com/FreeformRobotics/Divide-and-Co-training}.
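To make the parameter budget concrete, the following is a minimal sketch rather than the released implementation. It assumes a division rule that the abstract does not state: because the weights of a convolution grow with the product of its input and output channels, shrinking the width by a factor of sqrt(S) leaves each of S small networks with roughly 1/S of the parameters. The names `conv_params`, `split_width`, and `ensemble_predict` are hypothetical, and averaging softmax outputs is one common way to form an ensemble, assumed here for illustration.

```python
import math


def conv_params(c_in, c_out, k=3):
    """Number of weights in one k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def split_width(width, num_nets):
    """Assumed rule: divide the width by sqrt(num_nets) so that each
    small network keeps roughly 1/num_nets of the parameters."""
    return max(1, round(width / math.sqrt(num_nets)))


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(all_logits):
    """Average the softmax outputs of the small networks
    (an assumed ensembling scheme, one list of logits per network)."""
    probs = [softmax(logits) for logits in all_logits]
    n = len(probs)
    return [sum(p[i] for p in probs) / n for i in range(len(probs[0]))]


# One layer of a hypothetical large network, split into 4 small ones.
large_width, num_nets = 256, 4
small_width = split_width(large_width, num_nets)
large_total = conv_params(large_width, large_width)
small_total = num_nets * conv_params(small_width, small_width)

print(small_width, large_total, small_total)
```

Under this assumed rule the four small layers together hold the same number of weights as the single large layer, which is consistent with the "few or no extra parameters" claim. For widths that are not divisible in this way, rounding would make the totals only approximately equal.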