Co-training $2^L$ Submodels for Visual Recognition
This work addresses the need for better regularization when training visual recognition models, and reports improved results over existing training recipes in comparable settings.
The paper introduces submodel co-training (cosub), which uses stochastic depth to instantiate two submodels per training sample that act as soft teachers for each other. A ViT-B pretrained with cosub on ImageNet-21k reaches 87.4% top-1 accuracy on ImageNet-val at resolution 448.
We introduce submodel co-training, a regularization method related to co-training, self-distillation and stochastic depth. Given a neural network to be trained, for each sample we implicitly instantiate two altered networks, ``submodels'', with stochastic depth: we activate only a subset of the layers. Each network serves as a soft teacher to the other, by providing a loss that complements the regular loss provided by the one-hot label. Our approach, dubbed cosub, uses a single set of weights, and does not involve a pre-trained external model or temporal averaging. Experimentally, we show that submodel co-training is effective for training backbones for recognition tasks such as image classification and semantic segmentation. Our approach is compatible with multiple architectures, including RegNet, ViT, PiT, XCiT, Swin and ConvNeXt. Our training strategy improves their results in comparable settings. For instance, a ViT-B pretrained with cosub on ImageNet-21k obtains 87.4% top-1 accuracy at resolution 448 on ImageNet-val.
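The mechanism described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the toy residual layers, the function names, the `drop_rate` value, the weighting `lam` between the label loss and the co-training loss, and the use of softmax cross-entropy for both terms are all assumptions made for illustration. With $L$ residual layers that are each either kept or skipped, there are $2^L$ possible submodels; two of them are sampled per sample, and both share the same weights.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, logits):
    """Cross-entropy between a target distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target_probs, probs) if t > 0)


def sample_layer_mask(num_layers, drop_rate, rng):
    """Stochastic depth: keep each layer with probability 1 - drop_rate."""
    return [rng.random() >= drop_rate for _ in range(num_layers)]


def forward_submodel(x, layers, mask):
    """Residual forward pass that skips every layer whose mask entry is False."""
    for layer, keep in zip(layers, mask):
        if keep:
            x = [xi + ri for xi, ri in zip(x, layer(x))]
    return x  # treated directly as class logits in this toy setup


def cosub_loss(x, label, layers, drop_rate=0.5, lam=0.5, rng=random):
    """Label loss plus mutual soft-teacher loss for two sampled submodels.

    `lam` (hypothetical) balances the two terms; the source does not
    specify the weighting here.
    """
    mask_a = sample_layer_mask(len(layers), drop_rate, rng)
    mask_b = sample_layer_mask(len(layers), drop_rate, rng)
    logits_a = forward_submodel(x, layers, mask_a)
    logits_b = forward_submodel(x, layers, mask_b)

    one_hot = [1.0 if k == label else 0.0 for k in range(len(logits_a))]
    # Teacher outputs are plain constants here; in an autograd framework
    # they would typically be detached so no gradient flows through them.
    soft_a = softmax(logits_a)
    soft_b = softmax(logits_b)

    label_term = cross_entropy(one_hot, logits_a) + cross_entropy(one_hot, logits_b)
    teacher_term = cross_entropy(soft_b, logits_a) + cross_entropy(soft_a, logits_b)
    return lam * label_term + (1.0 - lam) * teacher_term


if __name__ == "__main__":
    # Three toy residual layers sharing one set of weights across submodels.
    toy_layers = [lambda v, w=w: [w * math.tanh(t) for t in v] for w in (0.5, -0.3, 0.8)]
    print(cosub_loss([0.2, -0.1, 0.4], label=2, layers=toy_layers, rng=random.Random(0)))
```

Because both submodels are drawn from the same weights, no external pre-trained teacher or averaged copy of the model is needed in this sketch, which matches the single-set-of-weights property stated in the abstract.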