Towards Inadequately Pre-trained Models in Transfer Learning
This paper challenges the assumption that better pre-trained models always transfer better and offers insights for transfer learning practitioners, although its contribution is incremental because it builds on existing pre-training paradigms.
The paper finds that, within a single pre-training run, models from middle epochs (inadequately pre-trained) can outperform fully trained models as feature extractors, while fine-tuning performance still improves with source performance. This reveals no solid positive correlation between ImageNet accuracy and transfer results. The paper analyzes features to explain this contradiction, showing that models first learn spectral components with large singular values and that the residual components aid fine-tuning.
Pre-training has been a popular learning paradigm in the deep learning era, especially in annotation-insufficient scenarios. Previous research has shown that, from the perspective of architecture, better ImageNet pre-trained models transfer better to downstream tasks. In this paper, however, we find that within the same pre-training process, models at middle epochs, which are inadequately pre-trained, can outperform fully trained models when used as feature extractors (FE), while the fine-tuning (FT) performance still grows with the source performance. This reveals that there is no solid positive correlation between top-1 accuracy on ImageNet and the transfer result on target data. Based on this contradictory phenomenon between FE and FT, in which a better feature extractor is not fine-tuned correspondingly better, we conduct comprehensive analyses of the features before the softmax layer to provide insightful explanations. Our discoveries suggest that, during pre-training, models tend to first learn the spectral components corresponding to large singular values, and that the residual components contribute more during fine-tuning.
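To make the notion of "spectral components" and "residual components" concrete, the following is a minimal sketch, not the paper's actual analysis pipeline. It assumes the pre-softmax features are collected into a matrix of shape (samples, dimensions) and splits that matrix with a truncated SVD. The function name `split_spectral`, the cutoff `k`, and the random placeholder features are all illustrative assumptions.

```python
import numpy as np

def split_spectral(features, k):
    """Split a feature matrix into its top-k spectral part and the residual.

    `features` is assumed to have shape (n_samples, n_dims); `k` is a
    hypothetical cutoff on the number of leading singular values to keep.
    """
    # Thin SVD: features = U @ diag(s) @ Vt, with s sorted in descending order.
    U, s, Vt = np.linalg.svd(features, full_matrices=False)
    # Rank-k reconstruction from the k largest singular values.
    top = (U[:, :k] * s[:k]) @ Vt[:k]
    # Everything carried by the remaining, smaller singular values.
    residual = features - top
    return top, residual, s

# Placeholder data standing in for pre-softmax features (assumption).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 64))

k = 8
top, residual, s = split_spectral(features, k)

# Fraction of squared Frobenius norm captured by the top-k components.
energy_top = np.sum(s[:k] ** 2) / np.sum(s ** 2)
print(f"top-{k} components hold {energy_top:.1%} of the feature energy")
```

Under these assumptions, comparing `energy_top` across checkpoints of one pre-training run would be one way to probe how early the large-singular-value components are learned, with `residual` holding the part that the abstract associates with fine-tuning.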