CV · Dec 3, 2025

Diminishing Returns in Self-Supervised Learning

Oli Bridge, Huey Sun, Botond Branyicskai-Nagy, Charles D'Ornano, Shomit Basu

arXiv:2512.03862v13.6

Originality: Synthesis-oriented

AI Analysis

This work addresses the problem of optimizing training strategies for small-scale vision transformers. The contribution is incremental, as it builds on existing self-supervised learning methods.

The study investigates the marginal benefits of pre-training, intermediate fine-tuning, and downstream tasks on a small 5M-parameter vision transformer. It finds that pre-training and fine-tuning help, though with diminishing returns, whereas intermediate fine-tuning can harm downstream performance, possibly because of task dissimilarity.

While transformer-based architectures have taken computer vision and NLP by storm, they often require a vast number of parameters and large amounts of training data to attain strong performance. In this work, we experiment with three distinct pre-training, intermediate fine-tuning, and downstream datasets and training objectives to explore their marginal benefits on a small 5M-parameter vision transformer. We find that while pre-training and fine-tuning always help our model, they show diminishing returns, and intermediate fine-tuning can actually harm downstream performance, potentially due to dissimilarity in task mechanics. Taken together, our results suggest that small-scale ViTs benefit most from targeted pre-training and careful data selection, while indiscriminate stacking of intermediate tasks can waste compute and even degrade performance.
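
To give a sense of scale for a "small 5M-parameter vision transformer", the sketch below counts the learnable parameters of a standard ViT from its hyperparameters. The configuration used here (224-pixel images, 16-pixel patches, embedding width 192, depth 12, MLP ratio 4, 1000 classes) is the commonly used ViT-Tiny layout and is an assumption for illustration; the abstract does not state the authors' exact architecture, and the function name `count_vit_params` is hypothetical.

```python
def count_vit_params(image_size=224, patch_size=16, in_channels=3,
                     dim=192, depth=12, mlp_ratio=4, num_classes=1000):
    """Count learnable parameters of a plain ViT (assumed ViT-Tiny-like defaults).

    Assumes a class token, learned position embeddings, pre-norm blocks with
    biased linear layers, and a single linear classification head.
    """
    num_patches = (image_size // patch_size) ** 2

    # Patch embedding: one linear projection per flattened patch, plus bias.
    patch_embed = in_channels * patch_size * patch_size * dim + dim
    cls_token = dim
    pos_embed = (num_patches + 1) * dim  # +1 for the class token

    # One transformer block.
    layer_norm = 2 * dim  # scale and shift
    attention = (dim * 3 * dim + 3 * dim) + (dim * dim + dim)  # qkv + output proj
    hidden = mlp_ratio * dim
    mlp = (dim * hidden + hidden) + (hidden * dim + dim)
    block = 2 * layer_norm + attention + mlp

    # Final norm and classification head.
    head = dim * num_classes + num_classes
    return patch_embed + cls_token + pos_embed + depth * block + layer_norm + head


print(count_vit_params())  # 5717416
```

Under these assumed settings the encoder alone accounts for about 5.5M parameters, and the count does not depend on the number of attention heads, since heads only partition the same projection matrices.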
