CV · Dec 3, 2025

Diminishing Returns in Self-Supervised Learning

arXiv:2512.03862v1
Originality: Synthesis-oriented
AI Analysis

This work addresses the problem of optimizing training strategies for small-scale vision transformers. The contribution is incremental, as it builds on existing self-supervised learning methods.

The study investigated the marginal benefits of pre-training, intermediate fine-tuning, and downstream fine-tuning on a small 5M-parameter vision transformer. It found that pre-training and fine-tuning consistently help, though with diminishing returns, while intermediate fine-tuning can harm downstream performance, potentially because of task dissimilarity.

While transformer-based architectures have taken computer vision and NLP by storm, they often require a vast number of parameters and large amounts of training data to attain strong performance. In this work, we experiment with three distinct pre-training, intermediate fine-tuning, and downstream datasets and training objectives to explore their marginal benefits on a small 5M-parameter vision transformer. We find that pre-training and fine-tuning always help our model but have diminishing returns, whereas intermediate fine-tuning can actually harm downstream performance, potentially due to dissimilarity in task mechanics. Taken together, our results suggest that small-scale ViTs benefit most from targeted pre-training and careful data selection, while indiscriminate stacking of intermediate tasks can waste compute and even degrade performance.

