Revisiting the Critical Factors of Augmentation-Invariant Representation Learning
This work provides incremental insights for researchers in self-supervised learning by clarifying factors affecting representation quality in specific frameworks.
The paper investigates critical factors in augmentation-invariant representation learning by establishing a fair benchmark for comparing MoCo v2 and BYOL. It finds that model configurations and mismatched optimization strategies affect transfer performance, and that asymmetry in network structure benefits linear evaluation but may harm long-tailed classification.
We focus on better understanding the critical factors of augmentation-invariant representation learning. We revisit MoCo v2 and BYOL and test the validity of the following assumption: different frameworks produce representations with different characteristics even when they share the same pretext task. We establish the first benchmark for fair comparisons between MoCo v2 and BYOL, and observe that (i) sophisticated model configurations enable better adaptation to the pre-training dataset, and (ii) a mismatch between the optimization strategies used for pre-training and fine-tuning hinders the model from achieving competitive transfer performance. Using this fair benchmark, we investigate further and find that asymmetry in network structure enables contrastive frameworks to work well under the linear evaluation protocol, but may hurt transfer performance on long-tailed classification tasks. Moreover, neither negative samples nor the asymmetric network structure makes models more sensitive to the choice of data augmentations. We believe our findings provide useful information for future work.
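To make the two ingredients discussed above concrete, the following is a minimal sketch of the objectives commonly associated with the two frameworks: a contrastive InfoNCE-style loss that uses negative samples (as in MoCo v2) and a negative-free regression loss applied to the output of a predictor on one branch only (as in BYOL). It is an illustration under stated assumptions, not the benchmark code of this paper. The function names, the plain-list vector representation, and the default temperature of 0.2 are choices made for this sketch, and the encoders, momentum update, memory queue, and predictor network are all omitted.

```python
import math


def l2_normalize(v):
    """Scale a vector (list of floats) to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce_loss(query, positive_key, negative_keys, temperature=0.2):
    """Contrastive loss with negatives (MoCo-style sketch).

    The query should score higher against its positive key than against
    any negative key. The temperature default is an assumption.
    """
    q = l2_normalize(query)
    logits = [dot(q, l2_normalize(positive_key)) / temperature]
    logits += [dot(q, l2_normalize(k)) / temperature for k in negative_keys]
    # Numerically stable cross-entropy; the positive sits at index 0.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[0]


def byol_style_loss(online_prediction, target_projection):
    """Negative-free loss (BYOL-style sketch): 2 - 2 * cosine similarity.

    `online_prediction` stands for the output of the extra predictor on the
    online branch; the target branch has no predictor, which is the
    structural asymmetry.
    """
    p = l2_normalize(online_prediction)
    z = l2_normalize(target_projection)
    return 2.0 - 2.0 * dot(p, z)


if __name__ == "__main__":
    q = [1.0, 0.0]
    print(info_nce_loss(q, [1.0, 0.0], [[0.0, 1.0]]))  # small: positive is aligned
    print(info_nce_loss(q, [0.0, 1.0], [[1.0, 0.0]]))  # large: a negative is aligned
    print(byol_style_loss(q, [2.0, 0.0]))              # zero: same direction
```

In this sketch the contrastive loss depends on the set of negatives while the BYOL-style loss depends only on the paired views, which is the distinction the comparison above turns on.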