Learning Rich Nearest Neighbor Representations from Self-supervised Ensembles
This work addresses a gap in self-supervised learning for image domains by introducing a method to boost representation quality through ensembling, offering potential improvements in transfer learning applications.
The paper tackles the problem of optimally combining self-supervised models to improve representation quality. It proposes a framework that learns representations directly via gradient descent at inference time, which improves performance as measured by k-nearest neighbors on both in-domain and transfer datasets.
Pretraining convolutional neural networks via self-supervision and applying them in transfer learning is a fast-growing field that is rapidly and iteratively improving performance across practically all image domains. Meanwhile, model ensembling is one of the most universally applicable techniques in supervised learning literature and practice, offering a simple way to reliably improve performance. But how to optimally combine self-supervised models to maximize representation quality has largely remained unaddressed. In this work, we provide a framework to perform self-supervised model ensembling via a novel method of learning representations directly through gradient descent at inference time. This technique improves representation quality, as measured by k-nearest neighbors, both on the in-domain dataset and in the transfer setting, with models transferable from the former setting to the latter. Additionally, this direct learning of features through backpropagation improves representations from even a single model, echoing the improvements found in self-distillation.
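As a rough illustration of what "learning representations directly through gradient descent at inference time" could look like, the sketch below treats the representation of a single image as a free vector `z` and optimizes it so that frozen heads map it onto the features produced by each ensemble member. Everything specific here is an assumption made for illustration and is not taken from the abstract: the linear heads, the squared-error loss, the random stand-in features, the dimensions, and the names `learn_representation`, `heads`, and `targets`.

```python
import numpy as np


def learn_representation(targets, heads, dim, steps=200, lr=0.05):
    """Fit one vector z by gradient descent so that each frozen head maps z
    close to the feature vector of the corresponding ensemble member.

    Hypothetical setup: linear heads and a squared-error loss.
    """
    z = np.zeros(dim)  # the representation itself is the only learned parameter
    losses = []
    for _ in range(steps):
        grad = np.zeros(dim)
        loss = 0.0
        for W, f in zip(heads, targets):
            residual = W @ z - f               # error of this head's prediction
            loss += 0.5 * float(residual @ residual)
            grad += W.T @ residual             # gradient of 0.5 * ||W z - f||^2 w.r.t. z
        losses.append(loss)                    # loss measured before this step's update
        z = z - lr * grad                      # plain gradient descent step
    return z, losses


# Stand-in data: three "models" with different feature sizes (all values assumed).
rng = np.random.default_rng(0)
dim = 8
feature_dims = [16, 12, 20]
heads = [rng.normal(size=(d, dim)) / np.sqrt(d) for d in feature_dims]
targets = [rng.normal(size=d) for d in feature_dims]

z, losses = learn_representation(targets, heads, dim)
print(f"loss at start: {losses[0]:.3f}, loss at end: {losses[-1]:.3f}")
```

In this toy setting the learned `z` is a single vector that summarizes all three feature sets. Such vectors could then be compared with a k-nearest-neighbor classifier, which is the evaluation the abstract names. With only one entry in `heads` and `targets`, the same loop corresponds to the single-model case mentioned at the end of the abstract.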