Variance-Covariance Regularization Improves Representation Learning
This work addresses the transfer learning bottleneck in machine learning by providing a universally applicable regularization framework that enhances feature transferability across domains.
Conventional supervised pretraining can undermine feature transferability. The paper addresses this by adapting a self-supervised regularization technique to supervised learning, introducing Variance-Covariance Regularization (VCReg), which encourages high-variance, low-covariance representations. The method achieves state-of-the-art performance across numerous image and video transfer learning tasks and datasets, and also improves performance in long-tail learning and hierarchical classification.
Transfer learning plays a key role in advancing machine learning models, yet conventional supervised pretraining often undermines feature transferability by prioritizing features that minimize the pretraining loss. In this work, we adapt a self-supervised learning regularization technique from the VICReg method to supervised learning contexts, introducing Variance-Covariance Regularization (VCReg). This adaptation encourages the network to learn high-variance, low-covariance representations, which promotes the learning of more diverse features. We outline best practices for an efficient implementation of our framework, including applying it to the intermediate representations. Through extensive empirical evaluation, we demonstrate that our method significantly enhances transfer learning for images and videos, achieving state-of-the-art performance across numerous tasks and datasets. VCReg also improves performance in scenarios such as long-tail learning and hierarchical classification. Additionally, we show that its effectiveness may stem from its success in addressing challenges such as gradient starvation and neural collapse. In summary, VCReg offers a universally applicable regularization framework that advances transfer learning and highlights the connection between gradient starvation, neural collapse, and feature transferability.
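To make "high-variance, low-covariance" concrete, here is a minimal sketch of a variance-covariance penalty on a batch of representations, modeled on the variance and covariance terms commonly associated with VICReg. It is an illustration under stated assumptions, not the paper's implementation. The function name `vcreg_loss`, the hinge target `gamma`, the stabilizer `eps`, and the weights `alpha` and `beta` are hypothetical choices made for this example. Pure Python is used only to keep the sketch dependency-free; a practical version would use a tensor library and, as the abstract notes, could be applied to intermediate representations.

```python
import math


def vcreg_loss(batch, gamma=1.0, eps=1e-4, alpha=1.0, beta=1.0):
    """Sketch of a variance-covariance penalty (all defaults are assumptions).

    batch: list of n feature vectors, each a list of d floats (n >= 2).
    Returns (total, variance_term, covariance_term).
    """
    n = len(batch)
    d = len(batch[0])

    # Center each feature dimension across the batch.
    means = [sum(row[j] for row in batch) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in batch]

    # Unbiased d x d covariance matrix of the features.
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]

    # Variance term: hinge that penalizes any dimension whose standard
    # deviation falls below gamma (discourages collapsed features).
    variance_term = sum(
        max(0.0, gamma - math.sqrt(cov[j][j] + eps)) for j in range(d)
    ) / d

    # Covariance term: squared off-diagonal entries, scaled by d
    # (discourages redundant, correlated features).
    covariance_term = sum(
        cov[i][j] ** 2 for i in range(d) for j in range(d) if i != j
    ) / d

    total = alpha * variance_term + beta * covariance_term
    return total, variance_term, covariance_term


# Identical rows: no variance, so only the variance hinge is active.
collapsed = [[1.0, 1.0]] * 4
# Spread-out, uncorrelated dimensions: neither term is active.
decorrelated = [[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0], [0.0, -2.0]]
# Two perfectly correlated dimensions: only the covariance term is active.
correlated = [[2.0, 2.0], [-2.0, -2.0], [1.0, 1.0], [-1.0, -1.0]]

for name, feats in [("collapsed", collapsed),
                    ("decorrelated", decorrelated),
                    ("correlated", correlated)]:
    print(name, vcreg_loss(feats))
```

In supervised training, a penalty of this kind would be added to the task loss, so the network is pushed both to fit the labels and to keep its features spread out and decorrelated. How the two objectives are weighted is a tuning decision that this sketch does not settle.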