CV Mar 9, 2020

On Compositions of Transformations in Contrastive Self-Supervised Learning

Mandela Patrick, Yuki M. Asano, Polina Kuznetsova, Ruth Fong, João F. Henriques, Geoffrey Zweig, Andrea Vedaldi

arXiv:2003.04298v328.689 citations Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of learning effective representations from complex data such as videos by systematically handling transformations. This matters for self-supervised learning in computer vision and multimedia, though the work builds incrementally on existing contrastive methods.

The paper tackles the problem of generalizing contrastive self-supervised learning to handle compositions of transformations, where invariance or distinctiveness is required, by proposing a formal framework and practical construction. It shows that this approach improves video representation learning, surpassing supervised pretraining and achieving state-of-the-art results on multiple benchmarks by a large margin.

In the image domain, excellent representations can be learned by inducing invariance to content-preserving transformations via noise contrastive learning. In this paper, we generalize contrastive learning to a wider set of transformations, and their compositions, for which either invariance or distinctiveness is sought. We show that it is not immediately obvious how existing methods such as SimCLR can be extended to do so. Instead, we introduce a number of formal requirements that all contrastive formulations must satisfy, and propose a practical construction which satisfies these requirements. In order to maximise the reach of this analysis, we express all components of noise contrastive formulations as the choice of certain generalized transformations of the data (GDTs), including data sampling. We then consider videos as an example of data in which a large variety of transformations are applicable, accounting for the extra modalities -- for which we analyze audio and text -- and the dimension of time. We find that being invariant to certain transformations and distinctive to others is critical to learning effective video representations, improving the state-of-the-art for multiple benchmarks by a large margin, and even surpassing supervised pretraining.
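To make the idea of being invariant to some transformations and distinctive to others concrete, here is a minimal NumPy sketch. It is not the paper's GDT construction: it assumes a generic multi-positive InfoNCE-style loss, and the names `build_positive_mask` and `contrastive_loss`, the toy transformation grid (time shift, modality), and the temperature value are all hypothetical choices made for illustration. Two views of the same sample count as positives only if they agree on every transformation marked as distinctive; everything else in the batch acts as a negative.

```python
import itertools
import numpy as np


def build_positive_mask(n_samples, transform_grid, invariant):
    """Enumerate (sample, transformation-composition) views and mark positives.

    transform_grid: one tuple of options per transformation, e.g.
        [(0, 1), ("video", "audio")]  # hypothetical: time shift, modality
    invariant: one bool per transformation; True means views that differ in
        this factor should still be pulled together.
    """
    combos = list(itertools.product(*transform_grid))
    keys = [(i, t) for i in range(n_samples) for t in combos]
    mask = np.zeros((len(keys), len(keys)), dtype=bool)
    for a, (ia, ta) in enumerate(keys):
        for b, (ib, tb) in enumerate(keys):
            if a == b or ia != ib:
                continue  # self-pairs and other samples are never positives
            # Positive only if every distinctive factor takes the same value.
            mask[a, b] = all(
                x == y for x, y, inv in zip(ta, tb, invariant) if not inv
            )
    return keys, mask


def contrastive_loss(z, mask, temperature=0.1):
    """Multi-positive InfoNCE-style loss over one batch of embeddings z."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    logits = z @ z.T / temperature
    np.fill_diagonal(logits, -np.inf)  # exclude self-similarity
    # Numerically stable log-softmax over all other views in the batch.
    m = logits.max(axis=1, keepdims=True)
    log_prob = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    pos_counts = mask.sum(axis=1)
    valid = pos_counts > 0  # skip anchors that have no positive
    pos_log_prob = np.where(mask, log_prob, 0.0).sum(axis=1)
    return -(pos_log_prob[valid] / pos_counts[valid]).mean()


# Usage: invariant to modality, distinctive to time shift.
keys, mask = build_positive_mask(
    n_samples=2,
    transform_grid=[(0, 1), ("video", "audio")],
    invariant=(False, True),
)
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(keys), 16))  # stand-in for encoder outputs
loss = contrastive_loss(embeddings, mask)
```

In this toy setup the audio and video views of the same clip at the same time shift attract each other, while the same clip at a different time shift is treated as a negative. Flipping an entry of `invariant` changes which views attract without touching the loss itself.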
