MT-SLVR: Multi-Task Self-Supervised Learning for Transformation In(Variant) Representations
This work addresses the need for flexible representations in self-supervised learning for audio domains, though it is incremental as it builds on existing contrastive methods.
The paper tackles the problem that invariance preferences are unknown in advance and vary across downstream tasks in self-supervised learning. It proposes MT-SLVR, a multi-task framework that learns both variant and invariant features, and reports improved classification performance on few-shot audio tasks.
Contrastive self-supervised learning has gained attention for its ability to create high-quality representations from large unlabelled datasets. A key reason that these powerful features enable data-efficient learning of downstream tasks is that they provide augmentation invariance, which is often a useful inductive bias. However, the amount and type of invariance preferred are not known a priori and vary across downstream tasks. We therefore propose a multi-task self-supervised framework (MT-SLVR) that learns both variant and invariant features in a parameter-efficient manner. Our multi-task representation provides a strong and flexible feature that benefits diverse downstream tasks. We evaluate our approach on few-shot classification tasks drawn from a variety of audio domains and demonstrate improved classification performance on all of them.
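
The idea of learning invariant and variant features together can be illustrated with a minimal sketch of a combined objective. This is not the authors' implementation. The abstract does not specify the losses or the architecture, so everything below is an assumption made for illustration: the NT-Xent contrastive loss as the invariance objective, an augmentation-prediction head as the variance objective, the single shared linear layer, and the names `nt_xent`, `cross_entropy`, `multi_task_loss`, `params` and `variant_weight`.

```python
import numpy as np


def nt_xent(z1, z2, temperature=0.5):
    """Contrastive loss (assumed invariance objective): the two augmented
    views of each example are positives, all other rows are negatives."""
    z = np.concatenate([z1, z2], axis=0)
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-12)
    n = z1.shape[0]
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)  # a row is never its own positive or negative
    row_max = sim.max(axis=1, keepdims=True)
    log_norm = row_max + np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    # Row i in the first half pairs with row i + n, and vice versa.
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    return float(-(sim - log_norm)[np.arange(2 * n), positives].mean())


def cross_entropy(logits, labels):
    """Mean softmax cross-entropy for integer class labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


def multi_task_loss(view1, view2, aug_ids, params, variant_weight=1.0):
    """Hypothetical combined objective on a shared encoder.

    invariant: pulls the two views of an example together.
    variant:   predicts which augmentation produced view2, so the shared
               features must keep augmentation-sensitive information.
    """
    h1 = np.tanh(view1 @ params["shared"])  # shared encoder (assumed one layer)
    h2 = np.tanh(view2 @ params["shared"])
    invariant = nt_xent(h1 @ params["inv_head"], h2 @ params["inv_head"])
    variant = cross_entropy(h2 @ params["var_head"], aug_ids)
    return invariant + variant_weight * variant, invariant, variant


# Toy usage with random stand-ins for audio features and 3 augmentation types.
rng = np.random.default_rng(0)
params = {
    "shared": rng.normal(scale=0.1, size=(16, 8)),
    "inv_head": rng.normal(scale=0.1, size=(8, 4)),
    "var_head": rng.normal(scale=0.1, size=(8, 3)),
}
view1 = rng.normal(size=(8, 16))
view2 = view1 + 0.1 * rng.normal(size=(8, 16))  # stand-in for an augmentation
aug_ids = rng.integers(0, 3, size=8)
total, inv, var = multi_task_loss(view1, view2, aug_ids, params)
```

In this sketch both objectives share the encoder weights and differ only in their small heads, which is one simple way to keep the parameter overhead low; the weight `variant_weight` trades off how strongly augmentation-sensitive information is retained.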