AS · LG · SD · ML · Jul 1, 2021

Pretext Tasks selection for multitask self-supervised speech representation learning

arXiv:2107.00594v5 · 13 citations
Originality: Incremental advance
AI Analysis

This work addresses a domain-specific problem for researchers in speech processing by providing a more efficient alternative to computationally heavy experimental procedures for pretext task selection, though it is incremental as it builds on existing pretext task methods.

The paper tackles the problem of selecting and combining pretext tasks for multitask self-supervised speech representation learning, introducing a method that estimates calibrated weights for partial losses, which outperforms classic baselines in experiments on automatic speech recognition, speaker recognition, and emotion recognition.

Through solving pretext tasks, self-supervised learning leverages unlabeled data to extract useful latent representations replacing traditional input features in the downstream task. In audio/speech signal processing, a wide range of features were engineered through decades of research efforts. As it turns out, learning to predict such features (a.k.a. pseudo-labels) has proven to be a particularly relevant pretext task, leading to self-supervised representations that are effective for downstream tasks. However, methods and common practices for combining such pretext tasks for better performance on the downstream task have not been explored and understood properly. In fact, the process relies almost exclusively on a computationally heavy experimental procedure, which becomes intractable as the number of pretext tasks increases. This paper introduces a method to select a group of pretext tasks among a set of candidates. The method we propose estimates calibrated weights for the partial losses corresponding to the considered pretext tasks during the self-supervised training process. The experiments conducted on automatic speech recognition, speaker recognition, and emotion recognition validate our approach, as the groups selected and weighted with our method perform better than classic baselines, thus facilitating the selection and combination of relevant pseudo-labels for self-supervised representation learning.
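To make the idea of weighting partial losses concrete, here is a minimal sketch of how per-task losses could be combined once a weight per pretext task is available. It is an illustration only, not the paper's estimator: the abstract does not say how the calibrated weights are computed, so the raw scores, the clip-and-rescale normalization, and the pseudo-label names (`f0`, `loudness`, `zcr`, `voicing`) are all assumptions chosen for the example.

```python
from typing import Dict, List


def normalize_weights(raw: Dict[str, float]) -> Dict[str, float]:
    """Clip negative scores to zero and rescale so the weights sum to 1.

    Assumption: this simple normalization stands in for whatever
    calibration the paper actually uses.
    """
    clipped = {task: max(0.0, w) for task, w in raw.items()}
    total = sum(clipped.values())
    if total == 0.0:
        raise ValueError("at least one pretext task needs a positive weight")
    return {task: w / total for task, w in clipped.items()}


def select_tasks(weights: Dict[str, float]) -> List[str]:
    """Keep only the pretext tasks whose weight is non-zero."""
    return [task for task, w in weights.items() if w > 0.0]


def combined_loss(partial_losses: Dict[str, float],
                  weights: Dict[str, float]) -> float:
    """Weighted sum of the per-task (partial) losses."""
    return sum(weights[task] * partial_losses[task] for task in weights)


# Hypothetical raw scores for four candidate pseudo-labels.
raw = {"f0": 2.0, "loudness": 1.0, "zcr": -0.5, "voicing": 1.0}
weights = normalize_weights(raw)

# Hypothetical partial losses measured on one training batch.
partial = {"f0": 0.8, "loudness": 0.4, "zcr": 1.2, "voicing": 0.6}
total = combined_loss(partial, weights)
selected = select_tasks(weights)
```

In this sketch a task with a zero weight contributes nothing to the training loss, so selecting a group of pretext tasks and weighting them become the same operation.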

Code Implementations: 1 repo