CL AI SD AS
Nov 14, 2022

MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets

Ziyang Ma, Zhisheng Zheng, Changli Tang, Yujin Wang, Xie Chen

arXiv:2211.07321v33.021 citations | h-index: 28 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of enhancing speech representation learning for applications like speech recognition, though it appears incremental as it builds on existing self-supervised methods.

The paper aims to improve self-supervised speech representation learning. It proposes MT4SSL, a multi-task framework that combines training targets from an offline extractor (K-means) and an online extractor (a teacher network without gradients). The authors report that it outperforms previous methods on the LibriSpeech benchmark by nontrivial margins and is comparable to or better than the best-performing models while using less data.

In this paper, we provide a new perspective on self-supervised speech models from how the training targets are obtained. We generalize the targets extractor into Offline Targets Extractor (Off-TE) and Online Targets Extractor (On-TE). Based on this, we propose a new multi-tasking learning framework for self-supervised learning, MT4SSL, which stands for Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets. MT4SSL uses the K-means algorithm as an Off-TE and a teacher network without gradients as an On-TE, respectively. Our model outperforms previous SSL methods by nontrivial margins on the LibriSpeech benchmark, and is comparable to or even better than the best-performing models with fewer data. Furthermore, we find that using both Off-TE and On-TE results in better convergence in the pre-training phase. With both effectiveness and efficiency, we think doing multi-task learning on self-supervised speech models from our perspective is a promising trend.
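The two kinds of targets described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the per-frame losses, the weighting factor `alpha`, the exponential-moving-average teacher update, and all function names are assumptions chosen for illustration. The abstract states only that K-means serves as the Off-TE and a teacher network without gradients serves as the On-TE.

```python
import math


def kmeans_assign(frame, centroids):
    """Off-TE stand-in: return the index of the nearest centroid (a discrete target)."""
    dists = [sum((f - c) ** 2 for f, c in zip(frame, centroid)) for centroid in centroids]
    return dists.index(min(dists))


def cross_entropy(logits, target_id):
    """Classification loss against the offline (cluster-ID) target."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_id]


def mse(pred, target):
    """Regression loss against the online (teacher-output) target."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def ema_update(teacher, student, decay=0.999):
    """Assumed teacher update: an exponential moving average of student weights.

    The teacher receives no gradients; it only tracks the student.
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def multi_target_loss(logits, cluster_id, student_out, teacher_out, alpha=0.5):
    """Hypothetical weighted sum of the offline and online per-frame losses."""
    offline = cross_entropy(logits, cluster_id)
    online = mse(student_out, teacher_out)
    return alpha * offline + (1.0 - alpha) * online


# Toy example: one 2-dimensional frame and three precomputed centroids.
centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
frame = [0.9, 1.2]
cluster_id = kmeans_assign(frame, centroids)

logits = [0.1, 2.0, -1.0]      # student's prediction over cluster IDs
student_out = [0.5, 0.5]       # student's predicted representation
teacher_out = [0.4, 0.7]       # teacher's representation (treated as a constant)
loss = multi_target_loss(logits, cluster_id, student_out, teacher_out)

# After a (hypothetical) student step, the teacher drifts toward the student.
teacher_weights = ema_update([1.0, 1.0], [0.0, 2.0], decay=0.9)
```

In this sketch the offline targets are fixed before training (the centroids never change), while the online targets move with the teacher, which is the distinction the abstract draws between Off-TE and On-TE.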
