AS, CL, CV, LG, SD | Feb 7, 2022

Self-Supervised Representation Learning for Speech Using Visual Grounding and Masked Language Modeling

arXiv:2202.03543v2 | 30 citations
AI Analysis

This work addresses speech representation learning without transcriptions, with zero-resource speech processing as a target application. The contribution is incremental: it extends the existing FaST-VGS model rather than proposing a new architecture.

The authors learn speech representations without transcriptions by extending FaST-VGS into FaST-VGS+, which is trained in a multi-task fashion that combines visual grounding with masked language modeling. On ZeroSpeech 2021, the models are competitive on the ABX task, outperform all other concurrent submissions on the Syntactic and Semantic tasks, and nearly match the best system on the Lexical task. On the SUPERB benchmark they also perform strongly, in some cases outperforming wav2vec2.0.

In this paper, we describe our submissions to the ZeroSpeech 2021 Challenge and SUPERB benchmark. Our submissions are based on the recently proposed FaST-VGS model, which is a Transformer-based model that learns to associate raw speech waveforms with semantically related images, all without the use of any transcriptions of the speech. Additionally, we introduce a novel extension of this model, FaST-VGS+, which is learned in a multi-task fashion with a masked language modeling objective in addition to the visual grounding objective. On ZeroSpeech 2021, we show that our models perform competitively on the ABX task, outperform all other concurrent submissions on the Syntactic and Semantic tasks, and nearly match the best system on the Lexical task. On the SUPERB benchmark, we show that our models also achieve strong performance, in some cases even outperforming the popular wav2vec2.0 model.
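The multi-task idea in the abstract (a visual grounding objective plus a masked language modeling objective) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the loss forms, so the InfoNCE-style contrastive loss, the cross-entropy over masked frames, the `mlm_weight` parameter, and all function names below are assumptions chosen purely for illustration.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def contrastive_loss(sim):
    """Assumed stand-in for a visual grounding objective.

    sim[i][j] is a similarity score between speech clip i and image j;
    matched pairs sit on the diagonal of the square matrix.
    """
    n = len(sim)
    # Speech -> image: each row should rank its own image highest.
    s2i = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Image -> speech: same idea over the columns.
    cols = [[sim[i][j] for i in range(n)] for j in range(n)]
    i2s = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (s2i + i2s)


def masked_prediction_loss(logits, targets, masked):
    """Assumed stand-in for a masked language modeling objective.

    Cross-entropy is averaged over masked frames only.
    """
    losses = [
        -log_softmax(logits[t])[targets[t]]
        for t in range(len(logits))
        if masked[t]
    ]
    return sum(losses) / len(losses) if losses else 0.0


def multitask_loss(sim, logits, targets, masked, mlm_weight=1.0):
    """Weighted sum of the two objectives (the weighting is hypothetical)."""
    grounding = contrastive_loss(sim)
    mlm = masked_prediction_loss(logits, targets, masked)
    return grounding + mlm_weight * mlm


if __name__ == "__main__":
    sim = [[5.0, 0.0], [0.0, 5.0]]          # matched pairs score highest
    logits = [[2.0, 0.0], [0.0, 2.0]]       # per-frame scores over 2 units
    print(multitask_loss(sim, logits, targets=[0, 1], masked=[True, False]))
```

In a sketch like this, both terms would be backpropagated through a shared speech encoder, so that a single set of representations is shaped by the two training signals together.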

Code Implementations: 1 repo
