CL | LG | SD | AS | May 2, 2022

Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages

DeepMind
arXiv:2205.01086v1 | 46 citations | h-index: 83
Originality: Highly original
AI Analysis

Wav2Seq addresses the challenge of pre-training both the encoder and the decoder of speech-to-text models. It offers a low-cost solution that benefits tasks such as ASR, spoken named entity recognition, and speech-to-text translation, though the contribution is incremental in that it builds on existing encoder-decoder frameworks.

The paper tackles the problem of pre-training encoder-decoder models for speech data by introducing Wav2Seq, a self-supervised method that uses pseudo languages to transcribe audio into pseudo subword sequences. It reports new state-of-the-art results for end-to-end spoken named entity recognition and consistent improvements on 20 language pairs for speech-to-text translation.

We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data. We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task -- transcribing audio inputs into pseudo subword sequences. This process stands on its own, or can be applied as low-cost second-stage pre-training. We experiment with automatic speech recognition (ASR), spoken named entity recognition, and speech-to-text translation. We set new state-of-the-art results for end-to-end spoken named entity recognition, and show consistent improvements on 20 language pairs for speech-to-text translation, even when competing methods use additional text data for training. Finally, on ASR, our approach enables encoder-decoder methods to benefit from pre-training for all parts of the network, and shows comparable performance to highly optimized recent methods.
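The abstract does not spell out how the pseudo language is induced, so the following is only a minimal sketch of one plausible pipeline, not the authors' implementation. It assumes each utterance has already been mapped to frame-level discrete unit IDs (for example, by clustering learned speech features), collapses consecutive repeats to shorten the sequence, and then learns byte-pair-encoding (BPE) merges over the units to form pseudo subwords. The function names `collapse_repeats`, `learn_bpe`, `apply_merge`, and `encode` are hypothetical.

```python
from collections import Counter


def collapse_repeats(ids):
    """Drop consecutive duplicate unit IDs, e.g. [3, 3, 7, 7, 7, 3] -> [3, 7, 3]."""
    out = []
    for unit in ids:
        if not out or out[-1] != unit:
            out.append(unit)
    return out


def apply_merge(seq, pair):
    """Replace every adjacent occurrence of `pair` in `seq` with one merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])  # tuple concatenation
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(sequences, num_merges):
    """Learn up to `num_merges` BPE merges over de-duplicated unit sequences.

    Tokens are tuples of unit IDs, so a merged token records the units it spans.
    """
    seqs = [[(u,) for u in collapse_repeats(s)] for s in sequences]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for s in seqs:
            pairs.update(zip(s, s[1:]))
        if not pairs:
            break
        # Most frequent adjacent pair; ties are broken by the pair itself
        # so that the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        seqs = [apply_merge(s, best) for s in seqs]
    return merges


def encode(ids, merges):
    """Turn frame-level unit IDs into a pseudo subword sequence (assumed target format)."""
    seq = [(u,) for u in collapse_repeats(ids)]
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq
```

Under these assumptions, the output of `encode` would serve as the target "transcript" for the pseudo speech recognition task: the encoder-decoder model is trained to map raw audio to this token sequence, which exercises both the encoder and the decoder without any human-written text.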

Code Implementations: 1 repo
