CL AS · Oct 7, 2022

SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training

Ziqiang Zhang, Long Zhou, Junyi Ao, Shujie Liu, Lirong Dai, Jinyu Li, Furu Wei

Microsoft

arXiv:2210.03730v1 · 25.2320 citations · h-index: 102 · Has Code

Originality: Highly original

AI Analysis

This work addresses the challenge of integrating speech and text modalities for tasks like automatic speech recognition and speech translation, offering a novel pre-training approach that could benefit researchers and practitioners in speech processing.

The authors tackled the problem of cross-modal pre-training by proposing SpeechUT, a unified model that bridges speech and text using hidden units as an interface, enabling joint pre-training with unpaired data. The model achieved state-of-the-art performance on LibriSpeech ASR and MuST-C ST tasks, with substantial improvements over strong baselines.

The rapid development of single-modal pre-training has prompted researchers to pay more attention to cross-modal pre-training methods. In this paper, we propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder. Leveraging hidden-unit as an interface to align speech and text, we can decompose the speech-to-text model into a speech-to-unit model and a unit-to-text model, which can be jointly pre-trained with unpaired speech and text data respectively. Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks. Experimental results show that SpeechUT gets substantial improvements over strong baselines, and achieves state-of-the-art performance on both the LibriSpeech ASR and MuST-C ST tasks. To better understand the proposed SpeechUT, detailed analyses are conducted. The code and pre-trained models are available at https://aka.ms/SpeechUT.
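The decomposition described in the abstract can be illustrated with a small structural sketch. It is not the authors' implementation: the module names, the task names, and the assignment of the shared unit encoder to both sub-models are assumptions made here for illustration, based only on the abstract's wording.

```python
# Illustrative sketch only; all names below are assumptions, not the authors' code.

# The three components named in the abstract, in speech-to-text order.
PIPELINE = ["speech_encoder", "unit_encoder", "text_decoder"]

# Assumed assignment of components to the two sub-models, each of which
# is pre-trained on its own kind of unpaired data.
SUB_MODELS = {
    "speech_to_unit": {
        "modules": ["speech_encoder", "unit_encoder"],
        "data": "unpaired speech",
    },
    "unit_to_text": {
        "modules": ["unit_encoder", "text_decoder"],
        "data": "unpaired text",
    },
}


def shared_modules(sub_models):
    """Return the modules used by every sub-model, in pipeline order."""
    module_sets = [set(spec["modules"]) for spec in sub_models.values()]
    common = set.intersection(*module_sets)
    return [name for name in PIPELINE if name in common]


def composed_pipeline(sub_models):
    """Chain the sub-models end to end, listing each module once."""
    ordered = []
    for spec in sub_models.values():
        for name in spec["modules"]:
            if name not in ordered:
                ordered.append(name)
    return ordered


if __name__ == "__main__":
    print("shared:", shared_modules(SUB_MODELS))
    print("speech-to-text path:", composed_pipeline(SUB_MODELS))
```

Under these assumptions, the unit encoder is the only component that both sub-models share, which is how pre-training on unpaired speech and unpaired text can still produce one connected speech-to-text model for fine-tuning.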
