CL LG SD AS Apr 27, 2023

Understanding Shared Speech-Text Representations

Gary Wang, Kyle Kastner, Ankur Bapna, Zhehuai Chen, Andrew Rosenberg, Bhuvana Ramabhadran, Yu Zhang

arXiv:2304.14514v13.38 citations h-index: 51

Originality: Synthesis-oriented

AI Analysis

This work provides incremental insights into improving automatic speech recognition and speech translation by analyzing shared representations, benefiting researchers in speech processing.

The paper investigates the properties of shared speech-text representations in models like Maestro, finding that a corpus-specific duration model is crucial for effective speech-text alignment and that the shared encoder produces more compact and overlapping representations than unimodal encoders.

Recently, a number of approaches to train speech models by incorporating text into end-to-end models have been developed, with Maestro advancing state-of-the-art automatic speech recognition (ASR) and Speech Translation (ST) performance. In this paper, we expand our understanding of the resulting shared speech-text representations with two types of analyses. First, we examine the limits of speech-free domain adaptation, finding that a corpus-specific duration model for speech-text alignment is the most important component for learning a shared speech-text representation. Second, we inspect the similarities between activations of unimodal (speech or text) encoders as compared to the activations of a shared encoder. We find that the shared encoder learns a more compact and overlapping speech-text representation than the unimodal encoders. We hypothesize that this partially explains the effectiveness of the Maestro shared speech-text representations.
