Improving End-to-end Speech Translation by Leveraging Auxiliary Speech and Text Data
This work improves speech translation performance for applications requiring multilingual audio processing, but it is incremental as it builds on existing pre-trained systems.
The authors tackled the problem of adapting pre-trained speech translation systems to leverage abundant source-language text data, achieving new state-of-the-art results on MuST-C En-De, En-Fr, and LibriSpeech En-Fr tasks.
We present a method for introducing a text encoder into pre-trained end-to-end speech translation systems. It improves the system's ability to adapt one modality (i.e., source-language speech) to another (i.e., source-language text). Thus, the speech translation model can learn from both unlabeled and labeled data, especially when source-language text data is abundant. Beyond this, we present a denoising method to build a robust text encoder that can deal with both normal and noisy text data. Our system sets a new state of the art on the MuST-C En-De, En-Fr, and LibriSpeech En-Fr tasks.
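As a rough illustration of the denoising idea, the sketch below corrupts clean source-language text and pairs each noisy sequence with its clean original, so that a text encoder could be trained to handle both. The abstract does not say which kind of noise is used; the token dropping and masking here, along with the names `add_noise`, `make_denoising_pairs`, `drop_prob`, `mask_prob`, and `<mask>`, are assumptions made for this example and not the authors' procedure.

```python
import random


def add_noise(tokens, drop_prob=0.1, mask_prob=0.1, mask_token="<mask>", rng=None):
    """Return a corrupted copy of `tokens`.

    Assumed noise model (not taken from the paper): each token is
    independently dropped with probability `drop_prob`, replaced by
    `mask_token` with probability `mask_prob`, and kept otherwise.
    """
    rng = rng or random.Random()
    noisy = []
    for tok in tokens:
        r = rng.random()  # uniform in [0.0, 1.0)
        if r < drop_prob:
            continue  # drop the token
        if r < drop_prob + mask_prob:
            noisy.append(mask_token)  # mask the token
        else:
            noisy.append(tok)  # keep the token unchanged
    # Avoid returning an empty sequence when every token was dropped.
    return noisy or [mask_token]


def make_denoising_pairs(sentences, **noise_kwargs):
    """Build (noisy_tokens, clean_tokens) pairs from whitespace-tokenized text."""
    pairs = []
    for sentence in sentences:
        clean = sentence.split()
        pairs.append((add_noise(clean, **noise_kwargs), clean))
    return pairs


if __name__ == "__main__":
    rng = random.Random(0)
    for noisy, clean in make_denoising_pairs(
        ["the cat sat on the mat"], drop_prob=0.2, mask_prob=0.2, rng=rng
    ):
        print("clean:", clean)
        print("noisy:", noisy)
```

In a setup like this, the noisy side would be fed to the text encoder while the clean side serves as the reference, and the two probabilities control how aggressive the corruption is.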