CL · AS · Sep 17, 2019

Bridging the Gap between Pre-Training and Fine-Tuning for End-to-End Speech Translation

Chengyi Wang, Yu Wu, Shujie Liu, Zhenglu Yang, Ming Zhou

arXiv:1909.07575v3 · 6.191 citations

Originality: Incremental advance

AI Analysis

This paper addresses a specific bottleneck in speech translation that is relevant to researchers and practitioners, though the advance appears incremental because it builds on existing pre-training methods.

The paper tackles the gap between pre-training and fine-tuning in end-to-end speech translation by proposing a Tandem Connectionist Encoding Network (TCEN) that reuses subnets, maintains role consistency, and pre-trains attention, achieving a 2.2 BLEU improvement over baselines on a large benchmark dataset.

End-to-end speech translation, a hot topic in recent years, aims to translate a segment of audio into a specific language with an end-to-end model. Conventional approaches employ multi-task learning and pre-training methods for this task, but they suffer from the huge gap between pre-training and fine-tuning. To address this issue, we propose a Tandem Connectionist Encoding Network (TCEN), which bridges the gap by reusing all subnets in fine-tuning, keeping the roles of subnets consistent, and pre-training the attention module. Furthermore, we propose two simple but effective methods to guarantee that the speech encoder outputs and the MT encoder inputs are consistent in terms of semantic representation and sequence length. Experimental results show that our model outperforms the baselines by 2.2 BLEU on a large benchmark dataset.
