CL | AS | Sep 17, 2019

Bridging the Gap between Pre-Training and Fine-Tuning for End-to-End Speech Translation

arXiv:1909.07575v3 | 91 citations
AI Analysis

The paper addresses a specific bottleneck in end-to-end speech translation, the mismatch between pre-training and fine-tuning, which is relevant to researchers and practitioners. The contribution appears incremental, since it builds on existing pre-training methods.

The paper tackles the gap between pre-training and fine-tuning in end-to-end speech translation by proposing a Tandem Connectionist Encoding Network (TCEN) that reuses subnets, maintains role consistency, and pre-trains attention, achieving a 2.2 BLEU improvement over baselines on a large benchmark dataset.

End-to-end speech translation, a hot topic in recent years, aims to translate a segment of audio into a specific language with an end-to-end model. Conventional approaches employ multi-task learning and pre-training methods for this task, but they suffer from the huge gap between pre-training and fine-tuning. To address this issue, we propose a Tandem Connectionist Encoding Network (TCEN) which bridges the gap by reusing all subnets in fine-tuning, keeping the roles of subnets consistent, and pre-training the attention module. Furthermore, we propose two simple but effective methods to guarantee that the speech encoder outputs and the MT encoder inputs are consistent in terms of semantic representation and sequence length. Experimental results show that our model outperforms the baselines by 2.2 BLEU on a large benchmark dataset.
