CL · AI · LG · Dec 26, 2019

Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation

arXiv:1912.11739v2999 citations
Originality: Incremental advance
AI Analysis

This work addresses the data scarcity problem facing researchers and practitioners working on spoken language translation of lectures, though its contribution to domain adaptation methods is incremental.

The authors tackled the lack of parallel corpora for lectures translation by developing a language-independent framework to mine such data from Coursera lectures, extracting about 40,000 Japanese-English sentence pairs, and showed that using this mined corpus with multistage fine-tuning significantly improves translation quality.

Lectures translation is a case of spoken language translation and there is a lack of publicly available parallel corpora for this purpose. To address this, we examine a language independent framework for parallel corpus mining which is a quick and effective way to mine a parallel corpus from publicly available lectures at Coursera. Our approach determines sentence alignments, relying on machine translation and cosine similarity over continuous-space sentence representations. We also show how to use the resulting corpora in a multistage fine-tuning based domain adaptation for high-quality lectures translation. For Japanese-English lectures translation, we extracted parallel data of approximately 40,000 lines and created development and test sets through manual filtering for benchmarking translation performance. We demonstrate that the mined corpus greatly enhances the quality of translation when used in conjunction with out-of-domain parallel corpora via multistage training. This paper also suggests some guidelines to gather and clean corpora, mine parallel sentences, address noise in the mined data, and create high-quality evaluation splits. For the sake of reproducibility, we will release our code for parallel data creation.
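
The alignment step described in the abstract (machine-translate the source side, then compare sentences by cosine similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words `embed` function is a toy stand-in for the continuous-space sentence representations the paper uses, and the `align` helper, its greedy best-match strategy, and the `threshold` value of 0.5 are all assumptions made for illustration.

```python
import math
from collections import Counter


def embed(sentence):
    """Toy stand-in for a sentence representation: a bag-of-words count vector."""
    return Counter(sentence.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def align(translated_src, tgt, threshold=0.5):
    """Pair each machine-translated source sentence with its most similar
    target sentence, keeping the pair only if the score reaches `threshold`
    (a hypothetical cut-off, not a value from the paper)."""
    pairs = []
    for i, s in enumerate(translated_src):
        s_vec = embed(s)
        best_j, best_score = -1, 0.0
        for j, t in enumerate(tgt):
            score = cosine(s_vec, embed(t))
            if score > best_score:
                best_j, best_score = j, score
        if best_j >= 0 and best_score >= threshold:
            pairs.append((i, best_j, best_score))
    return pairs


# Source sentences are assumed to be already machine-translated into English.
translated = [
    "welcome to the course",
    "today we discuss neural networks",
    "unrelated filler text",
]
target = [
    "today we will discuss neural networks",
    "welcome to this course",
    "see you next week",
]
pairs = align(translated, target)
```

In this toy run the third translated sentence shares no words with any target sentence, so it is dropped, which mirrors how a similarity threshold can filter noisy candidates before the mined pairs are used for fine-tuning.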

Code Implementations: 1 repo
