CL · Nov 28, 2020

Using Multiple Subwords to Improve English-Esperanto Automated Literary Translation Quality

Alberto Poncelas, Jan Buts, James Hadley, Andy Way

arXiv:2011.14190v1

Originality: Synthesis-oriented

AI Analysis

This work provides an incremental improvement for researchers and developers working on machine translation for low-resource language pairs, specifically English-Esperanto.

This paper addresses the challenge of low-resource machine translation by proposing a method to expand the available training data. The authors process the same parallel sentences multiple times, varying the subword segmentation each time by using different Byte Pair Encoding models, and apply the approach to English-Esperanto literary translation.

Building Machine Translation (MT) systems for low-resource languages remains challenging. For many language pairs, parallel data are not widely available, and in such cases MT models do not achieve results comparable to those seen with high-resource languages. When data are scarce, it is of paramount importance to make optimal use of the limited material available. To that end, in this paper we propose employing the same parallel sentences multiple times, only changing the way the words are split each time. For this purpose we use several Byte Pair Encoding models, with various merge operations used in their configuration. In our experiments, we use this technique to expand the available data and improve an MT system involving a low-resource language pair, namely English-Esperanto. As an additional contribution, we made available a set of English-Esperanto parallel data in the literary domain.
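The idea of reusing the same sentence pairs under several subword segmentations can be illustrated with a small, self-contained sketch. This is not the authors' pipeline: the toy BPE learner, the function names (`learn_bpe`, `segment`, `multi_bpe_corpus`), the `@@` continuation marker, the separate source and target BPE models, the toy sentences, and the merge counts `[0, 5, 50]` are all assumptions made for illustration.

```python
from collections import Counter

END = "</w>"  # end-of-word marker used while learning merges


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of sentences."""
    vocab = Counter()
    for sentence in corpus:
        for word in sentence.split():
            vocab[tuple(word) + (END,)] += 1
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:  # every word is already a single symbol
            break
        # Most frequent pair; ties broken lexicographically for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_pair(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def segment(sentence, merges):
    """Split a sentence into subwords, marking non-final pieces with '@@'."""
    pieces = []
    for word in sentence.split():
        symbols = list(word) + [END]
        for pair in merges:  # apply merges in the order they were learned
            symbols = merge_pair(symbols, pair)
        # Drop the end-of-word marker before rendering.
        if symbols[-1] == END:
            symbols = symbols[:-1]
        else:
            symbols[-1] = symbols[-1][: -len(END)]
        pieces.extend([s + "@@" for s in symbols[:-1]] + [symbols[-1]])
    return " ".join(pieces)


def multi_bpe_corpus(src, tgt, merge_settings):
    """Repeat the parallel corpus once per merge setting, re-segmented each time."""
    out = []
    for n in merge_settings:
        src_merges = learn_bpe(src, n)  # assumption: separate models per language
        tgt_merges = learn_bpe(tgt, n)
        for s, t in zip(src, tgt):
            out.append((segment(s, src_merges), segment(t, tgt_merges)))
    return out


# Toy English-Esperanto pairs (illustrative only, not from the released dataset).
src = ["the dog sleeps", "the cat sleeps"]
tgt = ["la hundo dormas", "la kato dormas"]
merge_settings = [0, 5, 50]  # hypothetical merge-operation counts

augmented = multi_bpe_corpus(src, tgt, merge_settings)
for s, t in augmented:
    print(s, "|||", t)
```

Each merge setting contributes one full copy of the corpus, so the training set grows by a factor equal to the number of BPE configurations while the underlying sentences stay the same. In practice an established BPE implementation would replace the toy learner here.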
